About

Much of the world's data exists in a rectangular format with rows and columns. Several terms are used to describe this kind of data:

  1. Table
  2. Data frame
  3. Structured data
  4. Spreadsheet

Pandas is one of the most widely used data manipulation libraries in Python for structured datasets. Below is a summary of the key operations, named by their SQL equivalents, that are part of almost any data analysis project:

  1. Select column references
  2. Select scalar expression
  3. Where
  4. Group By
  5. Aggregation
  6. Order By
  7. Window functions
  8. Join

When I started using pandas, I realized that there are multiple ways to perform the same operation. The code I was writing was also not as elegant as SQL queries, and it was hard to debug. In this blog post I will share examples of how to perform the SQL operations listed above in pandas, and how to write pandas code that is readable and easy to maintain.
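To make the "multiple ways" point concrete, here is a small sketch of mine (not from the original post): all four of the lines below select a single column from a DataFrame df, such as the tips data loaded next, and mixing these styles in one codebase is exactly what hurts readability.

#Four interchangeable ways to get the 'tip' column
df['tip']         #bracket indexing
df.tip            #attribute access; breaks for names with spaces
df.loc[:,'tip']   #label-based indexing (the style used throughout this post)
df.iloc[:,1]      #position-based indexing, assuming 'tip' is the second column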

import pandas as pd
import numpy as np

#Load the tips dataset from the seaborn-data repository
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
pd.options.display.max_rows = 20 #limit the number of rows displayed
df.head(5)
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Select columns

Use loc with a list of column names to select columns. I recommend this syntax because it offers more flexibility in data analysis tasks:

.loc[:,['col1','col2']]

Select the tip and sex columns from the data. Note: we are using method chaining to perform operations one after another.

(df
 .loc[:,['tip','sex']]
 .head()
)
tip sex
0 1.01 Female
1 1.66 Male
2 3.50 Male
3 3.31 Male
4 3.61 Female

Select only the columns starting with the letter 't'. Because loc accepts any list of column names, a simple list comprehension like this enables complex select operations in pandas.

(df
 .loc[:,[col for col in df.columns if col.startswith('t')]]
 .head()
)
total_bill tip time
0 16.99 1.01 Dinner
1 10.34 1.66 Dinner
2 21.01 3.50 Dinner
3 23.68 3.31 Dinner
4 24.59 3.61 Dinner
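As an aside (my own sketch, not from the original post), the same loc syntax extends further: the row selector also accepts boolean masks and callables, so row filtering and column selection can be combined in a single chained step.

(df
 .loc[lambda d: d['tip']>5, ['total_bill','tip']] #callable row selector plus column list
 .head()
)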

Column manipulation (scalar expressions)

Use assign to add new columns or update existing ones:

.assign(new_col=1)
.assign(new_col=lambda x:x['col']+1)
.assign(old_col=lambda x:x['old_col']+1)

(df
 .loc[:,['total_bill','tip','sex','day','time']]
 .assign(percentage_tip=lambda x:x['tip']/x['total_bill']) #add a new column
 .assign(tip=lambda x:x['tip']+1) #update an existing column
 .assign(count=1) #add a constant value
 .head()
)
total_bill tip sex day time percentage_tip count
0 16.99 2.01 Female Sun Dinner 0.059447 1
1 10.34 2.66 Male Sun Dinner 0.160542 1
2 21.01 4.50 Male Sun Dinner 0.166587 1
3 23.68 4.31 Male Sun Dinner 0.139780 1
4 24.59 4.61 Female Sun Dinner 0.146808 1
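A usage note of my own: assign accepts several keyword arguments in a single call, and since they are evaluated in order, later arguments can refer to columns created earlier in the same call (supported since pandas 0.23).

#One assign call; percentage_tip is visible to is_generous within the same call
(df
 .assign(percentage_tip=lambda x:x['tip']/x['total_bill']
        ,is_generous=lambda x:x['percentage_tip']>0.15)
 .head()
)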

Filter rows (where)

Use query to filter rows in pandas:

val=10
.query("col1>=10")
.query("col1>=@val") #@ refers to a Python variable
.query(f"col1>={val}") #f-string interpolation
.query("col1.isin(['a','b'])",engine='python')

#filter only transactions with more than a 15% tip
(df
 .loc[:,['total_bill','tip','sex','day','time']]
 .assign(percentage_tip=lambda x:x['tip']/x['total_bill'])
 .query("percentage_tip>.15")
 .head()
)
total_bill tip sex day time percentage_tip
1 10.34 1.66 Male Sun Dinner 0.160542
2 21.01 3.50 Male Sun Dinner 0.166587
5 25.29 4.71 Male Sun Dinner 0.186240
6 8.77 2.00 Male Sun Dinner 0.228050
9 14.78 3.23 Male Sun Dinner 0.218539
per_tip=.15

#using @ within query to refer to a Python variable in the filter
display(df
 .loc[:,['total_bill','tip','sex','day','time']]
 .assign(percentage_tip=lambda x:x['tip']/x['total_bill'])
 .query("percentage_tip>@per_tip")
 .head()
)

#using an f-string to perform the same filter
display(df
 .loc[:,['total_bill','tip','sex','day','time']]
 .assign(percentage_tip=lambda x:x['tip']/x['total_bill'])
 .query(f"percentage_tip>{per_tip}")
 .head()
)

total_bill tip sex day time percentage_tip
1 10.34 1.66 Male Sun Dinner 0.160542
2 21.01 3.50 Male Sun Dinner 0.166587
5 25.29 4.71 Male Sun Dinner 0.186240
6 8.77 2.00 Male Sun Dinner 0.228050
9 14.78 3.23 Male Sun Dinner 0.218539
total_bill tip sex day time percentage_tip
1 10.34 1.66 Male Sun Dinner 0.160542
2 21.01 3.50 Male Sun Dinner 0.166587
5 25.29 4.71 Male Sun Dinner 0.186240
6 8.77 2.00 Male Sun Dinner 0.228050
9 14.78 3.23 Male Sun Dinner 0.218539
#Filter only transactions that happened on Sunday and Monday
(df
 .loc[:,['total_bill','tip','sex','day','time']]
 .query("day.isin(['Sun','Mon'])",engine='python')
 .head()
)
total_bill tip sex day time
0 16.99 1.01 Female Sun Dinner
1 10.34 1.66 Male Sun Dinner
2 21.01 3.50 Male Sun Dinner
3 23.68 3.31 Male Sun Dinner
4 24.59 3.61 Female Sun Dinner
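One more query tip, added by me: column names that are not valid Python identifiers, such as names containing spaces, can be referenced inside query by wrapping them in backticks (supported since pandas 0.25). The renamed column below is hypothetical, purely for illustration.

#Backtick-quote a column name containing a space inside query
(df
 .rename(columns={'total_bill':'total bill'}) #hypothetical name with a space
 .query("`total bill` > 40")
 .head()
)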

Group By and Aggregation

Use groupby with named aggregations to perform any type of aggregation. The aggregation functions are flexible: you can pass in built-in function names, lambda functions, or numpy functions.

#By day get average and total bill
(df
 .groupby(['day'])
 .agg(avg_bill=('total_bill','mean')
     ,total_bill=('total_bill','sum')) #multiple column aggregations supported
 .reset_index()
)
day avg_bill total_bill
0 Fri 17.151579 325.88
1 Sat 20.441379 1778.40
2 Sun 21.410000 1627.16
3 Thur 17.682742 1096.33
#By day get average of total_bill using: string function names, lambda functions, numpy functions
(df
 .groupby(['day'])
 .agg(avg_bill_mean=('total_bill','mean')
     ,avg_bill_lambda=('total_bill',lambda x:x.mean()) #using lambda functions
     ,avg_bill_np=('total_bill',np.mean)) #using numpy functions 
 .reset_index()
)
day avg_bill_mean avg_bill_lambda avg_bill_np
0 Fri 17.151579 17.151579 17.151579
1 Sat 20.441379 20.441379 20.441379
2 Sun 21.410000 21.410000 21.410000
3 Thur 17.682742 17.682742 17.682742

Ordering rows

Most data analysis tasks require sorting, either as a preprocessing step or as a final step to display output. This can be done in pandas using the sort_values function.

Use sort_values to order a pandas data frame by the specified column(s):

.sort_values(['col1','col2'],ascending=[True,False])
#By day get average and total bill. Sort the output by total_bill
(df
 .groupby(['day'])
 .agg(avg_bill=('total_bill','mean')
     ,total_bill=('total_bill','sum'))
 .reset_index()
 .sort_values(['total_bill']) #Ascending by default
)
day avg_bill total_bill
0 Fri 17.151579 325.88
3 Thur 17.682742 1096.33
2 Sun 21.410000 1627.16
1 Sat 20.441379 1778.40
#By day get average and total bill. Sort the output by total_bill
(df
 .groupby(['day'])
 .agg(avg_bill=('total_bill','mean')
     ,total_bill=('total_bill','sum'))
 .reset_index()
 .sort_values(['total_bill'],ascending=[False]) #Descending order
)
day avg_bill total_bill
1 Sat 20.441379 1778.40
2 Sun 21.410000 1627.16
3 Thur 17.682742 1096.33
0 Fri 17.151579 325.88
#By day get average and total bill. Sort the output by total_bill and avg_bill
(df
 .groupby(['day'])
 .agg(avg_bill=('total_bill','mean')
     ,total_bill=('total_bill','sum'))
 .reset_index()
 .sort_values(['total_bill','avg_bill'],ascending=[False,True]) #Multiple columns: total_bill descending, avg_bill ascending
)
day avg_bill total_bill
1 Sat 20.441379 1778.40
2 Sun 21.410000 1627.16
3 Thur 17.682742 1096.33
0 Fri 17.151579 325.88

Window functions

Window functions are very powerful in the SQL world. Here we will see how to reproduce the following in pandas: row_number(), lead()/lag(), and running sums within each group (partition).

#Equivalent of row_number() over(partition by day order by total_bill asc) as row_number
(df
 .assign(row_number=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day']).cumcount()+1)
 .sort_values(['row_number'])
 .head()
)
total_bill tip sex smoker day time size row_number
67 3.07 1.00 Female Yes Sat Dinner 1 1
172 7.25 5.15 Male Yes Sun Dinner 2 1
149 7.51 2.00 Male No Thur Lunch 2 1
92 5.75 1.00 Female Yes Fri Dinner 2 1
222 8.58 1.92 Male Yes Fri Lunch 1 2
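An equivalent sketch of mine, not from the original post: rank with method='first' is arguably the most direct analogue of SQL's row_number(), since it numbers rows 1..n within each group in order of the sort column.

#rank(method='first') numbers rows 1..n within each day, ordered by total_bill
(df
 .assign(row_number=lambda x:x.groupby(['day'])['total_bill'].rank(method='first',ascending=True).astype(int))
 .sort_values(['row_number'])
 .head()
)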
#Equivalent of lag(total_bill) over(partition by day order by total_bill asc) as previous_bill
(df
 .assign(row_number=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day']).cumcount()+1)
 .assign(prev_bill=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day'])['total_bill'].shift(1))
 .sort_values(['row_number'])
 .head()
)
total_bill tip sex smoker day time size row_number prev_bill
67 3.07 1.00 Female Yes Sat Dinner 1 1 NaN
172 7.25 5.15 Male Yes Sun Dinner 2 1 NaN
149 7.51 2.00 Male No Thur Lunch 2 1 NaN
92 5.75 1.00 Female Yes Fri Dinner 2 1 NaN
222 8.58 1.92 Male Yes Fri Lunch 1 2 5.75
#Equivalent of lead(total_bill) over(partition by day order by total_bill asc) as next_bill
(df
 .assign(row_number=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day']).cumcount()+1)
 .assign(next_bill=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day'])['total_bill'].shift(-1))
 .sort_values(['row_number'])
 .head()
)
total_bill tip sex smoker day time size row_number next_bill
67 3.07 1.00 Female Yes Sat Dinner 1 1 7.25
172 7.25 5.15 Male Yes Sun Dinner 2 1 8.77
149 7.51 2.00 Male No Thur Lunch 2 1 7.56
92 5.75 1.00 Female Yes Fri Dinner 2 1 8.58
222 8.58 1.92 Male Yes Fri Lunch 1 2 10.09
#Equivalent of sum(total_bill) over(partition by day) as sum_bill_day
#Equivalent of sum(tip) over(partition by day order by total_bill asc) as cum_tip_day
#Equivalent of sum(tip) over(partition by day order by total_bill asc rows between 1 preceding and current row) as rolling_2_sum

(df
 .assign(sum_bill_day=lambda x:x.groupby(['day'])['total_bill'].transform('sum'))
 .assign(cum_tip_day=lambda x:x.sort_values(['total_bill']).groupby(['day'])['tip'].cumsum())
 .assign(rolling_2_sum=lambda x:x.sort_values(['total_bill']).groupby(['day'])['tip'].rolling(2).sum().reset_index(drop=True, level=0))
 .query("day=='Sat'")
 .sort_values(['total_bill'])
 .head()
)
total_bill tip sex smoker day time size sum_bill_day cum_tip_day rolling_2_sum
67 3.07 1.00 Female Yes Sat Dinner 1 1778.4 1.00 NaN
111 7.25 1.00 Female No Sat Dinner 1 1778.4 2.00 2.00
218 7.74 1.44 Male Yes Sat Dinner 2 1778.4 3.44 2.44
30 9.55 1.45 Male No Sat Dinner 2 1778.4 4.89 2.89
235 10.07 1.25 Male No Sat Dinner 2 1778.4 6.14 2.70
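If the NaN in the first row of each group is unwanted, rolling accepts a min_periods argument (note the spelling) that allows partial windows. A minimal sketch of mine:

#min_periods=1 produces a value even for the first row in each group
(df
 .sort_values(['total_bill'])
 .groupby(['day'])['tip']
 .rolling(2,min_periods=1)
 .sum()
 .head()
)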

Conclusion

In this blog post I have shared some simple tips that will improve the efficiency of any data analysis project. I plan to update this post with more examples to make data analysis in pandas easier.