The Pandas Reference
A tutorial on how to write clean pandas code to perform data analysis.
About
Much data exists in a rectangular format with rows and columns. Different terms are used to describe this kind of data:
- Table
- Data frame
- Structured data
- Spreadsheets
Pandas is one of the most widely used data manipulation libraries in Python for structured datasets. Below is a summary of the key operations (and their SQL equivalents) that are part of any essential data analysis project:
- Select column references
- Select scalar expression
- Where
- Group By
- Aggregation
- Order By
- Window functions
- Join
When I started using pandas, I realized that there are multiple ways to perform the same operation. Also, the code I was writing was not as elegant as SQL queries and was hard to debug. In this blog post I will share examples of how to perform the above-mentioned SQL operations in pandas and write pandas code that is readable and easy to maintain.
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
pd.options.display.max_rows = 20
df.head(5)
Select the total_bill and tip columns from the data. Note: we are using method chaining to perform operations one after another.
(df
.loc[:,['total_bill','tip']]
.head()
)
Select only the columns starting with the letter 't'. This simple and readable syntax enables one to perform complex select operations in pandas.
(df
.loc[:,[col for col in df.columns if col.startswith('t')]]
.head()
)
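An alternative to the list comprehension above is the built-in DataFrame.filter method, which selects columns by label pattern. A minimal sketch on a small stand-in frame (the column names mimic the tips dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34],
    "tip": [1.01, 1.66],
    "sex": ["Female", "Male"],
    "time": ["Dinner", "Dinner"],
})
# filter(regex=...) keeps only columns whose names match the regular expression
subset = df.filter(regex="^t")
print(subset.columns.tolist())
```

filter also accepts like='t' for substring matches, but note that like is a contains-match, not a prefix match.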
(df
.loc[:,['total_bill','tip','sex','day','time']]
.assign(percentage_tip=lambda x:x['tip']/x['total_bill']) #add new column
.assign(tip=lambda x:x['tip']+1) # update existing column
.assign(count=1) #add constant value
.head()
)
#filter only transactions with more than 15% in tips
(df
.loc[:,['total_bill','tip','sex','day','time']]
.assign(percentage_tip=lambda x:x['tip']/x['total_bill'])
.query("percentage_tip>.15")
.head()
)
per_tip=.15
#using @ within query to refer to a variable in the filter
display(df
.loc[:,['total_bill','tip','sex','day','time']]
.assign(percentage_tip=lambda x:x['tip']/x['total_bill'])
.query("percentage_tip>@per_tip")
.head()
)
#using an f-string to perform filtering
display(df
.loc[:,['total_bill','tip','sex','day','time']]
.assign(percentage_tip=lambda x:x['tip']/x['total_bill'])
.query(f"percentage_tip>{per_tip}")
.head()
)
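query can also reference column names that contain spaces or other special characters by wrapping them in backticks (supported since pandas 0.25). A small sketch with a hypothetical column name, not from the tips dataset:

```python
import pandas as pd

df = pd.DataFrame({"total bill": [10.0, 25.0, 40.0], "tip": [1.0, 5.0, 9.0]})
# backticks let query refer to a column whose name contains a space
result = df.query("`total bill` > 20")
print(result)
```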
#Filter only transactions that happened on Saturday and Sunday
(df
.loc[:,['total_bill','tip','sex','day','time']]
.query("day.isin(['Sat','Sun'])",engine='python')
.head()
)
#By day get average and total bill
(df
.groupby(['day'])
.agg(avg_bill=('total_bill','mean')
,total_bill=('total_bill','sum')) #multiple column aggregations supported
.reset_index()
)
#By day get average of total bill using: string function names, lambda functions, and numpy functions
(df
.groupby(['day'])
.agg(avg_bill_mean=('total_bill','mean')
,avg_bill_lambda=('total_bill',lambda x:x.mean()) #using lambda functions
,avg_bill_np=('total_bill',np.mean)) #using numpy functions
.reset_index()
)
Ordering rows
Most data analysis tasks require sorting, either as a preprocessing step or as a final step to display the output. This can be done in pandas using the sort_values function.
Use sort_values to order a pandas data frame along the specified columns/axis:
.sort_values(['col1','col2'],ascending=[True,False])
#By day get average and total bill. Sort the output by total_bill
(df
.groupby(['day'])
.agg(avg_bill=('total_bill','mean')
,total_bill=('total_bill','sum'))
.reset_index()
.sort_values(['total_bill']) #Default is ascending
)
#By day get average and total bill. Sort the output by total_bill
(df
.groupby(['day'])
.agg(avg_bill=('total_bill','mean')
,total_bill=('total_bill','sum'))
.reset_index()
.sort_values(['total_bill'],ascending=[False]) #By descending order
)
#By day get average and total bill. Sort the output by total_bill and avg_bill
(df
.groupby(['day'])
.agg(avg_bill=('total_bill','mean')
,total_bill=('total_bill','sum'))
.reset_index()
.sort_values(['total_bill','avg_bill'],ascending=[False,True]) #By multiple columns, one ascending and the other descending
)
#Equivalent of row_number() over(partition by day order by total_bill asc) as row_number
(df
.assign(row_number=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day']).cumcount()+1)
.sort_values(['row_number'])
.head()
)
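Another way to emulate ROW_NUMBER() is groupby followed by rank(method='first'), which numbers ties in order of appearance. A minimal sketch on a small stand-in frame rather than the full tips dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat", "Sat", "Sat"],
    "total_bill": [20.0, 10.0, 30.0, 15.0, 25.0],
})
# rank(method='first') numbers rows 1..n within each day by ascending total_bill
df["row_number"] = (
    df.groupby("day")["total_bill"].rank(method="first", ascending=True).astype(int)
)
print(df)
```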
#Equivalent of lag(total_bill) over(partition by day order by total_bill asc) as previous_bill
(df
.assign(row_number=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day']).cumcount()+1)
.assign(prev_bill=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day'])['total_bill'].shift(1))
.sort_values(['row_number'])
.head()
)
#Equivalent of lead(total_bill) over(partition by day order by total_bill asc) as next_bill
(df
.assign(row_number=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day']).cumcount()+1)
.assign(next_bill=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day'])['total_bill'].shift(-1))
.sort_values(['row_number'])
.head()
)
#Equivalent of sum(total_bill) over(partition by day) as sum_bill_day
#Equivalent of sum(tip) over(partition by day order by total bill asc) as cum_tip_day
#Equivalent of sum(tip) over(partition by day order by total_bill rows between 2 preceding and current row) as rolling_3d_sum
(df
.assign(sum_bill_day=lambda x:x.groupby(['day'])['total_bill'].transform('sum'))
.assign(cum_tip_day=lambda x:x.sort_values(['total_bill']).groupby(['day'])['tip'].cumsum())
.assign(rolling_3d_sum=lambda x:x.sort_values(['total_bill']).groupby(['day'])['tip'].rolling(3,min_periods=1).sum().reset_index(drop=True, level=0))
.query("day=='Sat'")
.sort_values(['total_bill'])
.head()
)
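The operation list at the top also includes Join, which the examples above do not cover. Below is a minimal sketch of an SQL-style LEFT JOIN using merge, with a hypothetical day-level lookup table (day_type is not part of the tips dataset):

```python
import pandas as pd

tips = pd.DataFrame({"day": ["Sun", "Sat", "Thur"], "total_bill": [20.0, 30.0, 15.0]})
# hypothetical lookup table mapping day codes to a weekday/weekend label
day_type = pd.DataFrame({"day": ["Sat", "Sun"], "day_type": ["weekend", "weekend"]})

# Equivalent of: SELECT ... FROM tips LEFT JOIN day_type USING (day)
joined = tips.merge(day_type, on="day", how="left")
print(joined)
```

With how='left', days missing from the lookup table (here Thur) get NaN in day_type, matching SQL LEFT JOIN semantics.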