The Pandas Reference
A tutorial on how to write clean pandas code to perform data analysis.
About
Much data exists in a rectangular format with rows and columns. Different terms are used to describe this kind of data:
- Table
- Data frame
- Structured data
- Spreadsheets
Pandas is one of the most widely used data manipulation libraries in Python for structured datasets. Below is a summary of the key operations (and their SQL equivalents) that are part of any essential data analysis project:
- Select column references
- Select scalar expression
- Where
- Group By
- Aggregation
- Order By
- Window functions
- Join
When I started using pandas, I realized that there are multiple ways to perform the same operation. Also, the code I was writing was not as elegant as SQL queries and was hard to debug. In this blog post I will share examples of how to perform the above-mentioned SQL operations in pandas and write pandas code that is readable and easy to maintain.
import pandas as pd
import numpy as np
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
pd.options.display.max_rows = 20
df.head(5)
Select the total_bill and tip columns from the data. Note: we are using method chaining to perform operations one after another.
(df
.loc[:,['total_bill','tip']]
.head()
)
Select only the columns starting with the letter 't'. This simple and readable syntax enables one to perform complex select operations in pandas.
(df
.loc[:,[col for col in df.columns if col.startswith('t')]]
.head()
)
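An alternative to the list comprehension above is the built-in DataFrame.filter method, which selects columns by label pattern. A minimal sketch on a small stand-in frame (the column names mimic the tips dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34],
    "tip": [1.01, 1.66],
    "sex": ["Female", "Male"],
    "time": ["Dinner", "Dinner"],
})
# filter(regex=...) keeps only columns whose names match the regular expression
subset = df.filter(regex="^t")
print(subset.columns.tolist())
```

filter also accepts like='t' for substring matches, but note that like is a contains-match, not a prefix match.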
(df
.loc[:,['total_bill','tip','sex','day','time']]
.assign(percentage_tip=lambda x:x['tip']/x['total_bill']) #add new column
.assign(tip=lambda x:x['tip']+1) # update existing column
.assign(count=1) #add constant value
.head()
)
#filter only transactions with more than 15% in tips
(df
.loc[:,['total_bill','tip','sex','day','time']]
.assign(percentage_tip=lambda x:x['tip']/x['total_bill'])
.query("percentage_tip>.15")
.head()
)
per_tip=.15
#using @ within query to refer to a variable in the filter
display(df
.loc[:,['total_bill','tip','sex','day','time']]
.assign(percentage_tip=lambda x:x['tip']/x['total_bill'])
.query("percentage_tip>@per_tip")
.head()
)
#using an f-string to perform filtering
display(df
.loc[:,['total_bill','tip','sex','day','time']]
.assign(percentage_tip=lambda x:x['tip']/x['total_bill'])
.query(f"percentage_tip>{per_tip}")
.head()
)
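query can also reference column names that contain spaces or other special characters by wrapping them in backticks (supported since pandas 0.25). A small sketch with a hypothetical column name, not from the tips dataset:

```python
import pandas as pd

df = pd.DataFrame({"total bill": [10.0, 25.0, 40.0], "tip": [1.0, 5.0, 9.0]})
# backticks let query refer to a column whose name contains a space
result = df.query("`total bill` > 20")
print(result)
```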
#Filter only transactions that happened on Saturday and Sunday
(df
.loc[:,['total_bill','tip','sex','day','time']]
.query("day.isin(['Sat','Sun'])",engine='python')
.head()
)
#By day get average and total bill
(df
.groupby(['day'])
.agg(avg_bill=('total_bill','mean')
,total_bill=('total_bill','sum')) #multiple column aggregations supported
.reset_index()
)
#By day get average of total bill using: string function names, lambda functions, and numpy functions
(df
.groupby(['day'])
.agg(avg_bill_mean=('total_bill','mean')
,avg_bill_lambda=('total_bill',lambda x:x.mean()) #using lambda functions
,avg_bill_np=('total_bill',np.mean)) #using numpy functions
.reset_index()
)
Ordering rows
Most data analysis tasks require sorting, either as a preprocessing step or as a final step to display the output. This can be done in pandas using the sort_values function.
Use sort_values to order a pandas data frame along the specified columns/axis:
.sort_values(['col1','col2'],ascending=[True,False])
#By day get average and total bill. Sort the output by total_bill
(df
.groupby(['day'])
.agg(avg_bill=('total_bill','mean')
,total_bill=('total_bill','sum'))
.reset_index()
.sort_values(['total_bill']) #Default is ascending
)
#By day get average and total bill. Sort the output by total_bill
(df
.groupby(['day'])
.agg(avg_bill=('total_bill','mean')
,total_bill=('total_bill','sum'))
.reset_index()
.sort_values(['total_bill'],ascending=[False]) #By descending order
)
#By day get average and total bill. Sort the output by total_bill and avg_bill
(df
.groupby(['day'])
.agg(avg_bill=('total_bill','mean')
,total_bill=('total_bill','sum'))
.reset_index()
.sort_values(['total_bill','avg_bill'],ascending=[False,True]) #By multiple columns, one ascending and the other descending
)
#Equivalent of row_number() over(partition by day order by total_bill asc) as row_number
(df
.assign(row_number=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day']).cumcount()+1)
.sort_values(['row_number'])
.head()
)
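Another way to emulate ROW_NUMBER() is groupby followed by rank(method='first'), which numbers ties in order of appearance. A minimal sketch on a small stand-in frame rather than the full tips dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "day": ["Sun", "Sun", "Sat", "Sat", "Sat"],
    "total_bill": [20.0, 10.0, 30.0, 15.0, 25.0],
})
# rank(method='first') numbers rows 1..n within each day by ascending total_bill
df["row_number"] = (
    df.groupby("day")["total_bill"].rank(method="first", ascending=True).astype(int)
)
print(df)
```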
#Equivalent of lag(total_bill) over(partition by day order by total_bill asc) as previous_bill
(df
.assign(row_number=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day']).cumcount()+1)
.assign(prev_bill=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day'])['total_bill'].shift(1))
.sort_values(['row_number'])
.head()
)
#Equivalent of lead(total_bill) over(partition by day order by total_bill asc) as next_bill
(df
.assign(row_number=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day']).cumcount()+1)
.assign(next_bill=lambda x:x.sort_values(['total_bill'],ascending=[True]).groupby(['day'])['total_bill'].shift(-1))
.sort_values(['row_number'])
.head()
)
#Equivalent of sum(total_bill) over(partition by day) as sum_bill_day
#Equivalent of sum(tip) over(partition by day order by total bill asc) as cum_tip_day
#Equivalent of sum(tip) over(partition by day order by total_bill rows between 2 preceding and current row) as rolling_3d_sum
(df
.assign(sum_bill_day=lambda x:x.groupby(['day'])['total_bill'].transform('sum'))
.assign(cum_tip_day=lambda x:x.sort_values(['total_bill']).groupby(['day'])['tip'].cumsum())
.assign(rolling_3d_sum=lambda x:x.sort_values(['total_bill']).groupby(['day'])['tip'].rolling(3,min_periods=1).sum().reset_index(drop=True, level=0))
.query("day=='Sat'")
.sort_values(['total_bill'])
.head()
)
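The operation list at the top also includes Join, which the examples above do not cover. Below is a minimal sketch of an SQL-style LEFT JOIN using merge, with a hypothetical day-level lookup table (day_type is not part of the tips dataset):

```python
import pandas as pd

tips = pd.DataFrame({"day": ["Sun", "Sat", "Thur"], "total_bill": [20.0, 30.0, 15.0]})
# hypothetical lookup table mapping day codes to a weekday/weekend label
day_type = pd.DataFrame({"day": ["Sat", "Sun"], "day_type": ["weekend", "weekend"]})

# Equivalent of: SELECT ... FROM tips LEFT JOIN day_type USING (day)
joined = tips.merge(day_type, on="day", how="left")
print(joined)
```

With how='left', days missing from the lookup table (here Thur) get NaN in day_type, matching SQL LEFT JOIN semantics.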