Confidence Interval Plot Python
A tutorial on how to create a confidence interval plot in Python.
About
This blog post details how to create a confidence interval plot in Python using the Altair visualization package. Altair is a declarative statistical visualization library based on Vega and Vega-Lite. It is one of my favorite visualization packages in Python. More details can be found in the Altair documentation.
Let's load the packages and get the data from the cars data set.
import altair as alt
import numpy as np
import pandas as pd
from vega_datasets import data
source = data.cars()
source.head()
Create a plot showing how miles per gallon changes by year
Altair has built-in capabilities to create this visualization:
- Create a base line chart showing the average miles per gallon per year
- Create a confidence interval band chart using mark_errorband()
- Layer the line and CI band charts to create the final visualization
# Line showing the mean miles per gallon for each year
line = (alt
    .Chart(source)
    .mark_line(color='blue')
    .encode(x='Year',
            y='mean(Miles_per_Gallon)'))
# Band showing the confidence interval of the mean for each year
band = (alt
    .Chart(source)
    .mark_errorband(extent='ci', color='blue')
    .encode(x='Year',
            y=alt.Y('Miles_per_Gallon', title='Miles/Gallon')))
# Layer the band under the line
(band + line).properties(title='Confidence Interval Plot of miles per gallon')
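To get a feel for what the band represents: the Vega-Lite documentation describes the 'ci' extent as a bootstrapped 95% confidence interval of the mean. The snippet below is only a rough sketch of the percentile bootstrap idea in plain Python, not Vega-Lite's actual implementation. The function name `bootstrap_ci_mean`, its parameters, and the sample values are made up for illustration.

```python
import random
import statistics

def bootstrap_ci_mean(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean (illustrative helper)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Read the interval off the empirical quantiles of the resampled means.
    alpha = (1 - level) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1 - alpha) * n_resamples) - 1]
    return lower, upper

# Made-up miles-per-gallon values for a single year (assumption, not real data).
mpg = [18.0, 15.0, 18.0, 16.0, 17.0, 15.0, 14.0, 14.0]
lo, hi = bootstrap_ci_mean(mpg)
print(f"mean={statistics.fmean(mpg):.2f}, CI=({lo:.2f}, {hi:.2f})")
```

The band in the chart is, conceptually, this kind of interval computed once per year.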
Let's say you want to understand how mileage varies by origin. This can be done by simply encoding Origin as color in the plot.
# The color encoding replaces the fixed mark color, so it is not set on the marks
line = (alt
    .Chart(source)
    .mark_line()
    .encode(x='Year',
            y='mean(Miles_per_Gallon)',
            color='Origin'))
band = (alt
    .Chart(source)
    .mark_errorband(extent='ci')
    .encode(x='Year',
            y=alt.Y('Miles_per_Gallon', title='Miles/Gallon'),
            color='Origin'))
(band + line).properties(title='Confidence Interval of miles per gallon by origin')
In most real-world situations you have a large dataset and still need to plot confidence intervals. In this scenario it is better to precompute the confidence interval from the mean and the margin of error. Here the margin of error is 1.96 standard errors, which corresponds to a 95% interval under a normal approximation. Let's create a pandas data frame with the required fields as shown below:
df = (source
    .groupby(['Year'])
    # Mean, sample standard deviation and non-null count per year
    .agg(avg_mpg=('Miles_per_Gallon', 'mean'),
         std_mpg=('Miles_per_Gallon', 'std'),
         n=('Miles_per_Gallon', 'count'))
    # Upper and lower limits: mean +/- 1.96 * standard error
    .assign(ul=lambda x: x['avg_mpg'] + 1.96 * x['std_mpg'] / np.sqrt(x['n']),
            ll=lambda x: x['avg_mpg'] - 1.96 * x['std_mpg'] / np.sqrt(x['n']))
    .reset_index()
)
df.head()
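As a sanity check on the ul and ll columns above, here is the same formula (mean plus or minus 1.96 times the standard error) worked through for a single group using only the standard library. The helper name `normal_ci` and the sample values are made up for illustration.

```python
import math
import statistics

def normal_ci(values, z=1.96):
    """Normal-approximation confidence interval for the mean (illustrative helper)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 in the denominator), like the pandas 'std' default.
    std = statistics.stdev(values)
    margin_of_error = z * std / math.sqrt(n)
    return mean - margin_of_error, mean + margin_of_error

# Made-up miles-per-gallon values for one year (assumption, not real data).
ll, ul = normal_ci([18.0, 15.0, 18.0, 16.0, 17.0])
print(round(ll, 2), round(ul, 2))  # roughly 15.66 and 17.94
```

Note that 1.96 assumes the sample mean is approximately normal. For very small groups a t-based multiplier would give a wider interval.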
The few lines of code below create the custom confidence interval plot:
# The charts carry no data of their own; it is supplied once in alt.layer()
line = (alt
    .Chart()
    .mark_line(color='blue')
    .encode(x='Year',
            y='avg_mpg'))
# Area between the lower limit (y) and the upper limit (y2)
band = (alt
    .Chart()
    .mark_area(opacity=0.5, color='blue')
    .encode(x='Year',
            y=alt.Y('ll', axis=alt.Axis(title='Miles/Gallon', ticks=False)),
            y2=alt.Y2('ul')))
alt.layer(band, line, data=df).properties(title='Confidence Interval of miles per gallon (Custom)')
The confidence interval plot is one of the most important tools in a data scientist's toolkit for understanding the uncertainty of a metric. Altair provides excellent visualization capabilities to make this plot with a few lines of Python code.