About

This blog post details how to create a confidence interval plot in Python using the Altair visualization package. Altair is a declarative statistical visualization library based on Vega and Vega-Lite. It is one of my favorite visualization packages in Python. More details can be found in the Altair documentation.

Let's load the packages and get the data from the cars dataset.

import altair as alt
import numpy as np
import pandas as pd
from vega_datasets import data

source = data.cars()

source.head()

Create a plot showing how miles per gallon changes by year

Altair has built-in capabilities to create this visualization:

Create a base line chart showing the average miles per gallon per year
Create a confidence interval band chart using mark_errorband()
Layer the line and CI band charts to create the final visualization

line = (alt
        .Chart(source).mark_line(color='blue')
        .encode(x='Year',
                y='mean(Miles_per_Gallon)'))

band = (alt
        .Chart(source)
        .mark_errorband(extent='ci',color='blue')
        .encode(x='Year',
                y=alt.Y('Miles_per_Gallon', title='Miles/Gallon')))

(band + line).properties(title='Confidence Interval Plot of miles per gallon')
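
For intuition about what extent='ci' draws: the Vega-Lite documentation describes this band as a bootstrapped 95% confidence interval of the mean. The block below is only a rough pure-Python sketch of that idea, not Altair's actual implementation. The function name bootstrap_ci_of_mean, the number of resamples, and the seed are assumptions made for illustration, and the sample is just the five mpg values visible in the source.head() output.

```python
import random
import statistics


def bootstrap_ci_of_mean(values, n_resamples=1000, level=0.95, seed=0):
    """Hypothetical helper: percentile bootstrap interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort
    means = sorted(
        statistics.fmean(rng.choices(values, k=n))
        for _ in range(n_resamples)
    )
    # Pick the percentiles that cut off (1 - level) / 2 in each tail
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# The five Miles_per_Gallon values shown by source.head()
mpg_sample = [18.0, 15.0, 18.0, 16.0, 17.0]

low, high = bootstrap_ci_of_mean(mpg_sample)
print(low, statistics.fmean(mpg_sample), high)
```

In the chart, an interval of this kind is computed separately for every value of Year, which is why the band widens where the yearly data are more spread out or sparser.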

Suppose you want to understand how mileage varies by origin. This can be done by simply encoding color in the plot.

line = (alt
        .Chart(source).mark_line(color='blue')
        .encode(x='Year',
                y='mean(Miles_per_Gallon)',
                color='Origin'))

band = (alt
        .Chart(source)
        .mark_errorband(extent='ci',color='blue')
        .encode(x='Year',
                y=alt.Y('Miles_per_Gallon', title='Miles/Gallon'),
                color='Origin'))

(band + line).properties(title='Confidence Interval of miles per gallon by origin')

Create a confidence interval plot from grouped data

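In most real-world situations you have a large dataset and still need to plot confidence intervals. In this scenario it is better to precompute the confidence interval from the mean and the margin of error. Here the margin of error is 1.96 times the standard error (the standard deviation divided by the square root of the group size), which gives an approximate 95% interval. Let's create a pandas data frame with the required fields, as shown below: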

df=(source
 .groupby(['Year'])
 .agg(avg_mpg=('Miles_per_Gallon','mean'),
     std_mpg=('Miles_per_Gallon','std'),
     n=('Miles_per_Gallon','count'))
 .assign(ul=lambda x:x['avg_mpg']+1.96*x['std_mpg']/np.sqrt(x['n']),
        ll=lambda x:x['avg_mpg']-1.96*x['std_mpg']/np.sqrt(x['n']))
 .reset_index()
)

df.head()
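
As a quick sanity check on the formula, the limits for a single year can be reproduced by hand. The sketch below plugs in the 1970 summary values (avg_mpg, std_mpg, n) reported in the df.head() output of this post. The variable name margin_of_error is introduced here only for illustration.

```python
import math

# Summary values for 1970, copied from the df.head() output of this post
avg_mpg = 17.689655
std_mpg = 5.339231
n = 29

# 1.96 is the normal-approximation critical value for a 95% interval
margin_of_error = 1.96 * std_mpg / math.sqrt(n)

ul = avg_mpg + margin_of_error  # upper limit
ll = avg_mpg - margin_of_error  # lower limit

print(round(ul, 4), round(ll, 4))
```

The results should agree with the ul and ll columns for 1970 (19.632937 and 15.746373) up to rounding of the inputs.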

The few lines of code below create the custom confidence interval plot:

line = (alt
        .Chart()
        .mark_line(color='blue')
        .encode(x='Year',
                y='avg_mpg'))

band = (alt
        .Chart()
        .mark_area(opacity=0.5,color='blue')
        .encode(x='Year',
                y=alt.Y('ll', axis=alt.Axis(title='Miles/Gallon',ticks=False)),
                y2=alt.Y2('ul')))

alt.layer(band, line, data=df).properties(title='Confidence Interval of miles per gallon (Custom)')
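
The same precomputation could be extended to the by-origin view by grouping on both Year and Origin. The following is a minimal sketch under stated assumptions: summarize_ci is a hypothetical helper (not part of pandas or Altair), the output column names avg, std, n, moe, ul and ll are illustrative, and the toy frame uses made-up values so the block runs without downloading the cars data.

```python
import numpy as np
import pandas as pd


def summarize_ci(frame, group_cols, value_col, z=1.96):
    """Hypothetical helper: mean and normal-approximation limits per group."""
    return (frame
            .groupby(group_cols)
            .agg(avg=(value_col, 'mean'),
                 std=(value_col, 'std'),
                 n=(value_col, 'count'))
            # moe = margin of error; later columns reuse it
            .assign(moe=lambda x: z * x['std'] / np.sqrt(x['n']),
                    ul=lambda x: x['avg'] + x['moe'],
                    ll=lambda x: x['avg'] - x['moe'])
            .reset_index())


# Made-up toy data with the same column names as the cars dataset
toy = pd.DataFrame({
    'Year': [1970] * 4 + [1971] * 4,
    'Origin': ['USA', 'USA', 'Japan', 'Japan'] * 2,
    'Miles_per_Gallon': [18.0, 16.0, 27.0, 25.0, 19.0, 17.0, 30.0, 28.0],
})

summary = summarize_ci(toy, ['Year', 'Origin'], 'Miles_per_Gallon')
print(summary)
```

With the real data, calling the helper on source instead of toy should give one row per year and origin. The line and band charts above could then be reused by pointing them at the new column names and adding color='Origin' to both encodings.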

Conclusion

The confidence interval plot is one of the most important tools in a data scientist's toolkit for understanding the uncertainty of a metric. Altair provides excellent visualization capabilities to make this plot with a few lines of Python code.

Output of source.head():

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA

Output of df.head():

	Year	avg_mpg	std_mpg	n	ul	ll
0	1970-01-01	17.689655	5.339231	29	19.632937	15.746373
1	1971-01-01	21.250000	6.591942	28	23.691690	18.808310
2	1972-01-01	18.714286	5.435529	28	20.727634	16.700938
3	1973-01-01	17.100000	4.700245	40	18.556621	15.643379
4	1974-01-01	22.703704	6.420010	27	25.125345	20.282062