About

This blog post details how to create a confidence interval plot in Python using the Altair visualization package. Altair is a declarative statistical visualization library based on Vega and Vega-Lite. It is one of my favorite visualization packages in Python. More details can be found here

Let's load the packages and get the data from the cars dataset.

import altair as alt
import numpy as np
import pandas as pd
from vega_datasets import data

source = data.cars()

source.head()
                        Name  Miles_per_Gallon  Cylinders  Displacement  Horsepower  Weight_in_lbs  Acceleration        Year  Origin
0  chevrolet chevelle malibu              18.0          8         307.0       130.0           3504          12.0  1970-01-01     USA
1          buick skylark 320              15.0          8         350.0       165.0           3693          11.5  1970-01-01     USA
2         plymouth satellite              18.0          8         318.0       150.0           3436          11.0  1970-01-01     USA
3              amc rebel sst              16.0          8         304.0       150.0           3433          12.0  1970-01-01     USA
4                ford torino              17.0          8         302.0       140.0           3449          10.5  1970-01-01     USA

Create a plot showing how miles per gallon changes by year

Altair has built-in capabilities to create this visualization:

  1. Create a base line chart showing the average miles per gallon per year
  2. Create a confidence interval band chart using mark_errorband()
  3. Layer the line and CI band charts to create the final visualization

line = (alt
        .Chart(source).mark_line(color='blue')
        .encode(x='Year',
                y='mean(Miles_per_Gallon)'))

band = (alt
        .Chart(source)
        .mark_errorband(extent='ci',color='blue')
        .encode(x='Year',
                y=alt.Y('Miles_per_Gallon', title='Miles/Gallon')))

(band + line).properties(title='Confidence Interval Plot of miles per gallon')
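
To give a feel for what the band drawn with extent='ci' represents, here is a minimal sketch of a percentile bootstrap interval for a mean. Vega-Lite describes its ci extent as a bootstrapped 95% confidence interval of the mean; the helper name bootstrap_ci, the resample count, and the seed are assumptions made for illustration and are not Altair's internals. The five values are the Miles_per_Gallon entries shown in source.head() above.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative helper)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record the mean of each resample, then sort
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Take the percentiles that cut off (1 - level) / 2 in each tail
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Miles_per_Gallon for the five rows printed by source.head()
mpg_sample = [18.0, 15.0, 18.0, 16.0, 17.0]
low, high = bootstrap_ci(mpg_sample)
print(f"mean={statistics.fmean(mpg_sample):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Because the interval comes from resampling, the band edges in the chart can shift slightly between renders, and the interval narrows as the number of cars per year grows.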

Suppose you want to understand how mileage varies by origin. This can be done by simply adding a color encoding to both charts:

line = (alt
        .Chart(source).mark_line(color='blue')
        .encode(x='Year',
                y='mean(Miles_per_Gallon)',
                color='Origin'))

band = (alt
        .Chart(source)
        .mark_errorband(extent='ci',color='blue')
        .encode(x='Year',
                y=alt.Y('Miles_per_Gallon', title='Miles/Gallon'),
                color='Origin'))

(band + line).properties(title='Confidence Interval of miles per gallon by origin')

Create a confidence interval plot from grouped data

In most real-world situations you have a large dataset and still need to plot confidence intervals. In this scenario it is better to precompute the confidence interval from the mean and the margin of error, so that only the aggregated rows are passed to the chart. Let's create a pandas data frame with the required fields as shown below:

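# Aggregate to one row per year: mean, sample standard deviation, and count
df = (source
      .groupby(['Year'])
      .agg(avg_mpg=('Miles_per_Gallon', 'mean'),
           std_mpg=('Miles_per_Gallon', 'std'),
           n=('Miles_per_Gallon', 'count'))
      # 95% limits: mean +/- 1.96 * standard error (normal approximation)
      .assign(ul=lambda x: x['avg_mpg'] + 1.96 * x['std_mpg'] / np.sqrt(x['n']),
              ll=lambda x: x['avg_mpg'] - 1.96 * x['std_mpg'] / np.sqrt(x['n']))
      .reset_index()
)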

df.head()
         Year    avg_mpg   std_mpg   n         ul         ll
0  1970-01-01  17.689655  5.339231  29  19.632937  15.746373
1  1971-01-01  21.250000  6.591942  28  23.691690  18.808310
2  1972-01-01  18.714286  5.435529  28  20.727634  16.700938
3  1973-01-01  17.100000  4.700245  40  18.556621  15.643379
4  1974-01-01  22.703704  6.420010  27  25.125345  20.282062
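
For readers wondering where the 1.96 multiplier comes from, the sketch below derives it from the standard normal distribution and reproduces the 1970 limits from the table above. The helper name normal_ci is an assumption introduced for illustration; it uses the same normal approximation as the aggregation code, which may be somewhat narrow for small groups.

```python
import math
from statistics import NormalDist


def normal_ci(mean, std, n, level=0.95):
    """Normal-approximation confidence interval for a mean (illustrative helper)."""
    # Two-sided critical value: about 1.96 for a 95% level
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * std / math.sqrt(n)  # margin of error = z * standard error
    return mean - margin, mean + margin


# avg_mpg, std_mpg and n for 1970, taken from df.head() above
ll, ul = normal_ci(17.689655, 5.339231, 29)
print(f"ll={ll:.4f}, ul={ul:.4f}")
```

The result agrees with the ll and ul columns to about three decimal places; the small difference comes from rounding the critical value to 1.96.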

The few lines of code below create the custom confidence interval plot from the precomputed limits:

# No data is attached to the individual charts; it is supplied once in alt.layer()
line = (alt
        .Chart()
        .mark_line(color='blue')
        .encode(x='Year:T',
                y='avg_mpg:Q'))

# The area spans the lower limit (y) to the upper limit (y2)
band = (alt
        .Chart()
        .mark_area(opacity=0.5, color='blue')
        .encode(x='Year:T',
                y=alt.Y('ll:Q', axis=alt.Axis(title='Miles/Gallon', ticks=False)),
                y2=alt.Y2('ul')))

alt.layer(band, line, data=df).properties(title='Confidence Interval of miles per gallon (Custom)')
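
The same precomputation can be extended to more than one grouping column, for example year and origin. The sketch below wraps the aggregation in a hypothetical helper, ci_by_group, and runs it on a tiny made-up frame that stands in for source; both the helper name and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def ci_by_group(frame, value, by, z=1.96):
    """Mean and normal-approximation CI limits of `value` per group (illustrative)."""
    return (frame
            .groupby(by)
            .agg(avg=(value, 'mean'),
                 std=(value, 'std'),
                 n=(value, 'count'))
            .assign(ul=lambda x: x['avg'] + z * x['std'] / np.sqrt(x['n']),
                    ll=lambda x: x['avg'] - z * x['std'] / np.sqrt(x['n']))
            .reset_index())


# Made-up values, NOT taken from the cars dataset
toy = pd.DataFrame({
    'Year': [1970, 1970, 1970, 1970, 1971, 1971, 1971, 1971],
    'Origin': ['USA', 'USA', 'Japan', 'Japan', 'USA', 'USA', 'Japan', 'Japan'],
    'Miles_per_Gallon': [18.0, 16.0, 24.0, 27.0, 19.0, 17.0, 25.0, 31.0],
})

# One row per (Year, Origin) pair with its own lower and upper limit
df_origin = ci_by_group(toy, 'Miles_per_Gallon', ['Year', 'Origin'])
print(df_origin)
```

With the real data, calling ci_by_group(source, 'Miles_per_Gallon', ['Year', 'Origin']) and adding color='Origin:N' to both the line and the area encodings should give one band per origin, using the column names avg, ll and ul returned by the helper.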

Conclusion

The confidence interval plot is one of the most important tools in a data scientist's toolkit for understanding the uncertainty of a metric. Altair provides excellent visualization capabilities for making this plot with a few lines of Python code.