Audience Splitting in A/B Experiments
A tutorial on how to split an audience deterministically using hashing.
About
One key element in running an A/B experiment is splitting the audience based on the unit of diversion. Most experimentation platforms do this splitting for us, but there are situations in which an analyst needs to run an A/B experiment and perform the split themselves. In most organizations data is stored in a database, so it is convenient to perform treatment assignment directly in SQL. We also need the audience split to perform post-hoc analysis of the experiment. In this blog, I will show how to perform audience splitting in Spark and Hive using an example.
- Let's create a Spark session connected to the local server.
- Let's create a dummy dataset of 100,000 customers along with their gender.
- Add a uuid column to the dataframe to uniquely identify each user.
- Convert the pandas dataframe to a Spark dataframe.
- Register the Spark dataframe as "user_table" so it can be queried with Hive SQL.
import pyspark
import altair as alt
import numpy as np
import pandas as pd
import uuid
import scipy.stats as sc
from vega_datasets import data
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.enableHiveSupport() \
.getOrCreate()
customers = (pd.DataFrame({'user': np.arange(100000),
'gender':[np.random.choice(['m','f'], p=[0.55,0.45]) for _ in np.arange(100000)]})
.assign(user_uuid=[uuid.uuid4() for _ in range(100000)])
)
customers.head()
sdf=spark.createDataFrame(customers.astype(str))
sdf.createOrReplaceTempView("user_table")
sdf.toPandas().head()
- Select the unit of diversion key: user_uuid in our case (the ID field we want to split on).
- Choose a salt ('new_widget' in our example), a unique value that identifies our experiment.
- Concatenate user_uuid with the selected salt.
- Apply a hashing algorithm such as md5 to split the audience into treatment and control.
query="""select
user_uuid,
if(
conv(
substr(
md5(concat(user_uuid, '-','new_widget')),
1, 6),
16,10)/conv('ffffff',16,10) > 0.50, 'treatment', 'control') as treatment
,gender
from user_table
"""
df_audience=spark.sql(query).toPandas()
Let's visualize the split; it looks like the assignment is 50-50. But how do we validate this with statistical rigor?
(df_audience
.groupby('treatment')
.agg(users=('user_uuid','count'))
.reset_index()
.assign(percent_users=lambda x:(x['users']/x['users'].sum())*100)
.style.format({'percent_users':'{0:.2f}%'.format})
)
One way to validate this is to see whether the distribution of gender is random across treatment and control. This can be translated into a chi-square test with the following hypotheses:
Null Hypothesis H0: Gender is independent of treatment assignment
Alternate Hypothesis Ha: Gender is not independent of treatment assignment
Let's run a chi-square test. A p-value of 0.14 indicates we can't reject the null hypothesis: gender is independent of the treatment assignment.
chi2, p, dof, expected=sc.chi2_contingency(pd.crosstab(df_audience.treatment,
df_audience.gender,
values=df_audience.user_uuid,
aggfunc='count'))
print("p-value is {}".format(p))
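A complementary check is whether the overall split itself is consistent with the 50/50 target. A two-sided binomial test sketch (the counts below are made-up illustrative numbers; in practice take them from `df_audience['treatment'].value_counts()`):

```python
from scipy import stats

# Hypothetical counts; substitute the real value_counts() from df_audience.
n_treatment, n_total = 49_880, 100_000

# H0: P(treatment) = 0.5. A large p-value means the observed split is
# consistent with a fair 50/50 assignment.
result = stats.binomtest(n_treatment, n_total, p=0.5)
print("p-value is {:.3f}".format(result.pvalue))
```

Together with the chi-square test on gender, this gives reasonable evidence that the hash-based assignment is behaving like a fair random split.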