Audience Splitting in A/B Experiments
A tutorial on how to split an audience deterministically using hashing.
About
One key element in running an A/B experiment is splitting the audience based on the unit of diversion. Most experimentation platforms do this splitting for us, but there are situations in which an analyst needs to run an A/B experiment and perform the split themselves. In most organizations data is stored in a database, so it is convenient to perform treatment assignment directly in SQL. We also need the audience split to perform post-hoc analysis of the experiment. In this blog, I will show how to perform audience splitting in Spark and Hive using an example.
- Let's create a Spark session connected to the local server.
- Let's create a dummy dataset of 100,000 customers along with their gender.
- Add a uuid column to the dataframe to uniquely identify each user.
- Convert the pandas dataframe to a Spark dataframe.
- Register the Spark dataframe as "user_table" so it can be queried with Hive SQL.
import pyspark
import altair as alt
import numpy as np
import pandas as pd
import uuid
import scipy.stats as sc
from vega_datasets import data
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.enableHiveSupport() \
.getOrCreate()
customers = (pd.DataFrame({'user': np.arange(100000),
'gender':[np.random.choice(['m','f'], p=[0.55,0.45]) for _ in np.arange(100000)]})
.assign(user_uuid=[uuid.uuid4() for _ in range(100000)])
)
customers.head()
sdf=spark.createDataFrame(customers.astype(str))
sdf.createOrReplaceTempView("user_table")
sdf.toPandas().head()
- Select the unit of diversion key: user_uuid in our case (the ID field we want to split on).
- Choose a salt ('new_widget' in our example), a unique value that identifies our experiment.
- Concatenate user_uuid with the selected salt.
- Apply a hashing algorithm such as md5 to split the audience into treatment and control.
query="""select
user_uuid,
if(
conv(
substr(
md5(concat(user_uuid, '-','new_widget')),
1, 6),
16,10)/conv('ffffff',16,10) > 0.50, 'treatment', 'control') as treatment
,gender
from user_table
"""
df_audience=spark.sql(query).toPandas()
Let's visualize the split; it looks like the assignment is 50-50. But how do we validate this with statistical rigor?
(df_audience
.groupby('treatment')
.agg(users=('user_uuid','count'))
.reset_index()
.assign(percent_users=lambda x:(x['users']/x['users'].sum())*100)
.style.format({'percent_users':'{0:.2f}%'.format})
)
One way to validate this is to see whether the distribution of gender is random across treatment and control. This can be translated into a chi-square test with the following hypotheses:
Null Hypothesis H0: Gender is independent of treatment assignment
Alternate Hypothesis Ha: Gender is not independent of treatment assignment
Let's run a chi-square test. A p-value of 0.14 indicates we can't reject the null hypothesis: gender is independent of the treatment assignment.
chi2, p, dof, expected=sc.chi2_contingency(pd.crosstab(df_audience.treatment,
df_audience.gender,
values=df_audience.user_uuid,
aggfunc='count'))
print("p-value is {}".format(p))
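A complementary check is whether the overall split itself is consistent with the 50/50 target. A two-sided binomial test sketch (the counts below are made-up illustrative numbers; in practice take them from `df_audience['treatment'].value_counts()`):

```python
from scipy import stats

# Hypothetical counts; substitute the real value_counts() from df_audience.
n_treatment, n_total = 49_880, 100_000

# H0: P(treatment) = 0.5. A large p-value means the observed split is
# consistent with a fair 50/50 assignment.
result = stats.binomtest(n_treatment, n_total, p=0.5)
print("p-value is {:.3f}".format(result.pvalue))
```

Together with the chi-square test on gender, this gives reasonable evidence that the hash-based assignment is behaving like a fair random split.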