About

One key element in running an A/B experiment is splitting the audience based on the unit of diversion. Most experimentation platforms do this split for us, but there are situations in which an analyst needs to run an A/B experiment and perform the split themselves. In most organizations data lives in a database, so it is convenient to perform treatment assignment directly in SQL. We also need the audience split to perform post-hoc analysis of the experiment. In this blog, I will show how to perform audience splitting in Spark and Hive using an example.

Data Preparation

  1. Let's create a Spark session connected to the local server.
  2. Let's create a dummy dataset of 100,000 customers along with gender information.
  3. Add a uuid column to the dataframe to uniquely identify each user.
  4. Convert the pandas dataframe to a Spark dataframe.
  5. Register the Spark dataframe as "user_table" so it can be accessed from Hive.
import pyspark 
import altair as alt
import numpy as np
import pandas as pd
import uuid
import scipy.stats as sc
from vega_datasets import data
from pyspark.sql import SparkSession
spark = SparkSession \
          .builder \
          .enableHiveSupport() \
          .getOrCreate()

customers = (pd.DataFrame({'user': np.arange(100000),
                      'gender':[np.random.choice(['m','f'], p=[0.55,0.45]) for _ in np.arange(100000)]})
         .assign(user_uuid=[uuid.uuid4() for _ in range(100000)])
        )

customers.head()
user gender user_uuid
0 0 m 817be0d1-067c-41b8-86bc-ef6ab335ff46
1 1 m afbac2c3-c2ae-413d-9d00-712da8ce5eb2
2 2 m c8c990fa-7884-4c1d-89e2-d5e8af0a33fe
3 3 m 43fd874f-4644-405a-ae5e-44c01c7d3871
4 4 f 9d78651b-d55f-4d7b-bce7-5d036b95ac6c
sdf=spark.createDataFrame(customers.astype(str))
sdf.createOrReplaceTempView("user_table") 
sdf.toPandas().head()
user gender user_uuid
0 0 f 12d288b0-91e3-471c-849f-38b6e3961a88
1 1 m b1ea28f2-35fd-4334-92f9-e19fb3cfc924
2 2 f e636cd3d-6182-4ee0-98d9-bed9350c996d
3 3 f 5f053ff3-5965-4114-808e-636e83c22647
4 4 f f32af45d-36ff-4996-9704-99f9143a03de

Audience splitting

Cool hashing trick to perform audience splitting

  1. Select the unit-of-diversion key: user_uuid in our case (the ID field we want to split on).
  2. Pick a salt ('new_widget' in our example), a unique value that identifies our experiment.
  3. Concatenate user_uuid with the selected salt.
  4. Apply a hashing algorithm such as md5 to split the audience into treatment and control.
query = """
select
    user_uuid,
    if(conv(substr(md5(concat(user_uuid, '-', 'new_widget')), 1, 6), 16, 10)
       / conv('ffffff', 16, 10) > 0.50,
       'treatment', 'control') as treatment,
    gender
from user_table
"""
df_audience = spark.sql(query).toPandas()
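The same expression can be reproduced outside of Spark with nothing but Python's standard library, which is handy for spot-checking individual assignments. This is a minimal sketch (the function name `assign_bucket` is my own, not part of the Hive query):

```python
import hashlib

def assign_bucket(user_uuid, salt='new_widget', threshold=0.5):
    """Deterministically map a user to treatment or control.

    Mirrors the Hive expression: md5 of "<uuid>-<salt>", take the
    first six hex characters, scale them to [0, 1], and compare
    against the threshold.
    """
    digest = hashlib.md5(f"{user_uuid}-{salt}".encode()).hexdigest()
    score = int(digest[:6], 16) / int('ffffff', 16)
    return 'treatment' if score > threshold else 'control'
```

Because the hash is deterministic, calling `assign_bucket` with the same user_uuid and salt always returns the same bucket, while changing the salt re-randomizes the split for a new experiment.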

Validation of assignment

The chi-square test of independence is our friend

Let's look at the split: the assignment appears to be 50-50. But how do we validate this with statistical rigor?

(df_audience
 .groupby('treatment')
 .agg(users=('user_uuid','count'))
 .reset_index()
 .assign(percent_users=lambda x:(x['users']/x['users'].sum())*100)
 .style.format({'percent_users':'{0:.2f}%'.format})
)
treatment users percent_users
0 control 50180 50.18%
1 treatment 49820 49.82%

One way to validate this is to see if the distribution of gender is random across treatment and control. This can be translated into a chi-square test with the following hypotheses:

Null Hypothesis H0: Gender is independent of treatment assignment
Alternate Hypothesis Ha: Gender is not independent of treatment assignment

Let's run a chi-square test. A p-value of 0.14 means we fail to reject the null hypothesis: we have no evidence that gender is associated with the treatment assignment.

chi2, p, dof, expected=sc.chi2_contingency(pd.crosstab(df_audience.treatment,
                                                       df_audience.gender,
                                                       values=df_audience.user_uuid,
                                                       aggfunc='count'))
print("p-value is {}".format(p))
p-value is 0.14426225571462634

Conclusion

Hashing is a very useful technique for assigning users to treatment and control in a deterministic way: given the user_uuid and the salt, we can always recover the experiment assignment. The same trick can be implemented easily in any SQL database.
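To illustrate that last point, here is a sketch of the same split running in SQLite via Python's standard library. SQLite has no built-in md5() or hex-to-integer conversion, so I register small Python helpers (`md5` and `hexdec` are my own names); engines such as MySQL, Hive, or PostgreSQL with pgcrypto expose equivalents natively:

```python
import hashlib
import sqlite3

conn = sqlite3.connect(':memory:')
# Register the two functions SQLite is missing.
conn.create_function('md5', 1,
                     lambda s: hashlib.md5(s.encode()).hexdigest())
conn.create_function('hexdec', 1, lambda h: int(h, 16))

conn.execute("create table user_table (user_uuid text)")
conn.executemany("insert into user_table values (?)",
                 [(str(i),) for i in range(2000)])

# Same logic as the Hive query: hash uuid + salt, scale the first six
# hex digits to [0, 1], and threshold at 0.5 (16777215 = 0xffffff).
rows = conn.execute("""
    select user_uuid,
           case when hexdec(substr(md5(user_uuid || '-' || 'new_widget'), 1, 6))
                     / 16777215.0 > 0.5
                then 'treatment' else 'control' end as treatment
    from user_table
""").fetchall()
```

The only database-specific pieces are the md5 and base-conversion functions; the concatenate-hash-threshold pattern itself is plain SQL.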