Using Simulation to Estimate the Power of an A/B experiment
A tutorial on estimating power of an A/B experiment
About
This article was originally posted on my Medium blog.
The power of an experiment is the probability that it rejects the null hypothesis when a specific alternate hypothesis is true. For example, an e-commerce company is trying to increase the time users spend on the website by changing the design of the website. They plan to use the well-known two-sample t-test. Power helps answer the question: will the t-test be able to detect a difference in mean time spent (if it exists) by rejecting the null hypothesis?
Let's state the hypotheses:
Null Hypothesis H0: New design has no effect on the time users spend on the website
Alternate Hypothesis Ha: New design impacts the time users spend on the website
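In symbols, the same pair of hypotheses can be written as below. The notation is an assumption introduced here for illustration: μ_control and μ_treatment stand for the mean time spent under the current and the new design. The two-sided form matches the `alternative='two-sided'` setting used in the analytical calculation later in the article.

```latex
H_0:\ \mu_{\text{treatment}} = \mu_{\text{control}}
\qquad
H_a:\ \mu_{\text{treatment}} \neq \mu_{\text{control}}
```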
When an A/B experiment is run to measure the impact of the website redesign, we want to ensure that the experiment has at least 80% power. The following parameters impact the power of the experiment:
1. Sample size (n): The larger the sample size, the smaller the standard error and the narrower the sampling distribution of the mean. Increasing the sample size increases the power of the experiment.
2. Effect size (𝛿): The difference between the means of the sampling distributions under the null and alternate hypotheses. The smaller the effect size, the more samples are needed to detect it at a predefined power.
3. Alpha (𝛼): The significance level, typically set at 0.05; this is the p-value cutoff below which we reject the null hypothesis. Making alpha smaller requires more samples to detect an effect at a predefined power.
4. Beta (β): The probability of a Type II error, that is, failing to reject the null hypothesis when the alternate hypothesis is true. Power is defined as 1-β.
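To make these relationships concrete, here is a minimal sketch that uses a normal approximation to the two-sample test rather than the exact t-test. The function name `approx_power` and the input values (a mean difference of 0.10 with a standard deviation of 1) are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, mean_diff, sd=1.0, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test (equal n, equal sd)."""
    std_normal = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)           # standard error of the difference in means
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = mean_diff / se                      # distance between null and alternative, in SE units
    # Probability of landing in either rejection region when the alternative is true
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


print(approx_power(1000, 0.10))              # baseline
print(approx_power(2000, 0.10))              # larger sample -> higher power
print(approx_power(1000, 0.05))              # smaller effect -> lower power
print(approx_power(1000, 0.10, alpha=0.01))  # stricter alpha -> lower power
```

Each call changes one input relative to the baseline, so the printed values show the direction in which sample size, effect size, and alpha move the power.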
Why is power analysis done to determine the sample size before running an experiment?
- Running experiments is expensive and time-consuming
- An adequately powered experiment has a better chance of finding a significant effect when one exists
- It also increases the chance of replicating an effect detected in an experiment
For example, suppose the time users currently spend on the website is normally distributed with a mean of 2 minutes and a standard deviation of 1 minute. The product manager wants to design an experiment to understand whether the redesigned website increases the time spent on the website.
The experiment should be able to detect a minimum of 5% change in time spent on the website, that is, a shift in the mean from 2 to 2.1 minutes. For a test like this, an exact solution is available to estimate the sample size, since the sampling distribution is known. Here we will use simulation to estimate the sample size and validate the result using the exact method.
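For orientation, a commonly used normal approximation to the per-group sample size of a two-sided, two-sample test is worked through below. It is a rough guide under the assumptions of equal group sizes and equal standard deviations, not the exact t-based solution. The symbols Δ (absolute difference in means) and σ (common standard deviation) are introduced here for illustration.

```latex
n \approx \frac{2\,\sigma^{2}\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{\Delta^{2}},
\qquad
\Delta = 2 \times 0.05 = 0.1,\quad \sigma = 1,\quad \alpha = 0.05,\quad 1-\beta = 0.8

n \approx \frac{2 \times 1^{2} \times (1.960 + 0.842)^{2}}{0.1^{2}}
  = \frac{2 \times 7.851}{0.01}
  \approx 1570 \text{ users per group}
```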
The following steps estimate the power of a two-sample t-test:
- Simulate data for the model under the null hypothesis 𝒩(2,1) and the alternate hypothesis 𝒩(2(1+𝛿),1), where 𝛿 = 0.05 is the relative lift
- Perform a t-test on each pair of samples and record whether it rejects the null hypothesis
- Run the simulation many times and count how often the t-test rejects the null hypothesis; the fraction of rejections is the estimated power
import numpy as np
import scipy.stats as st

# Initialize delta (minimum relative lift the product manager expects), control_mean, control_sd
delta = 0.05
control_mean = 2
control_sd = 1
sample_size = 1000  # users per group
alpha = 0.05  # significance level of the experiment
n_sim = 1000  # number of simulated experiments
np.random.seed(123)  # set seed for reproducibility

def simulate_data(control_mean, control_sd, sample_size, n_sim):
    # Note: delta is read from the module-level variable defined above
    # Simulate the time spent under the null hypothesis (one column per simulated experiment)
    control_time_spent = np.random.normal(loc=control_mean, scale=control_sd, size=(sample_size, n_sim))
    # Simulate the time spent under the alternate hypothesis
    treatment_time_spent = np.random.normal(loc=control_mean*(1+delta), scale=control_sd, size=(sample_size, n_sim))
    return control_time_spent, treatment_time_spent

# Run the t-test on each column (axis=0 by default) and get the p-values
control_time_spent, treatment_time_spent = simulate_data(control_mean, control_sd, sample_size, n_sim)
t_stat, p_value = st.ttest_ind(control_time_spent, treatment_time_spent)
# Power is the share of simulated experiments in which the null hypothesis is rejected
power = (p_value < alpha).sum() / n_sim
print("Power of the experiment {:.1%}".format(power))
# Power of the experiment 58.8%
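A simulated power estimate is itself a sample proportion, so it carries Monte Carlo error. The sketch below puts a rough interval around the 58.8% figure using the binomial standard error. The helper name `power_estimate_interval` is hypothetical, and the interval is a normal approximation rather than an exact one.

```python
from math import sqrt


def power_estimate_interval(power_hat, n_sim, z=1.96):
    """Standard error and approximate 95% interval for a simulated power estimate."""
    se = sqrt(power_hat * (1 - power_hat) / n_sim)  # binomial standard error of a proportion
    return se, (power_hat - z * se, power_hat + z * se)


se, (low, high) = power_estimate_interval(0.588, 1000)
print("SE {:.3f}, interval ({:.3f}, {:.3f})".format(se, low, high))
# SE 0.016, interval (0.557, 0.619)
```

With 1000 simulated experiments the estimate is only good to about three percentage points either way. Increasing the number of simulations narrows the interval at the cost of run time.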
# Increment the sample size until the required power is reached
sample_size = 1000
np.random.seed(123)
while True:
    control_time_spent, treatment_time_spent = simulate_data(control_mean, control_sd, sample_size, n_sim)
    t_stat, p_value = st.ttest_ind(control_time_spent, treatment_time_spent)
    power = (p_value < alpha).sum() / n_sim
    if power > .80:
        print("Minimum sample size required to reach 80% power: {}".format(sample_size))
        break
    else:
        sample_size += 10
# Minimum sample size required to reach 80% power: 1560
# Analytical solution to compute the sample size
from statsmodels.stats.power import tt_ind_solve_power

treat_mean = control_mean*(1+delta)
mean_diff = treat_mean - control_mean
# Cohen's d: mean difference divided by the pooled standard deviation
cohen_d = mean_diff/np.sqrt((control_sd**2+control_sd**2)/2)
n = tt_ind_solve_power(effect_size=cohen_d, alpha=alpha, power=0.8, ratio=1, alternative='two-sided')
print('Minimum sample size required to reach 80% power: {:.0f}'.format(round(n)))
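As an independent sanity check, the sketch below re-estimates the power at the simulated sample size of 1560 using only the Python standard library and a different random stream. It is not the article's method: it swaps the t-test for a two-sided z-test on the mean difference, which should behave almost identically at this sample size. The function name `simulated_power_z` and its defaults are assumptions for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulated_power_z(n_per_group, control_mean=2.0, lift=0.05, sd=1.0,
                      alpha=0.05, n_sim=200, seed=0):
    """Share of simulated experiments rejected by a two-sided z-test on the mean difference."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    treat_mean = control_mean * (1 + lift)
    rejections = 0
    for _ in range(n_sim):
        # Draw one simulated experiment: a control group and a treatment group
        control = [rng.gauss(control_mean, sd) for _ in range(n_per_group)]
        treatment = [rng.gauss(treat_mean, sd) for _ in range(n_per_group)]
        mean_c = sum(control) / n_per_group
        mean_t = sum(treatment) / n_per_group
        # Unbiased sample variances of each group
        var_c = sum((x - mean_c) ** 2 for x in control) / (n_per_group - 1)
        var_t = sum((x - mean_t) ** 2 for x in treatment) / (n_per_group - 1)
        z = (mean_t - mean_c) / sqrt(var_c / n_per_group + var_t / n_per_group)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sim


print(simulated_power_z(1560))
```

Because only 200 experiments are simulated here, the printed value should be read as a noisy estimate that is expected to land near 80%, not as a precise figure.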