About

This article was originally posted on my Medium blog here

Power of an experiment measures the ability of the experiment to detect a specific alternate hypothesis. For example, an e-commerce company is trying to increase the time users spend on the website by changing its design. They plan to use the well-known two-sample t-test. Power helps answer the question: will the t-test be able to detect a difference in mean time spent (if one exists) by rejecting the null hypothesis?

Let's state the hypotheses:

Null Hypothesis H0: New design has no effect on the time users spend on the website
Alternate Hypothesis Ha: New design impacts the time users spend on the website

When an A/B experiment is run to measure the impact of the website redesign, we want to ensure that the experiment has at least 80% power. The following parameters impact the power of the experiment:

1. Sample size (n): The larger the sample size, the smaller the standard error becomes, which makes the sampling distribution narrower. Increasing the sample size increases the power of the experiment (see the sketch after this list)
2. Effect size (𝛿): The difference between the means of the sampling distributions under the null and alternative hypotheses. The smaller the effect size, the more samples are needed to detect an effect at a predefined power
3. Alpha (𝛼): The significance level, typically set at 0.05; this is the cutoff at which we reject or fail to reject the null hypothesis. Making alpha smaller requires more samples to detect an effect at a predefined power
4. Beta (β): The probability of failing to reject a false null hypothesis (a Type II error). Power is defined as 1 − β
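
To make these relationships concrete, here is a minimal sketch using statsmodels' tt_ind_solve_power (the same function used later in this article for the exact method) that computes power for a few illustrative sample sizes and significance levels; the specific values are only examples:

# Illustrative only: how power changes with sample size and alpha
# for a fixed standardized effect size (Cohen's d)
from statsmodels.stats.power import tt_ind_solve_power

effect_size = 0.1  # a small standardized effect
for n in [500, 1000, 2000]:
    for alpha in [0.01, 0.05]:
        power = tt_ind_solve_power(effect_size=effect_size, nobs1=n, alpha=alpha, ratio=1, alternative='two-sided')
        print("n={}, alpha={}: power={:.2f}".format(n, alpha, power))
# Power increases with n and decreases as alpha is made stricter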

Why is power analysis done to determine the sample size before running an experiment?

  1. Running experiments is expensive and time consuming
  2. It increases the chance of finding a significant effect when one exists
  3. It increases the chance of replicating an effect detected in an experiment

For example, suppose the time users currently spend on the website is normally distributed with a mean of 2 minutes and a standard deviation of 1 minute. The product manager wants to design an experiment to understand if the redesigned website helps increase the time spent on the website.

The experiment should be able to detect a minimum 5% change in time spent on the website, i.e., an absolute difference of 𝛿 = 0.1 minutes (5% of 2 minutes). For a test like this, an exact solution for the sample size is available since the sampling distribution is known. Here we will use the simulation method to estimate the sample size and validate it against the exact method.
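
As a quick sanity check before simulating, the standard normal-approximation formula for a two-sample comparison of means (a sketch assuming equal variances and a two-sided test, not the exact t-test calculation done later) gives the per-group sample size directly:

# Normal-approximation sample size per group:
# n ≈ 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2
import scipy.stats as st

sigma = 1.0       # standard deviation of time spent
delta_abs = 0.1   # minimum detectable difference: 5% of 2 minutes
alpha = 0.05
target_power = 0.80
z_alpha = st.norm.ppf(1 - alpha/2)
z_beta = st.norm.ppf(target_power)
n = 2 * (z_alpha + z_beta)**2 * sigma**2 / delta_abs**2
print("Approximate sample size per group: {:.0f}".format(n))
# ≈ 1570, close to the simulated and exact answers below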

The following steps estimate the power of the two-sample t-test:

  1. Simulate data under the null hypothesis 𝒩(2, 1) and the alternate hypothesis 𝒩(2 + 𝛿, 1)
  2. Perform a t-test on each sample and record whether the t-test rejects the null hypothesis
  3. Run the simulation many times and count how often the t-test rejects the null hypothesis; the fraction of rejections is the estimated power

Code to compute the power of the experiment for a specified sample size, effect size and significance level:

Power of the experiment is 58.8% with a sample size of 1000 per group

import numpy as np
import scipy.stats as st

# Initialize delta (minimum lift the product manager expects), control_mean, control_sd
delta = 0.05
control_mean = 2
control_sd = 1
sample_size = 1000
alpha = 0.05  # significance level of the experiment
n_sim = 1000  # total number of experiments to simulate

np.random.seed(123)  # set seed for reproducibility
def simulate_data(control_mean, control_sd, sample_size, n_sim):
    # Simulate the time spent under the null hypothesis
    control_time_spent = np.random.normal(loc=control_mean, scale=control_sd, size=(sample_size, n_sim))
    # Simulate the time spent under the alternate hypothesis
    treatment_time_spent = np.random.normal(loc=control_mean*(1+delta), scale=control_sd, size=(sample_size, n_sim))
    return control_time_spent, treatment_time_spent

# Run the t-test column-wise (one test per simulated experiment) and get the p-values
control_time_spent, treatment_time_spent = simulate_data(control_mean, control_sd, sample_size, n_sim)
t_stat, p_value = st.ttest_ind(control_time_spent, treatment_time_spent)
# Power is the fraction of simulations in which the null hypothesis is rejected
power = (p_value < alpha).sum()/n_sim
print("Power of the experiment {:.1%}".format(power))

Power of the experiment 58.8%

Code to compute the sample size required to reach 80% power for a specified effect size and significance level:

Based on the simulation method, we need 1560 users per group to reach 80% power, and this closely matches the sample size estimated using the exact method

# Increment the sample size until the required power is reached
sample_size = 1000
np.random.seed(123)
while True:
    control_time_spent, treatment_time_spent = simulate_data(control_mean, control_sd, sample_size, n_sim)
    t_stat, p_value = st.ttest_ind(control_time_spent, treatment_time_spent)
    power = (p_value < alpha).sum()/n_sim
    if power > 0.80:
        print("Minimum sample size required to reach 80% power {}".format(sample_size))
        break
    else:
        sample_size += 10

Minimum sample size required to reach 80% power 1560

Code to compute the sample size using the exact method:

# Analytical solution to compute sample size
from statsmodels.stats.power import tt_ind_solve_power

treat_mean = control_mean*(1+delta)
mean_diff = treat_mean - control_mean

# Standardized effect size (Cohen's d) using the pooled standard deviation
cohen_d = mean_diff/np.sqrt((control_sd**2 + control_sd**2)/2)

n = tt_ind_solve_power(effect_size=cohen_d, alpha=alpha, power=0.8, ratio=1, alternative='two-sided')
print("Minimum sample size required to reach 80% power: {:.0f}".format(n))
Minimum sample size required to reach 80% power: 1571

Conclusion

This article explained how simulation can be used to estimate the power of an A/B experiment; the same approach applies even when a closed-form solution doesn't exist.