Statistics

“Never say never”

This well-known idiom reflects that we must approach any statement with an appropriate amount of scepticism. We can never be totally sure that a statement or conclusion is really true. Usually, science starts with a null hypothesis and an alternative hypothesis. The null hypothesis can be accepted or rejected. But this decision is associated with a type 1 error ($ \alpha $: probability of a false positive, i.e. rejection of the null hypothesis when it is actually true) and a type 2 error ($ \beta $: probability of a false negative, i.e. acceptance of the null hypothesis when it is actually false). For this reason, science relies on statistics that indicate how confident one can be that a certain statement is really true, either using a $ p $-value, i.e. the probability that the observed differences between the compared groups arise merely because the random sample drawn from the entire population happened to be an outlier (random sampling error), or using confidence intervals.

It is recommended to use state-of-the-art software that provides statistical methods that are robust even if the data deviates slightly from the requirements of the statistical test used. Recommended software includes R with the packages asht (Fay and Kim 2016; Fay 2018) and WRS2 (Mair and Wilcox 2020).

There are several review articles that explain which statistical tests should be used under given circumstances (du Prel et al. 2010; McDonald 2014; Leeper 2021).

Power analysis

Power analysis is important when planning a study and designing experiments. Test power is the complement of the type II error probability ($ 1 - \beta $). The relationship between type I error, type II error (or its complement, the test power), critical effect size and sample size depends on the statistical test used. To determine the minimum sample size required to detect the critical effect size, power analysis is performed as follows:

  • Choose the minimum effect size you want to detect (e.g. 10%)

  • Choose minimum significance level (acceptable probability of type I error, e.g. $ \alpha = 0.05 $) according to risk assessment (Mudge et al. 2012)

  • Choose the power according to risk assessment (e.g. $ Pwr = 0.8 = 80\% $), i.e. the probability that an actual effect results in correct rejection of $ H_0 $ and acceptance of $ H_1 $ at $ \alpha = 0.05 $, and thereby a correct significant ($ p < \alpha $) result. It follows that $ \beta = 1 - Pwr = 0.2 $ is the probability of a false negative acceptance of the null hypothesis $ H_0 $.

  • Calculate the minimal required sample size $ n $ using R with the package pwr, or using Python with the package statsmodels. The package statsmodels provides functions for power analysis for t-tests: TTestIndPower().solve_power() (also available as statsmodels.stats.power.tt_ind_solve_power()) as well as TTestPower().solve_power(). It also provides functions for power analysis for a one-way ANOVA: [statsmodels.stats.power.FTestAnovaPower.power](https://www.statsmodels.org/stable/generated/statsmodels.stats.power.FTestAnovaPower.power.html).
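As a minimal sketch of the steps above, the required sample size per group for a two-sample t-test can be calculated with statsmodels (the standardized effect size of Cohen's d = 0.5 is a hypothetical value chosen for illustration):

```python
from statsmodels.stats.power import TTestIndPower

# Minimal required sample size per group for a two-sample t-test,
# assuming a standardized effect size of Cohen's d = 0.5 (hypothetical),
# alpha = 0.05 and power = 0.8.
analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                         ratio=1.0, alternative='two-sided')
print(n)  # approximately 64 observations per group
```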

Small sample sizes

When sample sizes are small, caution is required. For sample sizes below n = 5, means and parametric tests are usually meaningless and the standard deviation is underestimated. Tests for normal distribution also typically require $ n > 30 $ (Delacre et al. 2017). Especially for small sample sizes, the exact Wilcoxon-Mann-Whitney test is to be preferred over variants that use the large-sample asymptotic (normal) approximation (Bergmann et al. 2000; Fay 2018). It needs a sample size of n1 = 4, n2 = 4 or n1 = 5, n2 = 3 to be able to produce p < α = 0.05 (one-sided), or n1 = 5, n2 = 4 to produce p/2 < α/2 = 0.025 (two-sided) (Milton 1964). Spearman’s rank correlation coefficient needs a minimum of n = 4 (one-sided) or n = 5 (two-sided), or better n = 6 (Ramsey 1989).
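For illustration, the exact Wilcoxon-Mann-Whitney test can be requested in Python via scipy (a sketch with hypothetical, completely separated samples of size n1 = n2 = 4):

```python
from scipy import stats

# Hypothetical small samples (n1 = n2 = 4) with complete separation;
# method="exact" requests the exact Wilcoxon-Mann-Whitney test instead
# of the large-sample normal approximation.
x = [1.1, 1.3, 1.4, 1.2]
y = [2.0, 2.2, 2.1, 2.3]
res = stats.mannwhitneyu(x, y, alternative="less", method="exact")
print(res.pvalue)  # 1/70, i.e. about 0.014 < 0.05
```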

Common misconceptions

Wilcoxon-Mann-Whitney test

Note that this is a test for differences in mean ranks. Only if the two populations have the same distribution shape does it test for a distribution shift, i.e. for differences in medians and thus for differences in means (Campbell 2006).

Parametric tests

For parametric tests such as t-tests, requirement 1) is not “normal distribution of the parameter under observation for the underlying population or the sample”, but normal distribution of the sample means.

But Student’s t-test also requires 2) that the sample variance $ s^2 $ in the denominator satisfy $ \frac{d \, s^2}{\sigma^2} \sim \chi^2_d $, where $ d $ is the number of degrees of freedom, and 3) that numerator and denominator be independent. Only the normal distribution has all three properties together: the third requirement characterizes the normal distribution (Lukacs 1942), and for finite numbers of independent random variables, only the normal distribution meets the first requirement (Cramér’s decomposition theorem).

However, the central limit theorem (CLT) and Slutsky’s theorem together give you (as long as all their assumptions hold) that as $ n \to \infty $, the distribution of the t-statistic approaches the standard normal distribution. They do not say whether any given finite $ n $ is enough for some purpose. This is useful to know, though, because it means you can use tests with a normal approximation (e.g. the z-test) for large sample sizes ($ n \geq 100 $ is enough in most cases (Schröder and Yitzhaki 2018; Hall and Wang 2004), but the numbers are arbitrary). Thus, if we have samples consisting of hundreds of observations, we can ignore the distribution of the data (Altman et al. 1995; Lehman 2002). The actual required minimum sample size depends on the deviation from normality and on risk assessment. When checking normality, we are checking whether the sampling distribution (the distribution of the sample mean) is normal, and the CLT says it will be approximately normal if each sample is large enough. Note that the CLT implies that for large $ n $ the t and z statistics are approximations of each other (see above) and that for large $ n $ the z statistic can be used for any distribution. However, if both approximations are combined to state that for a given sample size the t statistic can be used for any distribution, $ n $ should be chosen twice as high as for a single approximation (which would mean $ n \geq 200 $).
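To illustrate the large-sample agreement, one can compare the p-values of a t-test and a z-test on deliberately non-normal data (a sketch with simulated exponential samples of n = 200 per group; the z-test used here is statsmodels’ ztest):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

# Simulated, deliberately non-normal (exponential) data, n = 200 per group.
rng = np.random.default_rng(0)
a = rng.exponential(scale=1.0, size=200)
b = rng.exponential(scale=1.2, size=200)
t_p = stats.ttest_ind(a, b).pvalue  # two-sample t-test
z_p = ztest(a, b)[1]                # z-test with normal approximation
print(t_p, z_p)  # for n this large, both p-values agree closely
```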

Along these lines, the normality assumption for the Pearson correlation coefficient is also commonly misstated in many, if not most, online and textbook sources as requiring the variables being correlated to be normally distributed. The X and Y variables do not need to be normally distributed; it is only required that the sampling distribution of r (or of the t statistic that is used to test the statistical significance of r) be normally distributed. The CLT can be employed again to justify using the Pearson correlation coefficient for large sample sizes.
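For example, scipy’s pearsonr returns both r and the p-value of the associated significance test (the data below are hypothetical):

```python
from scipy import stats

# Hypothetical paired observations; pearsonr returns the correlation
# coefficient r and the p-value of the test for r != 0.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 5, 8, 9]
r, p = stats.pearsonr(x, y)
print(r, p)
```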

Recent research has made progress in applying t statistics to data that deviates from normality (Hall and Wang 2004; Unnikrishnan and Huang 2016; Schröder and Yitzhaki 2018). However, critical researchers showed that normal distribution is rare in real life (Pocock 2011; Wilcox 2013; Wilcox and Rousselet 2018) and that tests for normality are problematic (Wright and Herrington 2011), while at the same time even small deviations from the normal distribution can result in failure of classic statistical tests. That is why Wilcox and others collect and introduce robust alternatives to classical parametric tests (Wilcox 2013; Wilcox 2016; Wilcox 2017; Nair and Sankaran 2009; Sankaran et al. 2016; Nair and Balakrishnan 2016; Nair et al. 2018; Wilcox and Rousselet 2018).

Tests are more or less robust to violations of the remaining requirements. Student’s t-test is reportedly robust to violations of the equality of variances. To compensate for deviations from the requirements, some recommend reducing the significance level (Lehman 2002). However, because tests for equality of variances often give wrong results and because normal distribution is rarely given, others recommend Welch’s t-test, which is the default behaviour of the function t.test() in R (Delacre et al. 2017).
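In Python, Welch’s t-test corresponds to scipy.stats.ttest_ind() with equal_var=False (a sketch with hypothetical samples of clearly unequal variances):

```python
from scipy import stats

# Hypothetical samples with clearly unequal variances.
a = [5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0]
b = [6.5, 4.0, 7.2, 3.1, 8.0, 2.9, 6.8]
res = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
print(res.statistic, res.pvalue)
```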

Analysis of Variance (ANOVA)

I provide an example for an analysis of variance (one-way ANOVA) using Python at https://gitlab.com/DataAnalysisByDerAndere/statisticalanalysisbyderandere/pystatisticalanalysis. The simplest form of ANOVA is a one-way ANOVA, which tests for differences in the means of a single dependent variable between two or more groups (also called “levels”) when the dependent variable depends on a single independent variable (also called “factor” or “predictor”) while all other factors are considered constant. For such purposes, raw data may exist in wide format (“pivoted”), i.e. in tables like this:

index  factor 1 (group A)  factor 1 (group B)  factor 1 (group C)
0      2                   3                   3
1      1                   3                   4
2      2                   2                   4
3      2                   3                   5
4      2                   3                   4

This data, stored in a comma-separated text file (CSV format), can be read into a pandas.DataFrame object using the “pandas” package and reshaped to produce data in long format (“un-pivoted”) using the melt() function from said package:

import pandas as pd
# read the CSV file containing the wide-format table shown above
df = pd.read_csv("data.csv")  # hypothetical file name
df.columns = ["A", "B", "C"]  # shorten the column names to the group labels
df_melt = pd.melt(df, var_name="group", value_name="value")
print(df_melt)

index group value
0 A 2
1 A 1
2 A 2
3 A 2
4 A 2
5 B 3
6 B 3
7 B 2
8 B 3
9 B 3
10 C 3
11 C 4
12 C 4
13 C 5
14 C 4

The ANOVA result is obtained as an object, and the ANOVA table summary is printed to the stdout stream with the following commands:

import statsmodels.api as sm
import patsy
# design matrices: y (dependent variable) and X (independent variables)
y, X = patsy.dmatrices('value ~ C(group)', data=df_melt, return_type='dataframe')
model = sm.OLS(y, X)
fitted_model = model.fit()
table = sm.stats.anova_lm(fitted_model, typ=2)
print(table)

The first argument passed to the anova_lm() function is a fitted model; the typ argument is the type of the sums of squares (SSQ) used. typ=2 means type II SSQ, which is appropriate if there is no interaction between independent variables. As with regression, an ordinary least squares model is almost always used for ANOVA; such a model can be generated with the ols() function. In fact, linear regression and the original ANOVA are closely related. When using the statsmodels module, the arguments of the functions that generate a model, such as OLS(), are a design matrix for the dependent variable and a design matrix for the independent variables. These design matrices can be created using the function dmatrices() from the package “patsy”. Its first argument is a string that specifies the design formula using a specific syntax derived from the notation introduced by Wilkinson and Rogers (1973). A similar syntax is also used for model formulas in R. The ~ separates the dependent variable on the left-hand side from the independent variables (factors, predictors) on the right-hand side. The function [C() marks some data as being categorical (including data which would not automatically be treated as categorical, such as a column of integers), while also optionally setting the preferred coding scheme (“contrasts”) and level ordering](https://patsy.readthedocs.io/en/latest/categorical-coding.html).

Alternatively, such a design formula can be passed directly as the first argument to the functions from the statsmodels.formula.api module that return a model, such as the ols() function. The second argument of the ols() function is a data argument:

import statsmodels.api as sm
import statsmodels.formula.api as smf
model2 = smf.ols('value ~ C(group)', data=df_melt)
fitted_model2 = model2.fit()
table2 = sm.stats.anova_lm(fitted_model2, typ=2)
print(table2)

Methods to add eta squared and omega squared (measures of effect size) to the output are easy to implement (Maszanski 2021):

def eta_squared(aov):
    # eta squared: SSQ of each effect divided by the total SSQ
    aov['eta_sq'] = aov[:-1]['sum_sq'] / sum(aov['sum_sq'])
    return aov

def omega_squared(aov):
    # omega squared: a less biased estimate of effect size
    mse = aov['sum_sq'].iloc[-1] / aov['df'].iloc[-1]
    aov['omega_sq'] = (aov[:-1]['sum_sq'] - (aov[:-1]['df'] * mse)) / (sum(aov['sum_sq']) + mse)
    return aov

eta_squared(table2)
omega_squared(table2)
print(table2)

For factorial ANOVA (e.g. two-way ANOVA), the design formula used to describe the model has to change. An excellent explanation is available online (Python for Data Science, LLC 2020). More examples for factorial ANOVA can be found in the statsmodels documentation. There exist three main strategies to calculate the sums of squares used during the ANOVA. Type I is the default in R and depends on the sequence of the independent variables (factors) in the model design formula. Type II is independent of the sequence and is best used if interaction plots do not indicate significant interaction between independent variables. For balanced designs, type I and type II should yield identical results. Type III is the default in commercial software like SPSS. To perform a type III ANOVA with the Python package statsmodels, the documentation mentions that one must not use non-orthogonal contrasts (such as Treatment coding). Python for Data Science also mentions that one must add “, Sum” as a second argument after the name of the factor within the function C() in the model design formula. Background: for factorial ANOVA, orthogonal contrasts have to be used. By default, R uses “traditional dummy coding” (options(contrasts = c("contr.treatment", "contr.poly")), which means “treatment contrasts” for unordered factors and “polynomial trend contrasts” for ordered factors). Similarly, the Python packages patsy and statsmodels use Treatment coding for categorical factors by default. Since these contrasts are non-orthogonal, the first level of a factor in the model’s design formula is used as the reference, and releveling changes the results. This is not appropriate for factorial ANOVA, at least when using type III sums of squares. Usually, factors should be set to use “sum-to-zero” contrasts (effects coding). The examples in the statsmodels documentation show one way to set orthogonal contrasts. Details can be found in the patsy API reference. More explanations and examples in R can be found in Maier (2015).

To analyze within-subject effects, e.g. in time series experiments, one should use repeated measures ANOVA (e.g. with statsmodels.stats.anova.AnovaRM()).
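A minimal sketch of a repeated measures ANOVA with AnovaRM (hypothetical long-format data with one measurement per subject and time point):

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: one value per subject and time point.
data = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "time":    [1, 2, 3] * 4,
    "value":   [2, 3, 4, 1, 3, 4, 2, 2, 4, 2, 3, 5],
})
res_rm = AnovaRM(data, depvar="value", subject="subject", within=["time"]).fit()
print(res_rm)  # ANOVA table with F and p value for the within-subject factor
```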

The Python package “pingouin” provides dedicated but less flexible functions for repeated measures ANOVA and ANCOVA. It also provides methods for Welch’s ANOVA for unbalanced data (unequal sample sizes) and/or unequal variances. For advanced purposes, Pymer4 is a Python package that wraps the R package “lme4” to generate (generalized) linear mixed models or non-linear models and bundles some more utility functions.

Use Levene’s test (scipy.stats.levene()) to check homogeneity of variances if the data do not follow a normal distribution.
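For example, with the three groups from the wide-format table above (the lists a, b and c hold the values of groups A, B and C):

```python
from scipy import stats

# Values of groups A, B and C from the wide-format table above.
a = [2, 1, 2, 2, 2]
b = [3, 3, 2, 3, 3]
c = [3, 4, 4, 5, 4]
# center="median" (the default) makes the test robust to non-normality
# (the Brown-Forsythe variant of Levene's test).
stat, p = stats.levene(a, b, c, center="median")
print(stat, p)
```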

Further reading

https://data-se.netlify.com/2016/08/18/multiple-t-tests-with-dplyr


Copyright 2018 - 2022 DerAndere