ANOVA
Published:
A tutorial of ANOVA analysis for treatments effects on metabolite abundance from animal studies.
Intro and model set-up
From my limited understanding on metabolomic profiles and some statistical experience related to metabolomic data analysis, it is not always necessary to conduct complicated pre-processing of the abundance data. Instead, any pre-processing should be justified ahead either from a purely biological perspective or an integrated statistical and biological perspective. For example, some metabolomic profiles after removing technical randomness can be appropriately assumed to follow a log-normal distribution. In this case, associating a random variable with log-normal distribution to the data is good enough. And the treatments effects are reflected on mean, while the variance amounts to uncontrolled random noise. Though, in practice, log-normal assumption may be invalid. To apply linear model, i.e. still assuming the profile is continuous, we can apply transformation. Of course, log-normal assumption involves log-transformation ($log(X)$) as well. Other types of transformation include, square ($X^2$), triple ($X^3$), inverse ($1/X$), square-root ($\sqrt{X}$), inverse-square-root ($1/\sqrt{X}$) and so on. The key assumption is after transformation, namely $g()$, $Y=g(X)$ follows normal distribution.
Now consider there are two factors/treatments/covariates/independent-variables under study, A=(M,F) (e.g. gender) and B=(0,10,30) (e.g. dose), and the response/dependent-variable/end-point is the profile of one specific metabolite. Assuming the random variable $X$ associated with the response variable is log-normally distributed, we can formally construct the following linear model.
\[log(X_{ijk})=Y_{ijk}=\mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk},\]where $Y_{ijk}$ is transformed response of the $k^{th}$ sample in $i^{th}$ gender and $j^{th}$ dose group, $\espilon_{ijk$ is i.i.d. random error. ($i=1$, Female, $i=2$, Male; $j=1$, 0 ml/kg, $j=2$, 10 ml/kg, $j=3$, 30 ml/kg)
Or cell mean model (different ways to write the model):
\[Y_{ijk}=\mu_{ij} + \epsilon_{ijk}.\]Some commonly interested questions and statistical formulation
Suppose the design is balanced and there is no missing data. If we have 4 replicates for each combination of gender and dose levels, then, in total, we have $2\times 3 \times 4=24$ animals/observations.
There are several contrast we would be interested. For example:
- Averaged over the two genders, does diet affect this metabolite profile? (Main diet effect)
- What is the contrast for this question?
- To check what R output to answer this question?
\(H_0: [\mu + \sum_{i}(\alpha_i+\gamma_{i1})/2+ \beta_1] =[\mu + \sum_{i}(\alpha_i+\gamma_{i2})/2+ \beta_2] = [\mu + \sum_{i}(\alpha_i+\gamma_{i3})/2+ \beta_3]\) \(H_0: \sum_i(\gamma_{i1})/2 + \beta_1=\sum_i(\gamma_{i2})/2 + \beta_2=\sum_i(\gamma_{i3})/2 + \beta_3\) \(\text{Or } H_0: \bar\mu_{.1}=\bar\mu_{.2}=\bar\mu_{.3}\)
- Does gender have similar effects on animals’ metabolism based on each diet? (Interaction effect)
- What is the contrast for this question?
- To check what R output to answer this question?
\(H_0: (\mu+\alpha_1 + \beta_1 + \gamma_{11}) - (\mu+\alpha_2 + \beta_1 + \gamma_{21})=(\mu+\alpha_1 + \beta_2 + \gamma_{12}) - (\mu+\alpha_2 + \beta_2 + \gamma_{22})=(\mu+\alpha_1 + \beta_3 + \gamma_{13}) - (\mu+\alpha_2 + \beta_3 + \gamma_{23})\) \(H_0: (\alpha_1 + \gamma_{11}) - (\alpha_2 + \gamma_{21})=(\alpha_1 + \gamma_{12}) - (\alpha_2 + \gamma_{22})=(\alpha_1 + \gamma_{13}) - (\alpha_2 + \gamma_{23})\) \(\text{Or } H_0: \mu_{11}-\mu_{21}=\mu_{12}-\mu_{22}=\mu_{13}-\mu_{23}\)
- Is there gender effect when the given diet has high dose of xxxxxx? (Simple gender effect, conditioning on high dose of xxxxxx)
- What is the contrast for this question?
- To check what R output to answer this question?
\(H_0: (\mu+\alpha_1 + \beta_3 + \gamma_{13}) - (\mu+\alpha_2 + \beta_3 + \gamma_{23})=0\) \(H_0: \alpha_1 + \gamma_{13} - \alpha_2 - \gamma_{23}=0\) \(\text{Or } H_0: \mu_{13}-\mu_{23}=0\)
There are more we can do, for example, pair-wise comparison post-hoc tests.
Model assumptions assesment
Once the linear model is fitted, we typically check residual plots including studentized-residual v.s. predictions plot and Q-Q plot. We need to check (Normality and equal-variance):
- Are residuals highly skewed? If they are skewed, Normality assumption may not be valid and we may want to try other transformation and/or pre-processing.
- Are residuals showing some patterns indicating either some within group variability un-explained or lack of fit? This issue is rare for two-way ANOVA in my opinion but it happened, check other transformation and/or try to rule out some extreme within group variation, try to check if there is any outliers etc.
- If there is not any remedy, non-parametric model would be recommended.