A tutorial for ANOVA analysis of treatments effects on metabolite abundance profile from animal studies.

Intro and model set-up

From my limited understanding on metabolite profile and some statistical experience related to metabolomics data analysis, it is not always necessary to conduct complicated pre-processing of the abundance data. Instead, any pre-processing should be justified ahead either from a purely biological perspective or an integrated statistical and biological perspective. For example, some metabolites’ profile, after controlling technical randomness, can be appropriately assumed to follow a log-normal distribution. In this case, associating a random variable with log-normal distribution to each metabolite profile is good enough. And treatment effects are reflected on mean, while the variance amounts to uncontrolled random noise. Though, in practice, log-normal assumption may not be satisfied. To apply linear model, we can apply transformation. Log-normal assumption involves log-transformation (\(log(X)\)) as well. Other types of transformation include, but not limited to, square (\(X^2\)), cubic (\(X^3\)), inverse (\(1/X\)), square-root (\(\sqrt{X}\)), inverse-square-root (\(1/\sqrt{X}\)) and so on. The key assumption is after transformation, namely \(g()\), \(Y=g(X)\) follows normal distribution.

Now consider there are two factors/treatments/covariates/independent-variables under study, for example, A=(M,F) (e.g. gender) and B=(0,10,30) (e.g. dose), and the response/dependent-variable/end-point is the abundance of one specific metabolite. Assuming the random variable \(X\) associated with the response variable is log-normally distributed, we can formally write the following linear model.

\[log(X_{ijk})=Y_{ijk}=\mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk},\]

where \(Y_{ijk}\) is transformed response of the \(k^{th}\) sample in \(i^{th}\) gender and \(j^{th}\) dose group, \(\epsilon_{ijk}\) is i.i.d. random error. (\(i=1\), Female, \(i=2\), Male; \(j=1\), 0 ml/kg, \(j=2\), 10 ml/kg, \(j=3\), 30 ml/kg)

Or cell mean model (different ways to write the model):

\[Y_{ijk}=\mu_{ij} + \epsilon_{ijk}.\]

Cell Mean Illustration

Some commonly interested questions and statistical formulation

Suppose the design is balanced and there is no missing data. If we have 4 replicates for each combination of gender and dose levels, then, in total, we have \(2\times 3 \times 4=24\) animals/observations.

There are several contrast we would be interested. For example:

  • Averaged over the two genders, does diet affect this metabolite profile? (Main diet effect)
    • What is the contrast for this question?
    • To check what R output to answer this question?

\[H_0: [\mu + \sum_{i}(\alpha_i+\gamma_{i1})/2+ \beta_1] =[\mu + \sum_{i}(\alpha_i+\gamma_{i2})/2+ \beta_2] = [\mu + \sum_{i}(\alpha_i+\gamma_{i3})/2+ \beta_3]\]

\[H_0: \sum_i(\gamma_{i1})/2 + \beta_1=\sum_i(\gamma_{i2})/2 + \beta_2=\sum_i(\gamma_{i3})/2 + \beta_3 \]

\[\text{Or } H_0: \bar\mu_{.1}=\bar\mu_{.2}=\bar\mu_{.3}\]

  • Does gender have similar effects on animals’ metabolism based on each diet? (Interaction effect)
    • What is the contrast for this question?
    • To check what R output to answer this question?

\[H_0: (\mu+\alpha_1 + \beta_1 + \gamma_{11}) - (\mu+\alpha_2 + \beta_1 + \gamma_{21})=(\mu+\alpha_1 + \beta_2 + \gamma_{12}) - (\mu+\alpha_2 + \beta_2 + \gamma_{22})=(\mu+\alpha_1 + \beta_3 + \gamma_{13}) - (\mu+\alpha_2 + \beta_3 + \gamma_{23}) \]

\[H_0: (\alpha_1 + \gamma_{11}) - (\alpha_2 + \gamma_{21})=(\alpha_1 + \gamma_{12}) - (\alpha_2 + \gamma_{22})=(\alpha_1 + \gamma_{13}) - (\alpha_2 + \gamma_{23}) \]

\[\text{Or } H_0: \mu_{11}-\mu_{21}=\mu_{12}-\mu_{22}=\mu_{13}-\mu_{23}\]

  • Is there gender effect when the given diet has high dose of xxxxxx? (Simple gender effect, conditioning on high dose of xxxxxx)
    • What is the contrast for this question?
    • To check what R output to answer this question?

\[H_0: (\mu+\alpha_1 + \beta_3 + \gamma_{13}) - (\mu+\alpha_2 + \beta_3 + \gamma_{23})=0\]

\[H_0: \alpha_1 + \gamma_{13} - \alpha_2 - \gamma_{23}=0\]

\[\text{Or } H_0: \mu_{13}-\mu_{23}=0\]

There are more we can do, for example, pair-wise comparison post-hoc tests.

Model assumptions assesment

Once the linear model is fitted, we typically check residual plots including studentized-residual v.s. predictions plot and Q-Q plot. We need to check (Normality and equal-variance):

  • Are residuals highly skewed? If they are skewed, Normality assumption may not be valid and we may want to try other transformation and/or pre-processing.
  • Are residuals showing some patterns indicating either some within group variability un-explained or lack of fit? This issue is rare for two-way ANOVA in my opinion but it happened, check other transformation and/or try to rule out some extreme within group variation, try to check if there is any outliers etc.
  • If there is not any remedy, non-parametric model would be recommended.

Example 1 with R code

(Revised assignment solutions from STAT471/571 Spring 2021)

The data came from some of Dr. Philip Dixon’s ecological research. He was interested in the distribution of plants in alpine tundra. In one part of his work, he excavated small individuals of two species (Potentilla nivea, which is a small plant that naturally occurs only in dry ridgetops, and Potentilla gracilis, which is a larger plant that naturally occurs only in moist sites) and transplanted them to new sites (Dry, Mid, Wet) (without competitors, for the ecologists). This experiment was repeated in two places with similar vegetation but 100 miles apart (Cumberland Pass and Pennsylvania Mtn.). For now, we will only consider the data from one site, Cumberland Pass. The two species are: Potentilla gracillis and Potentilla nivea. He started out with 10 reps per species per site per experiment, but some seedlings died, so the number that he could harvest three years later varied between 5 and 10. The response he measured was the dry mass of this year’s growth (in mg). Because the within-group variance varied considerably between groups, he log transformed the biomass. This has already been done for you. The variables in the data file are: Experiment, Species, Site, and log biomass. (Experiment can be ignored.)

# Load packages
library(car)
library(emmeans)
library(ggpubr)
library(tidyverse)

# read data 
data_path="path to your data file"
dat=read.table(paste0(data_path,"/data_file_name.txt"),header=T)
str(dat)

Fit two-way ANOVA model

Two_Way_ANOVA = lm(lnwt ~ site*spp, data = dat, 
                   contrasts = list(spp="contr.sum", site="contr.sum"))
Anova(Two_Way_ANOVA, type = 3)
## Anova Table (Type III tests)
## 
## Response: lnwt
##             Sum Sq Df   F value    Pr(>F)    
## (Intercept) 942.80  1 1714.9526 < 2.2e-16 ***
## site          0.77  2    0.6979  0.502085    
## spp           6.63  1   12.0540  0.001025 ** 
## site:spp     35.01  2   31.8393 7.342e-10 ***
## Residuals    29.69 54                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Averaged over the two species, did site affect the growth?

According to the ANOVA table, the main effect of site on growth is not statistically significant with a p-value of about 0.5.

Did site have similar effects on the growth of each species?

Since there is strong evidence that species and site interact (p-value of 7.34e-10), the site effect differs depending on species.

Plot the cell means (for each combination of species and site).

Use this plot and the test results to answer the two biological questions.

At Cumberland Pass, the estimated mean level of the log year’s growth for Potentilla gracilis was in the order of increasing wetness of the site. While for Potentilla nivea, the growth decreased as they were transplanted to wetter sites. Marginally, the site main effect is not significant, likely due to the very strong effect of interaction between Site and Species. (This also provides evidence to check simple effects.)

Checking simple effects

#by Species
siteByspp = emmeans(Two_Way_ANOVA, ~site|spp)
siteByspp # cell means
## spp = grac:
##  site emmean    SE df lower.CL upper.CL
##  Dry    3.10 0.234 54     2.63     3.57
##  Mid    4.66 0.234 54     4.19     5.13
##  Wet    5.13 0.234 54     4.66     5.60
## 
## spp = nive:
##  site emmean    SE df lower.CL upper.CL
##  Dry    4.51 0.234 54     4.04     4.98
##  Mid    3.49 0.234 54     3.02     3.96
##  Wet    2.90 0.234 54     2.43     3.37
## 
## Confidence level used: 0.95
pairs(siteByspp)
## spp = grac:
##  contrast  estimate    SE df t.ratio p.value
##  Dry - Mid   -1.555 0.332 54  -4.690  0.0001
##  Dry - Wet   -2.022 0.332 54  -6.098  <.0001
##  Mid - Wet   -0.467 0.332 54  -1.408  0.3438
## 
## spp = nive:
##  contrast  estimate    SE df t.ratio p.value
##  Dry - Mid    1.026 0.332 54   3.094  0.0086
##  Dry - Wet    1.615 0.332 54   4.871  <.0001
##  Mid - Wet    0.589 0.332 54   1.776  0.1872
## 
## P value adjustment: tukey method for comparing a family of 3 estimates
contrast(siteByspp, list("Wet - Dry"=c(-1,0,1))) # manually 
## spp = grac:
##  contrast  estimate    SE df t.ratio p.value
##  Wet - Dry     2.02 0.332 54   6.098  <.0001
## 
## spp = nive:
##  contrast  estimate    SE df t.ratio p.value
##  Wet - Dry    -1.61 0.332 54  -4.871  <.0001
#by Sites
sppBysite = emmeans(Two_Way_ANOVA, ~spp|site)
sppBysite # cell means
## site = Dry:
##  spp  emmean    SE df lower.CL upper.CL
##  grac   3.10 0.234 54     2.63     3.57
##  nive   4.51 0.234 54     4.04     4.98
## 
## site = Mid:
##  spp  emmean    SE df lower.CL upper.CL
##  grac   4.66 0.234 54     4.19     5.13
##  nive   3.49 0.234 54     3.02     3.96
## 
## site = Wet:
##  spp  emmean    SE df lower.CL upper.CL
##  grac   5.13 0.234 54     4.66     5.60
##  nive   2.90 0.234 54     2.43     3.37
## 
## Confidence level used: 0.95
pairs(sppBysite) 
## site = Dry:
##  contrast    estimate    SE df t.ratio p.value
##  grac - nive    -1.41 0.332 54  -4.246  0.0001
## 
## site = Mid:
##  contrast    estimate    SE df t.ratio p.value
##  grac - nive     1.17 0.332 54   3.538  0.0008
## 
## site = Wet:
##  contrast    estimate    SE df t.ratio p.value
##  grac - nive     2.23 0.332 54   6.722  <.0001
contrast(sppBysite, list("nive - grac"=c(-1,1))) # manually 
## site = Dry:
##  contrast    estimate    SE df t.ratio p.value
##  nive - grac     1.41 0.332 54   4.246  0.0001
## 
## site = Mid:
##  contrast    estimate    SE df t.ratio p.value
##  nive - grac    -1.17 0.332 54  -3.538  0.0008
## 
## site = Wet:
##  contrast    estimate    SE df t.ratio p.value
##  nive - grac    -2.23 0.332 54  -6.722  <.0001
#pwpp(sppBysite) # simple effect

For Potentilla gracilis, the estimated log growth at wet site is siginficant larger than dry site. For Potentilla nivea, there are significant decreases in the log growth at both mid and wet sites, compared to the dry site.

Run pairwise comparisons of each of the treatment factors

emm.Site=emmeans(Two_Way_ANOVA, ~site) # marginal effect
emm.Site
##  site emmean    SE df lower.CL upper.CL
##  Dry    3.81 0.166 54     3.48     4.14
##  Mid    4.07 0.166 54     3.74     4.40
##  Wet    4.01 0.166 54     3.68     4.34
## 
## Results are averaged over the levels of: spp 
## Confidence level used: 0.95
pwpp(emm.Site) + theme_light(base_size = 14)

At the significant level of 0.05, none are significant.

Model assumptions

Residuals v.s. fitted

Several different ways to plot residuals.

# raw residuals
ggplot(Two_Way_ANOVA, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0)

# Studentized residuals
residualPlots(Two_Way_ANOVA, type="rstudent")

Q-Q plot

ggplot(Two_Way_ANOVA, aes(sample = rstandard(Two_Way_ANOVA))) + geom_qq() + stat_qq_line()+ylab("Studentized Residuals Qunatiles") + xlab("Normal Qunatiles")

Conclusion

Both plots suggest the model fits the data well, we don’t have much concerns.

Example 2 with R code, metabolomic data

Data is collected from a project about sorghum growth, investigated by Dr. Peng Liu and others. Root tissues were collected on several dates for transcript and metabolite profiling. Two factors are potentially of interest, genotype and date.

Data Pre-processing: log transformed, pareto scaled. It is not clear how the transformation and a scaling method influence each other with regard to the complex metabolomics data, while we used ranked based method before so such pre-processing does not matter much. And applying scaling on the data as whole or on subsets of the data separately depends on the question of interest.

Fit model and check assumptions

head(dat[,c(1:5)])
##   Genotype DateNum      C001       C002       C003
## 1      E14  Date 2 -1.351642 -0.8916674 -1.0232630
## 2      E14  Date 2 -1.028037 -0.9230639 -0.4811451
## 3      E14  Date 2 -2.551445 -1.7325267 -2.2414555
## 4      E14  Date 2 -1.349161 -0.5770386 -1.6047841
## 5      E14  Date 2 -1.706522 -0.6796374 -1.5688942
## 6      E14  Date 1 -1.764073 -0.8721830 -1.5267434
# fit model
lm_mod=lm(C019~Genotype*DateNum, data=dat)
# check assumptions
residualPlots(lm_mod, type="rstudent")

ggplot(lm_mod, aes(sample = rstandard(lm_mod))) + geom_qq() + stat_qq_line()+ylab("Studentized Residuals Qunatiles") + xlab("Normal Qunatiles")

ANOVA table

Anova(lm_mod, type = 3)
## Anova Table (Type III tests)
## 
## Response: C019
##                  Sum Sq Df F value Pr(>F)
## (Intercept)       2.168  1  1.9550 0.1696
## Genotype          0.161  2  0.0726 0.9301
## DateNum           0.220  2  0.0992 0.9058
## Genotype:DateNum  0.595  4  0.1342 0.9689
## Residuals        45.458 41

Integrative analysis

dim(dat) # 50 samples, 343-2 metabolites
## [1]  50 343
# get Metabolites column number
OTUs=c(3:343)
list_out=lapply(OTUs, function(otu){
  lm_temp=lm(dat[,otu]~ dat$Genotype*dat$DateNum )
  aov_temp=Anova(lm_temp, type=3)
  results=aov_temp$'Pr(>F)'[2:4]
  names(results)=c("Genotype","Date","Interaction")
  results
})
pvals_table=bind_rows(list_out)
head(pvals_table)
## # A tibble: 6 × 3
##   Genotype  Date Interaction
##      <dbl> <dbl>       <dbl>
## 1    0.853 0.378       0.998
## 2    0.317 0.998       0.683
## 3    0.818 0.351       0.976
## 4    0.706 0.736       0.946
## 5    0.210 0.264       0.671
## 6    0.233 0.807       0.553
qvals_table=as.data.frame(apply(pvals_table, MARGIN=2, FUN = function(pv){p.adjust(pv,"BH")}))
colnames(qvals_table)=colnames(pvals_table)
head(qvals_table)
##    Genotype      Date Interaction
## 1 0.9750494 0.9982908   0.9984227
## 2 0.9750494 0.9982908   0.9984227
## 3 0.9750494 0.9982908   0.9984227
## 4 0.9750494 0.9982908   0.9984227
## 5 0.9750494 0.9982908   0.9984227
## 6 0.9750494 0.9982908   0.9984227
OTUs_names=colnames(dat)[-c(1:2)]
print(paste0("OTUs show differential abundance over genotype: ", OTUs_names[which(qvals_table$Genotype<=0.05)],", FDR controlled at 0.05."))
## [1] "OTUs show differential abundance over genotype: C0007_17, FDR controlled at 0.05."
## [2] "OTUs show differential abundance over genotype: C0277_17, FDR controlled at 0.05."
## [3] "OTUs show differential abundance over genotype: C0390_17, FDR controlled at 0.05."
## [4] "OTUs show differential abundance over genotype: C0602_17, FDR controlled at 0.05."
## [5] "OTUs show differential abundance over genotype: C0840_17, FDR controlled at 0.05."
## [6] "OTUs show differential abundance over genotype: C1747_17, FDR controlled at 0.05."
## [7] "OTUs show differential abundance over genotype: C2083_17, FDR controlled at 0.05."