A tutorial for ANOVA analysis of treatments effects on metabolite abundance profile from animal studies.
From my limited understanding on metabolite profile and some statistical experience related to metabolomics data analysis, it is not always necessary to conduct complicated pre-processing of the abundance data. Instead, any pre-processing should be justified ahead either from a purely biological perspective or an integrated statistical and biological perspective. For example, some metabolites’ profile, after controlling technical randomness, can be appropriately assumed to follow a log-normal distribution. In this case, associating a random variable with log-normal distribution to each metabolite profile is good enough. And treatment effects are reflected on mean, while the variance amounts to uncontrolled random noise. Though, in practice, log-normal assumption may not be satisfied. To apply linear model, we can apply transformation. Log-normal assumption involves log-transformation (\(log(X)\)) as well. Other types of transformation include, but not limited to, square (\(X^2\)), cubic (\(X^3\)), inverse (\(1/X\)), square-root (\(\sqrt{X}\)), inverse-square-root (\(1/\sqrt{X}\)) and so on. The key assumption is after transformation, namely \(g()\), \(Y=g(X)\) follows normal distribution.
Now consider there are two factors/treatments/covariates/independent-variables under study, for example, A=(M,F) (e.g. gender) and B=(0,10,30) (e.g. dose), and the response/dependent-variable/end-point is the abundance of one specific metabolite. Assuming the random variable \(X\) associated with the response variable is log-normally distributed, we can formally write the following linear model.
\[log(X_{ijk})=Y_{ijk}=\mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk},\]
where \(Y_{ijk}\) is transformed response of the \(k^{th}\) sample in \(i^{th}\) gender and \(j^{th}\) dose group, \(\epsilon_{ijk}\) is i.i.d. random error. (\(i=1\), Female, \(i=2\), Male; \(j=1\), 0 ml/kg, \(j=2\), 10 ml/kg, \(j=3\), 30 ml/kg)
Or cell mean model (different ways to write the model):
\[Y_{ijk}=\mu_{ij} + \epsilon_{ijk}.\]
Suppose the design is balanced and there is no missing data. If we have 4 replicates for each combination of gender and dose levels, then, in total, we have \(2\times 3 \times 4=24\) animals/observations.
There are several contrast we would be interested. For example:
\[H_0: [\mu + \sum_{i}(\alpha_i+\gamma_{i1})/2+ \beta_1] =[\mu + \sum_{i}(\alpha_i+\gamma_{i2})/2+ \beta_2] = [\mu + \sum_{i}(\alpha_i+\gamma_{i3})/2+ \beta_3]\]
\[H_0: \sum_i(\gamma_{i1})/2 + \beta_1=\sum_i(\gamma_{i2})/2 + \beta_2=\sum_i(\gamma_{i3})/2 + \beta_3 \]
\[\text{Or } H_0: \bar\mu_{.1}=\bar\mu_{.2}=\bar\mu_{.3}\]
\[H_0: (\mu+\alpha_1 + \beta_1 + \gamma_{11}) - (\mu+\alpha_2 + \beta_1 + \gamma_{21})=(\mu+\alpha_1 + \beta_2 + \gamma_{12}) - (\mu+\alpha_2 + \beta_2 + \gamma_{22})=(\mu+\alpha_1 + \beta_3 + \gamma_{13}) - (\mu+\alpha_2 + \beta_3 + \gamma_{23}) \]
\[H_0: (\alpha_1 + \gamma_{11}) - (\alpha_2 + \gamma_{21})=(\alpha_1 + \gamma_{12}) - (\alpha_2 + \gamma_{22})=(\alpha_1 + \gamma_{13}) - (\alpha_2 + \gamma_{23}) \]
\[\text{Or } H_0: \mu_{11}-\mu_{21}=\mu_{12}-\mu_{22}=\mu_{13}-\mu_{23}\]
\[H_0: (\mu+\alpha_1 + \beta_3 + \gamma_{13}) - (\mu+\alpha_2 + \beta_3 + \gamma_{23})=0\]
\[H_0: \alpha_1 + \gamma_{13} - \alpha_2 - \gamma_{23}=0\]
\[\text{Or } H_0: \mu_{13}-\mu_{23}=0\]
There are more we can do, for example, pair-wise comparison post-hoc tests.
Once the linear model is fitted, we typically check residual plots including studentized-residual v.s. predictions plot and Q-Q plot. We need to check (Normality and equal-variance):
(Revised assignment solutions from STAT471/571 Spring 2021)
The data came from some of Dr. Philip Dixon’s ecological research. He was interested in the distribution of plants in alpine tundra. In one part of his work, he excavated small individuals of two species (Potentilla nivea, which is a small plant that naturally occurs only in dry ridgetops, and Potentilla gracilis, which is a larger plant that naturally occurs only in moist sites) and transplanted them to new sites (Dry, Mid, Wet) (without competitors, for the ecologists). This experiment was repeated in two places with similar vegetation but 100 miles apart (Cumberland Pass and Pennsylvania Mtn.). For now, we will only consider the data from one site, Cumberland Pass. The two species are: Potentilla gracillis and Potentilla nivea. He started out with 10 reps per species per site per experiment, but some seedlings died, so the number that he could harvest three years later varied between 5 and 10. The response he measured was the dry mass of this year’s growth (in mg). Because the within-group variance varied considerably between groups, he log transformed the biomass. This has already been done for you. The variables in the data file are: Experiment, Species, Site, and log biomass. (Experiment can be ignored.)
# Load packages
library(car)
library(emmeans)
library(ggpubr)
library(tidyverse)
# read data
data_path="path to your data file"
dat=read.table(paste0(data_path,"/data_file_name.txt"),header=T)
str(dat)
Two_Way_ANOVA = lm(lnwt ~ site*spp, data = dat,
contrasts = list(spp="contr.sum", site="contr.sum"))
Anova(Two_Way_ANOVA, type = 3)
## Anova Table (Type III tests)
##
## Response: lnwt
## Sum Sq Df F value Pr(>F)
## (Intercept) 942.80 1 1714.9526 < 2.2e-16 ***
## site 0.77 2 0.6979 0.502085
## spp 6.63 1 12.0540 0.001025 **
## site:spp 35.01 2 31.8393 7.342e-10 ***
## Residuals 29.69 54
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the ANOVA table, the main effect of site on growth is not statistically significant with a p-value of about 0.5.
Since there is strong evidence that species and site interact (p-value of 7.34e-10), the site effect differs depending on species.
Use this plot and the test results to answer the two biological questions.
At Cumberland Pass, the estimated mean level of the log year’s growth for Potentilla gracilis was in the order of increasing wetness of the site. While for Potentilla nivea, the growth decreased as they were transplanted to wetter sites. Marginally, the site main effect is not significant, likely due to the very strong effect of interaction between Site and Species. (This also provides evidence to check simple effects.)
#by Species
siteByspp = emmeans(Two_Way_ANOVA, ~site|spp)
siteByspp # cell means
## spp = grac:
## site emmean SE df lower.CL upper.CL
## Dry 3.10 0.234 54 2.63 3.57
## Mid 4.66 0.234 54 4.19 5.13
## Wet 5.13 0.234 54 4.66 5.60
##
## spp = nive:
## site emmean SE df lower.CL upper.CL
## Dry 4.51 0.234 54 4.04 4.98
## Mid 3.49 0.234 54 3.02 3.96
## Wet 2.90 0.234 54 2.43 3.37
##
## Confidence level used: 0.95
pairs(siteByspp)
## spp = grac:
## contrast estimate SE df t.ratio p.value
## Dry - Mid -1.555 0.332 54 -4.690 0.0001
## Dry - Wet -2.022 0.332 54 -6.098 <.0001
## Mid - Wet -0.467 0.332 54 -1.408 0.3438
##
## spp = nive:
## contrast estimate SE df t.ratio p.value
## Dry - Mid 1.026 0.332 54 3.094 0.0086
## Dry - Wet 1.615 0.332 54 4.871 <.0001
## Mid - Wet 0.589 0.332 54 1.776 0.1872
##
## P value adjustment: tukey method for comparing a family of 3 estimates
contrast(siteByspp, list("Wet - Dry"=c(-1,0,1))) # manually
## spp = grac:
## contrast estimate SE df t.ratio p.value
## Wet - Dry 2.02 0.332 54 6.098 <.0001
##
## spp = nive:
## contrast estimate SE df t.ratio p.value
## Wet - Dry -1.61 0.332 54 -4.871 <.0001
#by Sites
sppBysite = emmeans(Two_Way_ANOVA, ~spp|site)
sppBysite # cell means
## site = Dry:
## spp emmean SE df lower.CL upper.CL
## grac 3.10 0.234 54 2.63 3.57
## nive 4.51 0.234 54 4.04 4.98
##
## site = Mid:
## spp emmean SE df lower.CL upper.CL
## grac 4.66 0.234 54 4.19 5.13
## nive 3.49 0.234 54 3.02 3.96
##
## site = Wet:
## spp emmean SE df lower.CL upper.CL
## grac 5.13 0.234 54 4.66 5.60
## nive 2.90 0.234 54 2.43 3.37
##
## Confidence level used: 0.95
pairs(sppBysite)
## site = Dry:
## contrast estimate SE df t.ratio p.value
## grac - nive -1.41 0.332 54 -4.246 0.0001
##
## site = Mid:
## contrast estimate SE df t.ratio p.value
## grac - nive 1.17 0.332 54 3.538 0.0008
##
## site = Wet:
## contrast estimate SE df t.ratio p.value
## grac - nive 2.23 0.332 54 6.722 <.0001
contrast(sppBysite, list("nive - grac"=c(-1,1))) # manually
## site = Dry:
## contrast estimate SE df t.ratio p.value
## nive - grac 1.41 0.332 54 4.246 0.0001
##
## site = Mid:
## contrast estimate SE df t.ratio p.value
## nive - grac -1.17 0.332 54 -3.538 0.0008
##
## site = Wet:
## contrast estimate SE df t.ratio p.value
## nive - grac -2.23 0.332 54 -6.722 <.0001
#pwpp(sppBysite) # simple effect
For Potentilla gracilis, the estimated log growth at wet site is siginficant larger than dry site. For Potentilla nivea, there are significant decreases in the log growth at both mid and wet sites, compared to the dry site.
emm.Site=emmeans(Two_Way_ANOVA, ~site) # marginal effect
emm.Site
## site emmean SE df lower.CL upper.CL
## Dry 3.81 0.166 54 3.48 4.14
## Mid 4.07 0.166 54 3.74 4.40
## Wet 4.01 0.166 54 3.68 4.34
##
## Results are averaged over the levels of: spp
## Confidence level used: 0.95
pwpp(emm.Site) + theme_light(base_size = 14)
At the significant level of 0.05, none are significant.
Several different ways to plot residuals.
# raw residuals
ggplot(Two_Way_ANOVA, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0)
# Studentized residuals
residualPlots(Two_Way_ANOVA, type="rstudent")
ggplot(Two_Way_ANOVA, aes(sample = rstandard(Two_Way_ANOVA))) + geom_qq() + stat_qq_line()+ylab("Studentized Residuals Qunatiles") + xlab("Normal Qunatiles")
Both plots suggest the model fits the data well, we don’t have much concerns.
Data is collected from a project about sorghum growth, investigated by Dr. Peng Liu and others. Root tissues were collected on several dates for transcript and metabolite profiling. Two factors are potentially of interest, genotype and date.
Data Pre-processing: log transformed, pareto scaled. It is not clear how the transformation and a scaling method influence each other with regard to the complex metabolomics data, while we used ranked based method before so such pre-processing does not matter much. And applying scaling on the data as whole or on subsets of the data separately depends on the question of interest.
head(dat[,c(1:5)])
## Genotype DateNum C001 C002 C003
## 1 E14 Date 2 -1.351642 -0.8916674 -1.0232630
## 2 E14 Date 2 -1.028037 -0.9230639 -0.4811451
## 3 E14 Date 2 -2.551445 -1.7325267 -2.2414555
## 4 E14 Date 2 -1.349161 -0.5770386 -1.6047841
## 5 E14 Date 2 -1.706522 -0.6796374 -1.5688942
## 6 E14 Date 1 -1.764073 -0.8721830 -1.5267434
# fit model
lm_mod=lm(C019~Genotype*DateNum, data=dat)
# check assumptions
residualPlots(lm_mod, type="rstudent")
ggplot(lm_mod, aes(sample = rstandard(lm_mod))) + geom_qq() + stat_qq_line()+ylab("Studentized Residuals Qunatiles") + xlab("Normal Qunatiles")
Anova(lm_mod, type = 3)
## Anova Table (Type III tests)
##
## Response: C019
## Sum Sq Df F value Pr(>F)
## (Intercept) 2.168 1 1.9550 0.1696
## Genotype 0.161 2 0.0726 0.9301
## DateNum 0.220 2 0.0992 0.9058
## Genotype:DateNum 0.595 4 0.1342 0.9689
## Residuals 45.458 41
dim(dat) # 50 samples, 343-2 metabolites
## [1] 50 343
# get Metabolites column number
OTUs=c(3:343)
list_out=lapply(OTUs, function(otu){
lm_temp=lm(dat[,otu]~ dat$Genotype*dat$DateNum )
aov_temp=Anova(lm_temp, type=3)
results=aov_temp$'Pr(>F)'[2:4]
names(results)=c("Genotype","Date","Interaction")
results
})
pvals_table=bind_rows(list_out)
head(pvals_table)
## # A tibble: 6 × 3
## Genotype Date Interaction
## <dbl> <dbl> <dbl>
## 1 0.853 0.378 0.998
## 2 0.317 0.998 0.683
## 3 0.818 0.351 0.976
## 4 0.706 0.736 0.946
## 5 0.210 0.264 0.671
## 6 0.233 0.807 0.553
qvals_table=as.data.frame(apply(pvals_table, MARGIN=2, FUN = function(pv){p.adjust(pv,"BH")}))
colnames(qvals_table)=colnames(pvals_table)
head(qvals_table)
## Genotype Date Interaction
## 1 0.9750494 0.9982908 0.9984227
## 2 0.9750494 0.9982908 0.9984227
## 3 0.9750494 0.9982908 0.9984227
## 4 0.9750494 0.9982908 0.9984227
## 5 0.9750494 0.9982908 0.9984227
## 6 0.9750494 0.9982908 0.9984227
OTUs_names=colnames(dat)[-c(1:2)]
print(paste0("OTUs show differential abundance over genotype: ", OTUs_names[which(qvals_table$Genotype<=0.05)],", FDR controlled at 0.05."))
## [1] "OTUs show differential abundance over genotype: C0007_17, FDR controlled at 0.05."
## [2] "OTUs show differential abundance over genotype: C0277_17, FDR controlled at 0.05."
## [3] "OTUs show differential abundance over genotype: C0390_17, FDR controlled at 0.05."
## [4] "OTUs show differential abundance over genotype: C0602_17, FDR controlled at 0.05."
## [5] "OTUs show differential abundance over genotype: C0840_17, FDR controlled at 0.05."
## [6] "OTUs show differential abundance over genotype: C1747_17, FDR controlled at 0.05."
## [7] "OTUs show differential abundance over genotype: C2083_17, FDR controlled at 0.05."