R语言anova进阶

KJY / 2021-05-17


R语言anova进阶

参考:

https://www.r-bloggers.com/2022/05/one-way-anova-example-in-r-quick-guide/

https://www.r-bloggers.com/2022/05/two-way-anova-example-in-r-quick-guide/

one way anova

One way ANOVA Example in R, the one-way analysis of variance (ANOVA), also known as one-factor ANOVA, is an extension of the independent two-sample t-test for comparing means when more than two groups are present.

You can use the t-test if you just have two groups. The F-test and the t-test are equivalent in this scenario.

Assume we have three groups to compare (A, B, and C):

Calculate the common variance, often known as residual variance or variance within samples (S2within).

首先计算的是common variance,residual variance

Calculate the difference in sample means as follows:

Calculate the average of each group.

Calculate the difference in sample means (S2between)

As the ratio of S2between/S2within, calculate the F-statistic.

原来F测试是测量组间和组内差异的

Note that a lower ratio (ratio 1) suggests that the means of the samples being compared are not significantly different.

A greater ratio, on the other hand, indicates that the differences in group means are significant.

data <- PlantGrowth

set.seed(123)
dplyr::sample_n(data, 10)
##    weight group
## 1    5.87  trt1
## 2    4.32  trt1
## 3    3.59  trt1
## 4    5.18  ctrl
## 5    5.14  ctrl
## 6    4.89  trt1
## 7    5.12  trt2
## 8    4.81  trt1
## 9    4.50  ctrl
## 10   4.69  trt1
levels(data$group)
## [1] "ctrl" "trt1" "trt2"
data$group <- ordered(data$group,
                         levels = c("ctrl", "trt1", "trt2"))
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
group_by(data, group) %>%
  summarise(
    count = n(),
    mean = mean(weight, na.rm = TRUE),
    sd = sd(weight, na.rm = TRUE)
  )
## # A tibble: 3 x 4
##   group count  mean    sd
##   <ord> <int> <dbl> <dbl>
## 1 ctrl     10  5.03 0.583
## 2 trt1     10  4.66 0.794
## 3 trt2     10  5.53 0.443

可视化数据

library("ggpubr")
## Loading required package: ggplot2
ggboxplot(data, x = "group", y = "weight",
          color = "group", palette = c("#00AFBB", "#E7B800", "#FC4E07"),
          order = c("ctrl", "trt1", "trt2"),
          ylab = "Weight", xlab = "Treatment")

很方便的加上mean和se

library("ggpubr")
ggline(data, x = "group", y = "weight",
       add = c("mean_se", "jitter"),
       order = c("ctrl", "trt1", "trt2"),
       ylab = "Weight", xlab = "Treatment")

进行anova测试

res.aov <- aov(weight ~ group, data = data)

summary(res.aov)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        2  3.766  1.8832   4.846 0.0159 *
## Residuals   27 10.492  0.3886                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The columns F value and Pr(>F) in the output corresponding to the p-value of the test.

The results of one-way ANOVA testing should be interpreted.

We can conclude that there are significant differences between the groups highlighted with “*” in the model summary because the p-value is less than the significance level of 0.05.

但是这只能告诉你他们之间有区别,但是没有告诉你具体是那两组之间有区别

We can compute Tukey HSD (Tukey Honest Significant Differences, R function: TukeyHSD()) for doing numerous pairwise comparisons between the means of groups because the ANOVA test is significant.

TukeyHSD(res.aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ group, data = data)
## 
## $group
##             diff        lwr       upr     p adj
## trt1-ctrl -0.371 -1.0622161 0.3202161 0.3908711
## trt2-ctrl  0.494 -0.1972161 1.1852161 0.1979960
## trt2-trt1  0.865  0.1737839 1.5562161 0.0120064

lwr, upr: the lower and upper-end points of the confidence interval at 95 percent (default) p adj: p-value after multiple comparisons adjustment.

Only the difference between trt2 and trt1 is significant, as shown by the output, with an adjusted p-value of 0.012.

The function glht() [in the multcomp package] can be used to do multiple comparison processes for an ANOVA. General linear hypothesis tests are abbreviated as glht. The following is a simplified format:

glht(model, lincft)

model: a model that has been fitted, such as an object returned by aov ().

lincft() specifies the linear hypotheses that will be tested. Objects provided from the function mcp are used to specify multiple comparisons in ANOVA models ().

library(multcomp)
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
## 
##     geyser
summary(glht(res.aov, linfct = mcp(group = "Tukey")))
## 
##   Simultaneous Tests for General Linear Hypotheses
## 
## Multiple Comparisons of Means: Tukey Contrasts
## 
## 
## Fit: aov(formula = weight ~ group, data = data)
## 
## Linear Hypotheses:
##                  Estimate Std. Error t value Pr(>|t|)  
## trt1 - ctrl == 0  -0.3710     0.2788  -1.331   0.3908  
## trt2 - ctrl == 0   0.4940     0.2788   1.772   0.1979  
## trt2 - trt1 == 0   0.8650     0.2788   3.103   0.0121 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Adjusted p values reported -- single-step method)

pairwise.t.test的结果和anova的结果还是不一样的。

pairwise.t.test(data$weight, data$group,
                 p.adjust.method = "BH")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$weight and data$group 
## 
##      ctrl  trt1 
## trt1 0.194 -    
## trt2 0.132 0.013
## 
## P value adjustment method: BH

The ANOVA test assumes that the data are normally distributed and that group variance is uniform. With certain diagnostic plots, we can verify this.

group variance应该是一样的

plot(res.aov, 1)

two way anova

two-way ANOVA test is used to compare the effects of two grouping variables (A and B) on a response variable at the same time.

two-way ANOVA的假设:

There is no difference in factor A’s means.

There is no difference in factor B’s means.

Factors A and B do not interact in any way.

two-way ANOVA assumes that the observations within each cell are normally distributed with equal variances.

data <- ToothGrowth

str(data)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

R treats “dose” as a numeric variable based on the output. As follows, we’ll transform it to a factor variable (i.e., grouping variable).

data$dose <- factor(data$dose,
                  levels = c(0.5, 1, 2),
                  labels = c("D0.5", "D1", "D2"))
str(data)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: Factor w/ 3 levels "D0.5","D1","D2": 1 1 1 1 1 1 1 1 1 1 ...

Make frequency tables by:

table(data$supp,data$dose)
##     
##      D0.5 D1 D2
##   OJ   10 10 10
##   VC   10 10 10
sample_n(data, 10)
##     len supp dose
## 1  25.8   OJ   D1
## 2  32.5   VC   D2
## 3  15.2   OJ D0.5
## 4  17.3   VC   D1
## 5  26.4   OJ   D2
## 6  29.5   VC   D2
## 7  10.0   VC D0.5
## 8  23.6   OJ   D1
## 9  11.2   VC D0.5
## 10 18.5   VC   D2
ggboxplot(data, x = "dose", y = "len", color = "supp",
          palette = c("#00AFBB", "#E7B800"))

ggline(data, x = "dose", y = "len", color = "supp",
       add = c("mean_se", "dotplot"),
       palette = c("#00AFBB", "#E7B800"))
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

res.aov2 <- aov(len ~ supp + dose, data = data)
summary(res.aov2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4   14.02 0.000429 ***
## dose         2 2426.4  1213.2   82.81  < 2e-16 ***
## Residuals   56  820.4    14.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We may deduce from the ANOVA table that both supp and dose are statistically significant. The most important factor variable is dosage.

These findings led us to anticipate that modifying the delivery technique (supp) or the vitamin C dose will have a major impact on the mean tooth length.

res.aov3 <- aov(len ~ supp * dose, data = data)
summary(res.aov3)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4  15.572 0.000231 ***
## dose         2 2426.4  1213.2  92.000  < 2e-16 ***
## supp:dose    2  108.3    54.2   4.107 0.021860 *  
## Residuals   54  712.1    13.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
res.aov4 <- aov(len ~ supp + dose + supp * dose, data = data)
summary(res.aov3)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## supp         1  205.4   205.4  15.572 0.000231 ***
## dose         2 2426.4  1213.2  92.000  < 2e-16 ***
## supp:dose    2  108.3    54.2   4.107 0.021860 *  
## Residuals   54  712.1    13.2                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

以上两种写法是一样的。

Supp has a p-value <0.05 (significant), indicating that varied levels of supp are associated with varying tooth lengths.

The dose p-value <0.05 (significant), indicating that differing treatment levels are linked with substantial differences in tooth length.

The interaction between supp*dose has a p-value of 0.02 (significant), indicating that the connection between dose and tooth length is influenced by the supp technique.

TukeyHSD(res.aov3, which = "dose")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = data)
## 
## $dose
##           diff       lwr       upr   p adj
## D1-D0.5  9.130  6.362488 11.897512 0.0e+00
## D2-D0.5 15.495 12.727488 18.262512 0.0e+00
## D2-D1    6.365  3.597488  9.132512 2.7e-06
TukeyHSD(res.aov3, which = "supp")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = len ~ supp * dose, data = data)
## 
## $supp
##       diff       lwr       upr     p adj
## VC-OJ -3.7 -5.579828 -1.820172 0.0002312

最后一次修改于 2021-05-17