1. 多重比较
多重比较法是多个等方差正态总体均值的比较方法。经过方差分析法可以说明各总体均值间的差异是否显著,即只能说明均值不全相等,但不能具体说明哪几个均值之间有显著差异。t检验只能说明两个均值的差异是否显著。比较m个均值,需要单独进行(m/2)=m(m-1)/2次t检验,不但工作量大,而且误差也大。多重比较法可以克服这些缺点。
上面百度百科的定义,通俗一点来讲:
有10个品种的,产量数据,想看一下品种间是否达到显著性差异,分析思路:
- 方差分析,查看品种这个因素是否达到显著性差异
- 如果达到显著性差异,表示品种间至少有一对达到显著性差异,问题是哪一对,或者哪几对?
- 使用多重比较
2. 方差分析aov的多重比较
使用npk数据,进行建模,对block进行多重比较。
载入数据,查看数据:
> ### aov的多重比较
>
> # 1, 载入数据
> data(npk)
>
> # 2,查看数据
> head(npk)
block N P K yield
1 1 0 1 1 49.5
2 1 1 1 0 62.8
3 1 0 0 0 46.8
4 1 1 0 1 57.0
5 2 1 0 0 59.8
6 2 1 1 1 58.5
建模:
> # 3, 建模:yield ~ N + block
> mod1 = aov(yield ~ N + block, data=npk)
> summary(mod1)
Df Sum Sq Mean Sq F value Pr(>F)
N 1 189.3 189.28 9.360 0.0071 **
block 5 343.3 68.66 3.395 0.0262 *
Residuals 17 343.8 20.22
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
多重比较:
> # 4, LSD 多重比较
> library(agricolae)
> re = LSD.test(mod1,"block")
> re
$statistics
MSerror Df Mean CV t.value LSD
20.22284 17 54.875 8.194955 2.109816 6.708889
$parameters
test p.ajusted name.t ntr alpha
Fisher-LSD none block 6 0.05
$means
yield std r LCL UCL Min Max Q25 Q50 Q75
1 54.025 7.269285 4 49.2811 58.7689 46.8 62.8 48.825 53.25 58.450
2 57.450 2.043689 4 52.7061 62.1939 55.5 59.8 55.875 57.25 58.825
3 60.775 6.790373 4 56.0311 65.5189 55.0 69.5 55.600 59.30 64.475
4 50.125 8.150000 4 45.3811 54.8689 44.2 62.0 45.175 47.15 52.100
5 50.525 1.486327 4 45.7811 55.2689 48.8 52.0 49.550 50.65 51.625
6 56.350 2.435159 4 51.6061 61.0939 53.2 59.0 55.300 56.60 57.650
$comparison
NULL
$groups
yield groups
3 60.775 a
2 57.450 ab
6 56.350 abc
1 54.025 bc
5 50.525 c
4 50.125 c
attr(,"class")
[1] "group"
这里,得到的LSD = 6.708889
, 多重比较中,用水平的平均值的差值,与LSD比较,如果大于LSD,则认为两水平达到显著性差异。
比如水平 60.775 - 54.025 > 6.70889
, 所以他们的多重比较标识为a和b。
$groups
yield groups
3 60.775 a
2 57.450 ab
6 56.350 abc
1 54.025 bc
5 50.525 c
4 50.125 c
3. 标准误,差数的标准误,最小显著性差异的计算
- se:标准误
- sed:差数的标准误
- LSD:最小显著性差值
SE的方差:
SED的方差:
LSD的方差
# Standard errors of means
#
# Table block N
# rep. 4 12
# d.f. 17 17
# e.s.e. 2.248 1.298
# Standard errors of differences of means
#
# Table block N
# rep. 4 12
# d.f. 17 17
# s.e.d. 3.180 1.836
# Least significant differences of means (5% level)
#
# Table block N
# rep. 4 12
# d.f. 17 17
# l.s.d. 6.709 3.873
这里因素block的各项值为:
- se = 2.248
- sed = 3.180
- lsd = 6.709
R语言计算的LSD为:
# $statistics
# MSerror Df Mean CV t.value LSD
# 20.22284 17 54.875 8.194955 2.109816 6.708889
LSD的计算方法
包括tvalue和sed
> tvalue = qt((1-0.05/2),17)
> tvalue*3.180
[1] 6.709214
所以,多重比较的关键点在于SED的计算和tvalue的计算。tvalue的计算是0.05和自由度的计算。
4. asreml如何进行多重比较
所以,如果想用asreml进行多重比较,需要计算sed,asreml能够计算两两水平的SED,所以可以手动计算两两水平的LSD,然后就可以对两两水平进行多重比较了。
注意:
- 不平衡数据中,两两水平的SED不同,因此LSD也不同,所以没有统一的LSD
- 判断两两水平的多重比较时,需要用这两个水平的LSD
asreml建模:
> library(asreml)
> mod2 = asreml(yield ~ N +block, data=npk)
ASReml: Tue Oct 15 19:50:37 2019
LogLik S2 DF wall cpu
-39.1127 20.2228 17 19:50:38 0.0
-39.1127 20.2228 17 19:50:38 0.0
Finished on: Tue Oct 15 19:50:38 2019
LogLikelihood Converged
显著性检验:
> wald(mod2)
Wald tests for fixed effects
Response: yield
Terms added sequentially; adjusted for those above
Df Sum of Sq Wald statistic Pr(Chisq)
(Intercept) 1 72270 3573.7 < 2.2e-16 ***
N 1 189 9.4 0.002218 **
block 5 343 17.0 0.004546 **
residual (MS) 20
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
这里,block对应的残差自由度为17,这个数值计算tvalue时有用。
计算SED和predict means
> x = predict(mod2,classify = "block",sed = TRUE,vcov = T)
ASReml: Tue Oct 15 19:50:37 2019
LogLik S2 DF wall cpu
-39.1127 20.2228 17 19:50:38 0.0
-39.1127 20.2228 17 19:50:38 0.0
Finished on: Tue Oct 15 19:50:38 2019
LogLikelihood Converged
>
> x$predictions$pvals #predict_means, se
Notes:
- The predictions are obtained by averaging across the hypertable
calculated from model terms constructed solely from factors in
the averaging and classify sets.
- Use "average" to move ignored factors into the averaging set.
- The SIMPLE averaging set: N
block predicted.value standard.error est.status
1 1 54.025 2.248491 Estimable
2 2 57.450 2.248491 Estimable
3 3 60.775 2.248491 Estimable
4 4 50.125 2.248491 Estimable
5 5 50.525 2.248491 Estimable
6 6 56.350 2.248491 Estimable
> x$predictions$sed #sed
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0.000000 3.179846 3.179846 3.179846 3.179846 3.179846
[2,] 3.179846 0.000000 3.179846 3.179846 3.179846 3.179846
[3,] 3.179846 3.179846 0.000000 3.179846 3.179846 3.179846
[4,] 3.179846 3.179846 3.179846 0.000000 3.179846 3.179846
[5,] 3.179846 3.179846 3.179846 3.179846 0.000000 3.179846
[6,] 3.179846 3.179846 3.179846 3.179846 3.179846 0.000000
这里,
SE = 2.24891
SED = 3.179846
同方差分析结果一致,因为数据时平衡的,如果不平衡,那么水平之间的SED可能会不相等。
计算LSD的值
> tvalue = qt((1-0.05/2),17)
> lsd = tvalue*sed
> lsd
[1] 6.708889
结构和方差分析一致。
5,一个复杂的例子
数据oats
> # 一个复杂的例子
> data(oats)
> head(oats)
Blocks Nitrogen Subplots Variety Wplots yield
1 1 0.6_cwt 1 Marvellous 1 156
2 1 0.4_cwt 2 Marvellous 1 118
3 1 0.2_cwt 3 Marvellous 1 140
4 1 0_cwt 4 Marvellous 1 105
5 1 0_cwt 1 Victory 2 111
6 1 0.2_cwt 2 Victory 2 130
方差分析的多重比较
> m1 = aov(yield ~ Nitrogen +Blocks, data = oats)
> summary(m1)
Df Sum Sq Mean Sq F value Pr(>F)
Nitrogen 3 20021 6674 26.13 4.22e-11 ***
Blocks 5 15875 3175 12.43 2.10e-08 ***
Residuals 63 16090 255
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> re = LSD.test(m1,"Nitrogen")
> re
$statistics
MSerror Df Mean CV t.value LSD
255.3995 63 103.9722 15.37067 1.998341 10.64531
$parameters
test p.ajusted name.t ntr alpha
Fisher-LSD none Nitrogen 4 0.05
$means
yield std r LCL UCL Min Max Q25 Q50 Q75
0.2_cwt 98.88889 21.84407 18 91.36152 106.41626 64 140 83.75 95.0 112.50
0.4_cwt 114.22222 22.31738 18 106.69485 121.74959 81 161 97.75 115.0 124.75
0.6_cwt 123.38889 22.99908 18 115.86152 130.91626 86 174 106.25 121.5 139.00
0_cwt 79.38889 19.39417 18 71.86152 86.91626 53 117 63.25 72.0 94.25
$comparison
NULL
$groups
yield groups
0.6_cwt 123.38889 a
0.4_cwt 114.22222 a
0.2_cwt 98.88889 b
0_cwt 79.38889 c
attr(,"class")
[1] "group"
>
LSD为10.64531
asreml的多重比较
> m2 = asreml(yield ~ Nitrogen +Blocks, data=oats)
ASReml: Tue Oct 15 19:50:37 2019
LogLik S2 DF wall cpu
-39.1127 20.2228 17 19:50:38 0.0
-39.1127 20.2228 17 19:50:38 0.0
Finished on: Tue Oct 15 19:50:38 2019
LogLikelihood Converged
LogLikelihood Converged
> predict(m2,classify = "Nitrogen",sed=T)$predictions$sed
ASReml: Tue Oct 15 16:17:10 2019
LogLik S2 DF wall cpu
-217.1962 255.3995 63 16:17:10 0.0
-217.1962 255.3995 63 16:17:10 0.0
Finished on: Tue Oct 15 16:17:10 2019
LogLikelihood Converged
[,1] [,2] [,3] [,4]
[1,] 0.000000 5.327074 5.327074 5.327074
[2,] 5.327074 0.000000 5.327074 5.327074
[3,] 5.327074 5.327074 0.000000 5.327074
[4,] 5.327074 5.327074 5.327074 0.000000
> wald(m2)
Wald tests for fixed effects
Response: yield
Terms added sequentially; adjusted for those above
Df Sum of Sq Wald statistic Pr(Chisq)
(Intercept) 1 778336 3047.52 < 2.2e-16 ***
Nitrogen 3 20020 78.39 < 2.2e-16 ***
Blocks 5 15875 62.16 4.348e-12 ***
residual (MS) 255
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
这里,tvalue的自由度为62.16(因为有缺失值),sed为5.327074,所以LSD的计算为:
> qt(0.975,62.16)*5.327074
[1] 10.64812
和方差分析的LSD结果一致,然后再手动进行多重比较即可。
6,asreml进行多重比较的说明
- 混合线性模型框架下,可以考虑A矩阵和G矩阵
- 多重比较主要是针对固定因子
7, LSD与T检验
- 一个因素不同水平的比较,和T检验类似,差值除以sed,得到T值,配合自由度,得到pvalue。sed*自由度的tvalue,得到lsd,用lsd和差值比较。