1. 魂断蓝桥的起因

一个两天的bug,终于解决了,竟然是很基础的bug,花费了我两天时间,不免感叹:为什么相爱的人不能在一起。

事情是这个样子的:

最近测试一个R包breedR的动物模型的功能,用了它的测试数据:

> library(breedR)
> data("globulus")
> ped <- globulus[,1:3]
> str(ped)
'data.frame':	1021 obs. of  3 variables:
 $ self: int  69 70 71 72 73 74 75 76 77 78 ...
 $ dad : int  0 0 0 0 0 0 0 0 0 4 ...
 $ mum : int  64 41 56 55 22 50 67 59 49 8 ...
> res <- remlf90(  fixed = phe_X ~ gg,genetic = list(model = 'add_animal',pedig = ped,id = 'self'),data = globulus)
Using default initial variances given by default_initial_variance()
See ?breedR.getOption.

> summary(res)
Formula: phe_X ~ 0 + gg + pedigree 
   Data: globulus 
  AIC  BIC logLik
 5799 5809  -2898

Parameters of special components:


Variance components:
         Estimated variances  S.E.
genetic                3.397 1.595
Residual              14.453 1.529

             Estimate    S.E.
Heritability   0.1887 0.08705

这里,Va为3.39,Ve为14.45,然后我使用asreml-r作为对比:

> library(asreml)
> head(dd)
  self dad mum gen gg bl  phe_X  x y
1   69   0  64   1 14 13 15.756  0 0
2   70   0  41   1  4 13 11.141  3 0
3   71   0  56   1 14 13 19.258  6 0
4   72   0  55   1 14 13  4.775  9 0
5   73   0  22   1  8 13 19.099 12 0
6   74   0  50   1 14 13 19.258 15 0
> dd$self = as.factor(dd$self)
> ainv = asreml.Ainverse(ped)$ginv
> mod1.as = asreml(phe_X ~ gg , random = ~ ped(self),ginverse = list(self = ainv), data=dd) 
LogLikelihood Converged 
> summary(mod1.as)$varcomp
                  gamma component std.error  z.ratio constraint
ped(self)!ped 0.2349996  3.396488  1.595445 2.128865   Positive
R!variance    1.0000000 14.453164  1.529262 9.451070   Positive

结果是一样的。

2. 当你以为一帆风顺时,生活来了

于是我用另外一个数据集,进行测试,数据是使用的我编写的R包:learnasreml中的数据:

> library(learnasreml)
> dat = animalmodel.dat
> ped = animalmodel.ped
> # asreml
> ainv = asreml.Ainverse(ped)$ginv
> mod2.as = asreml(BWT ~ SEX, random = ~ ped(ANIMAL), ginverse = list(ANIMAL = ainv), data=dat)
LogLikelihood Converged 
> summary(mod2.as)$varcomp
                    gamma component std.error   z.ratio constraint
ped(ANIMAL)!ped 0.2160062  2.494254 0.9180669  2.716855   Positive
R!variance      1.0000000 11.547140 0.9386043 12.302458   Positive

方差组分Va为2.49,Ve为11.54。

使用breedR进行测试:

> dd2 = dat
> mod2.br = remlf90(BWT ~ SEX, genetic =  list(model = "add_animal",pedigree = ped, id="ANIMAL"),data=dd2)
> summary(mod2.br)
Formula: BWT ~ 0 + SEX + pedigree 
   Data: dd2 
  AIC  BIC logLik
 5941 5951  -2968
Parameters of special components:
Variance components:
         Estimated variances   S.E.
genetic               0.6651 0.6728
Residual             13.3380 0.8602

纳尼?方差组分Va为0.66,Ve为13.33,这是什么鬼?和asreml不一样,据我对asreml的熟练程度,只有一种可能:那肯定是breedR有错误。

3. 你大爷永远是你大爷

于是我找到breedR的github中的issue:
上面问题描述:

Hi there,
I recently tried to fit an animal model using the remlf90() function. My model was simple and contained 4 fixed effects, 1 random non-genetic effect and the genetic additive effect (pedigree). I compared the results (h2 + se) to those of BLUPF90 (airemlf90) and they were the same as they should be. Then, I changed the class of the ‘id’ variable in the genetic part of the model from integer to factor and I re-ran the model. The h2 was considerabley different from that I got when the ‘id’ was of class integer.

dat399 a n i m a l < − a s . i n t e r g e r ( d a t 399 animal <- as.interger(dat399 animal<as.interger(dat399animal) # h2 = 0.44, se = 0.012 (correct)
dat399 a n i m a l < − f a c t o r ( d a t 399 animal <- factor(dat399 animal<factor(dat399animal) # h2 = 0.08, se = 0.006 (wrong)
dat399 a n i m a l < − a s . i n t e r g e r ( a s . c h a r a c t e r ( d a t 399 animal <- as.interger(as.character(dat399 animal<as.interger(as.character(dat399animal)) # h2 = 0.44, se = 0.012 (correct again)

So, is this normal, should the ‘id’ part of the genetic effect be always coded as integer or there is a bug that needs to be corrected?

作者回答,breedR需要个体的ID是数字型,如果是因子的话,会报错提醒啊。。。

Hi Nabeel.
Yes. The variables encoding individuals (i.e., id and progenitors) should be integers.
However, the pedigree-building function should have raisen an error whenever the user tries with a factor.
How did you specify the pedigree in the model?
Thanks for your report and help.

然后作者问了一句开发者经常问道的问题:你这个bug是如何得到的。。。

然而,有时候个体是factor时,真的没有报错,我也想提交一个issue,算了,还是自己解决吧!

我就把因子转化为了数字,运行breedR:

> dd2$ANIMAL = as.numeric(dd2$ANIMAL)
> mod2.br = remlf90(BWT ~ SEX, genetic =  list(model = "add_animal",pedigree = ped, id="ANIMAL"),data=dd2)
> summary(mod2.br)
Formula: BWT ~ 0 + SEX + pedigree 
   Data: dd2 
  AIC  BIC logLik
 5941 5951  -2968

Parameters of special components:

Variance components:
         Estimated variances   S.E.
genetic               0.6651 0.6728
Residual             13.3380 0.8602

这个。。。也太真实了,结果不变,依旧是错误的。

4. 黯然销魂掌

于是,我陷入了深深的职业自我怀疑中:我是爱它的,为什么相爱的人不能在一起?

我左看右看,上看下看,还是没有找到问题的所在,我翻遍了breedR的issue,发现了这么一句话:

Regarding the issue of the variable type of the animal id, note that in the genetic component, the pedigree is taken from ped399, where the variable animal is presumably integer or numeric. However, if you change the corresponding variable in dat399 as you have been doing, this breaks the correspondence between the animal codes in the pedigree and the dataset.

大意就是说,breedR中,系谱和个体ID需要是数字,因为系谱的数据会在breedR中重新编码,如果你改变了数据中ID的编码,那么系谱构建的矩阵就和数据中的ID对应不了,结果就可能是错误的。

这一段正确的废话,并没有激起我什么想法,我还是继续沉浸于深深的自我怀疑中,一定是我不够好,所以它才想要逃。。。

5. 梦里传来你的呼唤

灵感总是在梦中醒来,半夜忽然一个想法,是不是我转化数字的时候,变了?
但是数据中本来就是数字的因子类型啊,我把它转化为数字的数字类型时会变么?
我早知道R中有这种坑,在factor转化为number时,一定要通过character,否则会有各种不可预知的坑
难道
难道说
这个坑被我遇到了么???

第二天上班,我迫切的测试了一下:

> tt = dat
> head(tt$ANIMAL)
[1] 1029 1299 643  1183 1238 891 
1084 Levels: 1 2 3 5 6 7 8 9 10 11 12 14 15 16 17 20 21 22 24 25 26 27 28 29 30 32 33 34 35 36 37 38 40 41 42 43 44 47 48 49 50 51 52 ... 1309
> head(as.numeric(tt$ANIMAL))
[1]  864 1076  549  989 1030  751

可以看到,变得面目全非,本来是1029,现在是864,本来是1299,现在是1076。

6. 恍然大迷瞪

看完之后,我激动的心无法平静,竟然想起了“为何相爱的人不能在一起的旋律”,我也太难了,竟然是这个原因。。。

脑子里想起祥林嫂的语句:
我早知道,R中factor转化为number时有可能出错。。。

然后我用character作为中间元素,再测试了一下:

> tt = dat
> head(tt$ANIMAL)
[1] 1029 1299 643  1183 1238 891 
1084 Levels: 1 2 3 5 6 7 8 9 10 11 12 14 15 16 17 20 21 22 24 25 26 27 28 29 30 32 33 34 35 36 37 38 40 41 42 43 44 47 48 49 50 51 52 ... 1309
> head(as.numeric(as.character(tt$ANIMAL)))
[1] 1029 1299  643 1183 1238  891

这就是对的了!

最后我用正确的形式,测试breedR中的动物模型:

> dd2 = dat
> dd2$ANIMAL = as.numeric(as.character(dd2$ANIMAL))
> mod2.br = remlf90(BWT ~ SEX, genetic =  list(model = "add_animal",pedigree = ped, id="ANIMAL"),data=dd2)
> summary(mod2.br)
Formula: BWT ~ 0 + SEX + pedigree 
   Data: dd2 
  AIC  BIC logLik
 5931 5941  -2964

Parameters of special components:


Variance components:
         Estimated variances   S.E.
genetic                2.494 0.9181
Residual              11.547 0.9386

终于看到了正确的结果,Va为2.49,Ve为11.54.

7. 多少人爱你青春欢畅的时辰

多么痛的领悟啊!

R中factor和number相互转化时,一定要经过character,这不是二手车市场,一定要有中间商赚差价!!!

为什么相爱的人不能在一起?_数据分析