编者自语:

asreml是非常强大的软件, 由于太强大, 很多人不会使用. 基因组选择在育种中的应用, 其基础是常规的系谱动物模型, 动物模型也可以很复杂, 看一下asreml的说明书就知道了, 有300多页, 据我了解, 其厚度可以用这个公式表示:

a s r e m l 说 明 书 的 厚 度 > = D M U 的 厚 度 + W O M B A T 的 厚 度 + B L U P F 90 的 厚 度 + a s r e m l − R 的 厚 度 asreml说明书的厚度 > = DMU的厚度 + WOMBAT的厚度 + BLUPF90的厚度 + asreml-R的厚度 asreml>=DMU+WOMBAT+BLUPF90+asremlR

这说明一个问题, Arthur Gilmour教授(asreml的作者)是一个非常有耐心, 也非常厉害的统计学家, 他花费了自己的大半生, 将自己的心血编程了这个软件, 我很佩服.

这个教程是asreml在基因组选择和分子育种中的应用, 下面是我的读书笔记.

一个朋友说, 我们这个圈子很小了, 如果大家再不知道怎么分享, 怎么交流, 那我们这个学科以后怎么办呢, 这也是我停不下来的原因. 尼采说过: 力的过剩, 是力的证明. 他把不务正业说的这么理所应当, 搞得我将斜杠青年进行到底的决心变得更加稳固. 废话少说, 以下是目录.

目录:

基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析

简介

这篇文档的主要目标是介绍ASReml在基因组分析中的实现方法, 它假定读者有一定的统计基础. 在本文档中, 不对统计和模型做过多的介绍.

1, 单标记分析

示例数据:

ID,effect,SNP_1,SNP_100,SNP_1000,SNP_101,SNP_102,SNP_103,SNP_104,SNP_105,SNP_106,SNP_107,SNP_108,SNP_109,SNP_11,SNP_110,SNP_111,SNP_112,SNP_113,SNP_114
ID_1,-0.259731957336183,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_10,0.117554666740654,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
ID_100,0.00357380737732867,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_101,0.344906212015101,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0
ID_102,0.376403712779367,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
ID_103,0.131676984710817,0,0,0,0,1,1,0,1,1,1,0,0,0,0,0,0,0,0
ID_104,0.41299708896122,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
ID_105,0.353890056009646,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_106,0.237438809186312,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_107,-0.316455302927825,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
ID_108,-0.235784805404543,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_109,0.0783501427411017,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_11,0.0919863476998604,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0

ID, 观测值为effect, 第三列及以后为SNP 名称.

将每个标记作为固定因子, 循环运行:

!cycle SNP_1 SNP_100  SNP_1000  SNP_101  SNP_102  SNP_103   SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111  SNP_112 SNP_113 SNP_114

dd.csv  !SKIP 1

effect  ~ mu $I

可以在asr文件中, 查看每个SNP的显著性, 这是单标记方差分析.

                                   Wald F statistics
     Source of Variation           NumDF     DenDF    F-inc            P-inc
  21 mu                                1     651.0     0.83            0.363
  14 SNP_109                           2     651.0     5.20            0.006
 Finished: 19 Oct 2018 17:04:23.666   LogL Converged
 Folder: D:\spline\snp-asreml
 Cycle 13 value is SNP_11
 Reading dd.csv  FREE FORMAT skipping     1 lines

 Univariate analysis of effect                                          
 Summary of 654 records retained of 654 read
  Warning: Fewer levels found in SNP_1  than specified
  Warning: Fewer levels found in SNP_101  than specified
  Warning: Fewer levels found in SNP_104  than specified
  Warning: Fewer levels found in SNP_11  than specified
  Warning: Fewer levels found in SNP_112  than specified
 Forming        3 equations:   3 dense.
 Initial updates will be shrunk by factor    0.316
 Notice: 1 singularities detected in design matrix.
   1 LogL= 603.924     S2= 0.56887E-01    652 df 
   2 LogL= 603.924     S2= 0.56887E-01    652 df 

          - - - Results from analysis of effect - - -
 LogL:  603.92  0.568871E-01     652    2 SNP_11 "LogL Converged"
 Akaike Information Criterion    -1205.85 (assuming 1 parameters).
 Bayesian Information Criterion  -1201.37

 Model_Term                             Gamma         Sigma   Sigma/SE   % C
 Residual                SCA_V  654   1.00000      0.568871E-01  18.06   0 P

                                   Wald F statistics
     Source of Variation           NumDF     DenDF    F-inc            P-inc
  21 mu                                1     652.0     0.82            0.366
  15 SNP_11                            1     652.0     1.25            0.264
 Finished: 19 Oct 2018 17:04:24.058   LogL Converged
 Folder: D:\spline\snp-asreml
 Cycle 14 value is SNP_110
 Reading dd.csv  FREE FORMAT skipping     1 lines

 Univariate analysis of effect                                          
 Summary of 654 records retained of 654 read
  Warning: Fewer levels found in SNP_1  than specified
  Warning: Fewer levels found in SNP_101  than specified
  Warning: Fewer levels found in SNP_104  than specified
  Warning: Fewer levels found in SNP_11  than specified
  Warning: Fewer levels found in SNP_112  than specified
 Forming        3 equations:   3 dense.
 Initial updates will be shrunk by factor    0.316
   1 LogL= 601.263     S2= 0.56936E-01    651 df 
   2 LogL= 601.263     S2= 0.56936E-01    651 df 

          - - - Results from analysis of effect - - -
 LogL:  601.26  0.569356E-01     651    2 SNP_110 "LogL Converged"
 Akaike Information Criterion    -1200.53 (assuming 1 parameters).
 Bayesian Information Criterion  -1196.05

 Model_Term                             Gamma         Sigma   Sigma/SE   % C
 Residual                SCA_V  654   1.00000      0.569356E-01  18.04   0 P

                                   Wald F statistics
     Source of Variation           NumDF     DenDF    F-inc            P-inc
  21 mu                                1     651.0     0.82            0.366
  16 SNP_110                           2     651.0     0.85            0.429
 Finished: 19 Oct 2018 17:04:24.499   LogL Converged
 Folder: D:\spline\snp-asreml
 Cycle 15 value is SNP_111
 Reading dd.csv  FREE FORMAT skipping     1 lines

 Univariate analysis of effect                                          
 Summary of 654 records retained of 654 read
  Warning: Fewer levels found in SNP_1  than specified
  Warning: Fewer levels found in SNP_101  than specified
  Warning: Fewer levels found in SNP_104  than specified
  Warning: Fewer levels found in SNP_11  than specified
  Warning: Fewer levels found in SNP_112  than specified
 Forming        3 equations:   3 dense.
 Initial updates will be shrunk by factor    0.316
   1 LogL= 600.791     S2= 0.57054E-01    651 df 
   2 LogL= 600.791     S2= 0.57054E-01    651 df 

          - - - Results from analysis of effect - - -
 LogL:  600.79  0.570539E-01     651    2 SNP_111 "LogL Converged"
 Local CYCLE LogL Peak at CYCLE:   12 SNP_109 LogL:   605.70 Deviance:   12.35
 Akaike Information Criterion    -1199.58 (assuming 1 parameters).
 Bayesian Information Criterion  -1195.10

 Model_Term                             Gamma         Sigma   Sigma/SE   % C
 Residual                SCA_V  654   1.00000      0.570539E-01  18.04   0 P

                                   Wald F statistics
     Source of Variation           NumDF     DenDF    F-inc            P-inc
  21 mu                                1     651.0     0.81            0.367
  17 SNP_111                           2     651.0     0.17            0.843
 Finished: 19 Oct 2018 17:04:24.962   LogL Converged
 Folder: D:\spline\snp-asreml
 Cycle 16 value is SNP_112
 Reading dd.csv  FREE FORMAT skipping     1 lines

 Univariate analysis of effect                                          
 Summary of 654 records retained of 654 read
  Warning: Fewer levels found in SNP_1  than specified
  Warning: Fewer levels found in SNP_101  than specified
  Warning: Fewer levels found in SNP_104  than specified
  Warning: Fewer levels found in SNP_11  than specified
  Warning: Fewer levels found in SNP_112  than specified
 Forming        3 equations:   3 dense.
 Initial updates will be shrunk by factor    0.316
 Notice: 1 singularities detected in design matrix.
   1 LogL= 602.714     S2= 0.56989E-01    652 df 
   2 LogL= 602.714     S2= 0.56989E-01    652 df 

          - - - Results from analysis of effect - - -
 LogL:  602.71  0.569893E-01     652    2 SNP_112 "LogL Converged"
 Akaike Information Criterion    -1203.43 (assuming 1 parameters).
 Bayesian Information Criterion  -1198.95

 Model_Term                             Gamma         Sigma   Sigma/SE   % C
 Residual                SCA_V  654   1.00000      0.569893E-01  18.06   0 P

                                   Wald F statistics
     Source of Variation           NumDF     DenDF    F-inc            P-inc
  21 mu                                1     652.0     0.82            0.367
  18 SNP_112                           1     652.0     0.08            0.776
 Finished: 19 Oct 2018 17:04:25.435   LogL Converged
 Folder: D:\spline\snp-asreml
 Cycle 17 value is SNP_113
 Reading dd.csv  FREE FORMAT skipping     1 lines

 Univariate analysis of effect                                          
 Summary of 654 records retained of 654 read
  Warning: Fewer levels found in SNP_1  than specified
  Warning: Fewer levels found in SNP_101  than specified
  Warning: Fewer levels found in SNP_104  than specified
  Warning: Fewer levels found in SNP_11  than specified
  Warning: Fewer levels found in SNP_112  than specified
 Forming        3 equations:   3 dense.
 Initial updates will be shrunk by factor    0.316
   1 LogL= 601.723     S2= 0.57001E-01    651 df 
   2 LogL= 601.723     S2= 0.57001E-01    651 df 

          - - - Results from analysis of effect - - -
 LogL:  601.72  0.570011E-01     651    2 SNP_113 "LogL Converged"
 Akaike Information Criterion    -1201.45 (assuming 1 parameters).
 Bayesian Information Criterion  -1196.97

 Model_Term                             Gamma         Sigma   Sigma/SE   % C
 Residual                SCA_V  654   1.00000      0.570011E-01  18.04   0 P

                                   Wald F statistics
     Source of Variation           NumDF     DenDF    F-inc            P-inc
  21 mu                                1     651.0     0.82            0.367
  19 SNP_113                           2     651.0     0.47            0.623
 Finished: 19 Oct 2018 17:04:25.904   LogL Converged
 Folder: D:\spline\snp-asreml
 Cycle 18 value is SNP_114
 Reading dd.csv  FREE FORMAT skipping     1 lines

 Univariate analysis of effect                                          
 Summary of 654 records retained of 654 read
  Warning: Fewer levels found in SNP_1  than specified
  Warning: Fewer levels found in SNP_101  than specified
  Warning: Fewer levels found in SNP_104  than specified
  Warning: Fewer levels found in SNP_11  than specified
  Warning: Fewer levels found in SNP_112  than specified
 Forming        3 equations:   3 dense.
 Initial updates will be shrunk by factor    0.316
   1 LogL= 606.497     S2= 0.56038E-01    651 df 
   2 LogL= 606.497     S2= 0.56038E-01    651 df 

          - - - Results from analysis of effect - - -
 LogL:  606.50  0.560380E-01     651    2 SNP_114 "LogL Converged"
 Local CYCLE LogL Peak at CYCLE:   18 SNP_114 LogL:   606.50 Deviance:   13.94
 Akaike Information Criterion    -1210.99 (assuming 1 parameters).
 Bayesian Information Criterion  -1206.51

 Model_Term                             Gamma         Sigma   Sigma/SE   % C
 Residual                SCA_V  654   1.00000      0.560380E-01  18.04   0 P

                                   Wald F statistics
     Source of Variation           NumDF     DenDF    F-inc            P-inc
  21 mu                                1     651.0     0.83            0.363
  20 SNP_114                           2     651.0     6.08            0.002
 Best LogL   606.50  0.560380E-01     651    2 SNP_114 LogL Converged
 Finished: 19 Oct 2018 17:04:26.403   LogL Converged

结果可以看出, 第20(SNP_114)个SNP达到极显著, 第16(SNP_109)个SNP达到显著水平.

我们也可以将其作为随机因子, 查看Log-likehood评价模型. 如果比空模型好(LRT检验), 那说明标记效应明显.

!cycle SNP_1 SNP_100  SNP_1000  SNP_101  SNP_102  SNP_103   SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111  SNP_112 SNP_113 SNP_114
dd.csv  !SKIP 1

effect  ~ mu !r $I

结果:

LogL:    LogL  Residual        NEDF  NIT Cycle Text
 LogL:  607.75  0.564653E-01     653    6 SNP_1 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 SNP_100 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    7 SNP_1000 "LogL Converged"
 LogL:  606.37  0.567870E-01     653    4 SNP_101 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 SNP_102 "LogL Converged"
 LogL:  606.21  0.568392E-01     653    5 SNP_103 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 SNP_104 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 SNP_105 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 SNP_106 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 SNP_107 "LogL Converged"
 LogL:  606.57  0.567311E-01     653    4 SNP_108 "LogL Converged"
 LogL:  609.22  0.561598E-01     653    3 SNP_109 "LogL Converged"
 LogL:  606.12  0.568872E-01     653    5 SNP_11 "LogL Converged"
 Local CYCLE LogL Peak at CYCLE:   12 SNP_109 LogL:   609.22 Deviance:    6.22
 LogL:  606.16  0.568635E-01     653    4 SNP_110 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 SNP_111 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 SNP_112 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 SNP_113 "LogL Converged"
 LogL:  608.14  0.560577E-01     653    8 SNP_114 "LogL Converged"
 Local CYCLE LogL Peak at CYCLE:   18 SNP_114 LogL:   608.14 Deviance:    4.06

同样的结果, 我们可以看到Local CYCLE中 达到Peak的点在SNP_109 6.22 和SNP_114 4.06, 说明这两个SNP位点达到显著性水平.

另一种写法, 应对标记比较多的情况, 不用每个标记都需要用!cycle指定名称, 可以用!G N, N是标记个数进行代替. 这种方法的缺点是没有SNP标记名称.

 ID  !A      # ID_101
 effect        # 0.344906212015101 
 Marks !G 18
# !cycle SNP_1 SNP_100  SNP_1000  SNP_101  SNP_102  SNP_103   SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111  SNP_112 SNP_113 SNP_114
dd.csv  !SKIP 1

!cycle 1:18
effect  ~ mu !r Marks[$I]

结果:

 LogL:    LogL  Residual        NEDF  NIT Cycle Text
 LogL:  607.75  0.564653E-01     653    6 1 "LogL Converged"
 LogL:  606.10  0.569091E-01     653    6 2 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    7 3 "LogL Converged"
 LogL:  606.37  0.567870E-01     653    4 4 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 5 "LogL Converged"
 LogL:  606.39  0.567814E-01     653    4 6 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 7 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 8 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 9 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 10 "LogL Converged"
 LogL:  606.53  0.567416E-01     653    4 11 "LogL Converged"
 LogL:  607.88  0.564391E-01     653    5 12 "LogL Converged"
 LogL:  606.12  0.568872E-01     653    5 13 "LogL Converged"
 Local CYCLE LogL Peak at CYCLE:   12 12 LogL:   607.88 Deviance:    3.55
 LogL:  606.11  0.569077E-01     653    5 14 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 15 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 16 "LogL Converged"
 LogL:  606.11  0.569091E-01     653    6 17 "LogL Converged"
 LogL:  607.67  0.564839E-01     653    5 18 "LogL Converged"
 Local CYCLE LogL Peak at CYCLE:   18 18 LogL:   607.67 Deviance:    3.12

查看sln中的BLUP值, 放到excel中排序, 可以看出两个标记比较大:
基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_02
如果有每个标记的map位置, 我们就可以进行作图.

2, 多标记分析

顾名思义, 就是讲所有Marks放在一起进行分析.

 ID  !A      # ID_101
 effect        # 0.344906212015101 
 Marks !G 18
# !cycle SNP_1 SNP_100  SNP_1000  SNP_101  SNP_102  SNP_103   SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111  SNP_112 SNP_113 SNP_114
dd.csv  !SKIP 1

# !cycle 1:18
# effect  ~ mu !r Marks[$I]

# effect ~ mu # LogL= 606.105
effect ~ mu !r Marks

基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_03结果:

   8 LogL= 607.362     S2= 0.55772E-01    653 df   0.1377E-01
 Final parameter values                        0.1378E-01

          - - - Results from analysis of effect - - -
 Akaike Information Criterion    -1210.72 (assuming 2 parameters).
 Bayesian Information Criterion  -1201.76

          Approximate stratum variance decomposition
 Stratum     Degrees-Freedom   Variance      Component Coefficients
 Marks                 17.30   0.965402E-01    53.0     1.0
 Residual Variance    635.70   0.557723E-01     0.0     1.0

 Model_Term                             Gamma         Sigma   Sigma/SE   % C
 Marks                   IDV_V   18  0.137842E-01  0.768776E-03   1.24   0 P
 Residual                SCA_V  654   1.00000      0.557723E-01  17.83   0 P

                                   Wald F statistics
     Source of Variation           NumDF     DenDF    F-inc            P-inc
   4 mu                                1      73.6     1.52            0.222
 Notice: The DenDF values are calculated ignoring fixed/boundary/singular
             variance parameters using algebraic derivatives.

                     Solution       Standard Error    T-value     T-prev
   4 mu                            
                    1  -0.167809E-01   0.136271E-01     -1.23
   3 Marks                                18 effects fitted

空模型的log值是606, Mark模型是607, 轻微提高.

查看sln的BLUP值
基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_04

3, 基因组选择

理论介绍

GBLUP所依据的公式为:

y = X β + M g + e y = X\beta + Mg +e y=Xβ+Mg+e

M是n*m构成的矩阵, n是个体数, m为标记数(marker), g是每个标记的BLUP值. 随着标记数目的增加, m >>n的情况出现导致算法需要调整. 现在通用的是

G = M D M ′ G = MDM' G=MDM

如果已经计算出G矩阵, 可以使用asreml进行GBLUP的估算, 代码如下:

!work 12 !ARG 1
QTL ANALUSIS
 id !P
 SEX !A
 AGE !A
 HEIGHT !M -9999

idbgrm.ped !mark !alpha
ibdgrm.grm !ND !dense
ibdgrm.dat

HEIGHT ~ mu SEX !R nrm(id) grm1(id)
  • grm 文件为稠密矩阵(dense)的下三角
  • 固定因子为age, sex
  • 随机因子为加性效应, 基因组随机效应
  • asreml在估算GBlUP时, 会同时给出标记的效应值(marker effect), 结果文件在mef中.
    相关的R包, 参考wgaim

在下一章节中, 我们将对GS的延伸方法: Fast Bayes A进行介绍.

4, 基因组选择的其它方法

EM BayesA-like方法, 参考 Sun et al. (2012)开发而成.

一般标记矩阵的编码方法为: 0 1 2,

  • 0 为major等位基因: eg AA
  • 1 为杂合等位基因: eg Aa
  • 2 为minor等位基因: eg aa

构建矩阵的方法, 公式为:
基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_05
具体参数:
基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_06

Bayes A, 假定性状是由主效QTL控制, 少数QTL解释了一大半的变异, 而不是像GBLUP所假定每个标记的有相同的方差(符合正态分布)

Fast Bayes A:
基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_07

Bayes B的方法在asreml中实现:
基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_08

marker文件格式:

  • 文件命名为*.mkr
  • 第一列为基因型ID
  • 第一行为SNP ID
  • mkr中不能有缺失值

基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_09

标记文件的命令参数, 这些参数都需要和标记文件放在同一行才可以起作用
filename.mkr

  • !markers m # 标记的个数(可以省略)
  • !IDS n # 个体的个数(可以省略)
  • !FBA k # 定义asreml是否使用GBLUP(省略, 为GBLUP, 标记方差一致, k=0), k在Fast BayesA中是标记的方差分布符合逆卡方(inverse Chi-square)分布的参数, 如果使用!FBA, 默认的k=4. 一般来说k需要大于3小于20. 如果!FBA出现, asreml会默认使用!EXTRA 5用于读取mef文件, 当做初始值.
  • !FBB p # p是百分数, 设置多大比例标记方差组分为0(对应的是标记的效应值也为0), 这里可以定义BayesB
  • !HEADER 0 # 标记没有行头
  • !SKIP c # 掉过的行数
  • !CSKIP # 掉过的列数, 使用!SKIP -1表示第一列没有ID, 是SNP

以下参数不常用

  • !OFFSET o
  • !CENTER
  • !SAVEGIV g
  • !PENALTY d
  • !DFOFFSET t
  • !MSCALE s
  • !PEV

权重G矩阵
基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_10

常规GBLUP命令

!wrokspace 1
title: standard GBLUP model
 ID *
 phenotype
genotype.mrk !markers 10031 !IDS 3226 # 标记文件有10031个SNP, ID有3225个
phenotype.txt !skip 1 !maxit 50 !gdense #使用稠密矩阵(dense)

phenotype ~ mu !r grm1(ID)
residual units

结果说明

  • 基因型个体的GBLUP值在.sln文件中
  • 如果标记ID有1000个, mark文件ID有1500, 则sln文件也会有1500, 另外500为GBLUP预测值(即这部分没有表型值, 根据基因型进行的GBLUP值预测)
  • 标记的效应值在.mef文件中, 如果!PEV在mark文件后面, .mef文件中会有标准误

Fast Bayes A方法命令
很多时候, 我们对一些效应较大的标记感兴趣, 例如QTL, 但是GBLUP估计是收缩是估计(shrunken estimators), QTL的效应值会被周围的标记吸收掉, 导致大效应标记难以发现.
Bayes A的模型可以鉴定少数大效应的标记, 这里的Fast Bayes-A like 方法类似. 对于一些性状, Fast Bayes-A比GBLUP的预测效果更好.

调整对角线D
基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_11

常规Fast-BayesA命令

!wrokspace 1
title: Fast-BayesA model
 ID !A
 phenotype
genotype.mrk !markers 10031 !IDS 3226 !FBA 4.2 # 标记文件有10031个SNP, ID有3225个, !FBA 设置为4.2
phenotype.txt !skip 1 !maxit 50 

phenotype ~ mu !r grm1(ID) 0.808 !GF # 这里Vg的gamma设置为0.808, 固定方差组分
residual units

结果说明

  • .mef包括marker的效应值, 以及权重(weight)
  • .res 包括显著性的SNP

不同的K值, Vg是固定还是估计 比较
基因组选择和SNP分析在ASREML-SA中的实现方法_数据分析_12

结论:

  • k值为4左右是, 效果比较好
  • Vg是固定还是估算, 影响不大, 默认估算

5, 使用asreml注意事项

  • 只有一个GRM文件可以用, 如果有多个, 建议转化为giv使用
  • 对于Fast Bayes模型中, 只有一个GRM能够使用, 如果有其它, 使用giv
  • ID 的顺序要和G的ID顺序一致, 建议将G的ID单独抽取出来, 用!L 定义
  • !PEV会给出标记的标准误, 结果不可靠

基因型的GBLUP在.sln中, mark的效应在.mef中, 标记的权重(weight)在.mef中, 大效应的标记在.res文件中.

6, asreml基因组选择考虑GWAS和QTL显著性位点

如果已经鉴定出大效应的SNP, 可以放在模型中, 这样模型就可以利用GWAS和QTL的信息, 提高预测的准确性.

snp(ID, 954) snp(ID,4480)

可以作为固定因子, 或者随机因子.

后记

GS中, 多性状GS模型的效果要高于单性状GS, asreml中有很多强大的函数可以利用, 未来可期.