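Overview
ADMIXTURE estimates ancestry proportions by maximum likelihood (EM is one of the optimization methods it offers). It is usually run in one of two ways: with a number of subpopulations K chosen by the user, or, when nothing is known about the population structure of the material, across a range of K values with cross-validation, taking the K with the lowest CV error as the recommended number of subpopulations. If the population of origin of some samples is already known (with complete certainty), supervised mode is worth considering: labelling those samples improves the accuracy of the assignment for the rest.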
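For the cross-validation route, here is a minimal sketch of how the K with the lowest error could be picked out of saved logs. It assumes each run used the --cv flag, that its output was saved to a log file, and that the log contains a line of the form "CV error (K=<k>): <value>". The names input.bed and logK.out and the helper best_k are placeholders of mine, not part of ADMIXTURE.

```shell
# Run cross-validation for a range of K and keep each log, for example:
#   for K in 1 2 3 4 5; do admixture --cv input.bed $K | tee log${K}.out; done

# best_k: print the K whose "CV error" line has the smallest value.
# Hypothetical helper; takes one or more log files as arguments.
best_k() {
    awk '/^CV error/ {
        k = $3
        gsub(/[^0-9]/, "", k)        # "(K=3):" becomes "3"
        err = $4 + 0                  # force numeric comparison
        if (best == "" || err < min) { min = err; best = k }
    }
    END { if (best != "") print best }' "$@"
}

# Usage:
#   best_k log*.out
```

Parsing with awk rather than sorting the grep output keeps the comparison numeric, so K=10 is not mistaken for something smaller than K=2.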
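The current ADMIXTURE release is 1.3.0, and the manual was updated recently as well.
In case my paraphrase is off, here is the relevant passage from the official manual: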
Estimating P and Q from the SNP matrix G, without any additional information, can be
viewed as an unsupervised learning problem. However it is not uncommon that some or
all of the individuals in our data sample will have known ancestries, allowing us to set
some rows in the matrix Q to known constants. This allows more accurate estimation of
the ancestries of the remaining individuals, and of the ancestral allele frequencies. Viewing
these reference individuals as training samples, the problem is transformed into a supervised
learning problem.

Supervised learning mode is enabled with the flag --supervised and requires an additional
file with a .pop suffix, specifying the ancestries of the reference individuals. It is assumed
that all reference samples have 100% ancestry from some ancestral population. Each line
of the .pop file corresponds to individual listed on the same line number in the .fam or
.ped file. If the individual is a population reference, the .pop file line should be a string
(beginning with an alphanumeric character) designating the population. If the individual
is of unknown ancestry, use “-” (or a blank line, or any non-alphanumeric character) to
indicate that the ancestry should be estimated.
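In short, the manual asks for a population file with a .pop suffix that gives each individual a population label (a string); individuals of unknown ancestry get "-". Creating this file on Windows is not recommended, because Windows line endings (CRLF) differ from the LF endings used on Linux.
How do you check the .pop file once it is prepared? The ADMIXTURE author suggests paste .fam .pop to see whether the number of individuals matches (if only the count matters, wc -l is simpler).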
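A small sketch of that check, covering both the line count and the Windows line-ending problem. The function name check_pop is my own. It counts lines with awk's NR instead of wc -l, on the assumption that a file whose last line has no trailing newline should still be counted in full.

```shell
# check_pop FAM POP
# Succeeds only if both files have the same number of lines and the
# .pop file has no carriage returns. Hypothetical helper.
check_pop() {
    fam_file=$1
    pop_file=$2
    n_fam=$(awk 'END { print NR }' "$fam_file")
    n_pop=$(awk 'END { print NR }' "$pop_file")
    if [ "$n_fam" -ne "$n_pop" ]; then
        echo "line count mismatch: $n_fam in $fam_file, $n_pop in $pop_file" >&2
        return 1
    fi
    # A carriage return means the file was saved with CRLF endings.
    if grep -q "$(printf '\r')" "$pop_file"; then
        echo "$pop_file contains carriage returns (Windows line endings)" >&2
        return 1
    fi
    echo "OK: $n_fam individuals"
}

# Usage:
#   check_pop hapmap3.fam hapmap3.pop
#   paste hapmap3.fam hapmap3.pop | head    # eyeball the pairing as well
```

The count alone does not prove that each label sits next to the right individual, which is probably why the author recommends paste.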
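The catch is that the manual never shows the actual command to run. I tried it out, and these are my brief notes.
Hands-on
Download the example data from the official site: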
http://dalexander.github.io/admixture/download.html
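After unpacking, the data are in PLINK format with a matching set of bed, bim and fam files, but the .ped that should accompany the .map is missing. A small oversight on the author's part, though PLINK can do the conversion.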
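A sketch of the conversion, assuming PLINK 1.9 is installed as plink and the unpacked files share the prefix hapmap3. The wrapper bed_to_ped is my own naming. ADMIXTURE also reads the .bed fileset directly, so this step is only needed if you want the text format.

```shell
# bed_to_ped PREFIX
# Convert PREFIX.bed/.bim/.fam into PREFIX.ped/.map with PLINK 1.9.
# Hypothetical wrapper; fails early if an input file or plink is missing.
bed_to_ped() {
    prefix=$1
    for ext in bed bim fam; do
        if [ ! -f "$prefix.$ext" ]; then
            echo "missing $prefix.$ext" >&2
            return 1
        fi
    done
    if ! command -v plink >/dev/null 2>&1; then
        echo "plink not found on PATH" >&2
        return 1
    fi
    # --recode writes the text .ped/.map pair in PLINK 1.9
    plink --bfile "$prefix" --recode --out "$prefix"
}

# Usage:
#   bed_to_ped hapmap3
```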
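Prepare the hapmap3.pop file (its prefix must match the PLINK data, and it must sit in the same directory). Any tool such as R or awk will do; here I simply simulate one with arbitrary labels.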
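One way to mock up such a file with awk. The labels pop1, pop2 and pop3 and the rule used to assign them are entirely made up for illustration and carry no biological meaning; the function name make_mock_pop is mine as well.

```shell
# make_mock_pop FAM
# Print one label per line of FAM: every fourth individual is left
# unknown ("-"), the others cycle through pop1, pop2, pop3.
# Purely simulated labels, for testing the workflow only.
make_mock_pop() {
    awk '{
        r = NR % 4
        if (r == 0) print "-"
        else print "pop" r
    }' "$1"
}

# Usage:
#   make_mock_pop hapmap3.fam > hapmap3.pop
```

With real data the label column would come from sample metadata instead, written in the same order as the .fam file.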
Then run admixture with the --supervised flag added.
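A sketch of what the run might look like. It assumes admixture is on the PATH and that K should equal the number of distinct labels in the .pop file, which seems the natural choice but is my assumption. The helpers count_pops and run_supervised are my own names.

```shell
# count_pops POP
# Count distinct population labels, ignoring "-" and blank lines
# (only labels starting with an alphanumeric character are counted).
count_pops() {
    awk '$1 ~ /^[A-Za-z0-9]/ { seen[$1] = 1 }
         END { n = 0; for (p in seen) n++; print n }' "$1"
}

# run_supervised PREFIX
# Run ADMIXTURE in supervised mode on PREFIX.bed, using PREFIX.pop.
# Hypothetical wrapper; K is taken from the .pop file (assumption).
run_supervised() {
    prefix=$1
    if [ ! -f "$prefix.bed" ] || [ ! -f "$prefix.pop" ]; then
        echo "need both $prefix.bed and $prefix.pop" >&2
        return 1
    fi
    K=$(count_pops "$prefix.pop")
    admixture --supervised "$prefix.bed" "$K"
}

# Usage:
#   run_supervised hapmap3
```

ADMIXTURE names its .Q and .P output after the input prefix and K, so a supervised and an unsupervised run at the same K write to the same file names; rename or move the first set before starting the second run.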
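It is worth comparing the runs with and without --supervised. Result without --supervised:
Result with --supervised:
The two differ considerably. How this affects downstream analyses is not explored here.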
Author: Bioinfarmer. For updates, follow the WeChat public account of the same name: Bioinfarmer.