用R获取芯片探针与基因的对应关系三部曲-bioconductor

现有的基因芯片种类不要太多了!
softminiml都是表示该platform的基础信息,比如GPL编号,上传日期等,soft文件的部分内容如下

但是重要而且常用的芯片并不多!
一般分析芯片数据都需要把探针的ID切换成基因的ID,我一般喜欢用基因的entrez ID。
一般有三种方法可以得到芯片探针与gene的对应关系。
金标准当然是去基因芯片的厂商的官网直接去下载啦!!!
一种是直接用bioconductor的包
一种是从NCBI里面下载文件来解析好!
首先,我们说官网,肯定可以找到,不然这种芯片出来就没有意义了!
然后,我们看看NCBI下载的,会比较大
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL6947 这两种方法都比较麻烦,需要一个个的来!
所以我接下来要讲的是用R的bioconductor包来批量得到芯片探针与gene的对应关系!
一般重要的芯片在R的bioconductor里面都是有包的,用一个R包可以批量获取有注释信息的芯片平台,我选取了常见的物种,如下:

0

gpl

organism

bioc_package.db

1

GPL32

Mus musculus

mgu74a

2

GPL33

Mus musculus

mgu74b

3

GPL34

Mus musculus

mgu74c

6

GPL74

Homo sapiens

hcg110

7

GPL75

Mus musculus

mu11ksuba

8

GPL76

Mus musculus

mu11ksubb

9

GPL77

Mus musculus

mu19ksuba

10

GPL78

Mus musculus

mu19ksubb

11

GPL79

Mus musculus

mu19ksubc

12

GPL80

Homo sapiens

hu6800

13

GPL81

Mus musculus

mgu74av2

14

GPL82

Mus musculus

mgu74bv2

15

GPL83

Mus musculus

mgu74cv2

16

GPL85

Rattus norvegicus

rgu34a

17

GPL86

Rattus norvegicus

rgu34b

18

GPL87

Rattus norvegicus

rgu34c

19

GPL88

Rattus norvegicus

rnu34

20

GPL89

Rattus norvegicus

rtu34

22

GPL91

Homo sapiens

hgu95av2

23

GPL92

Homo sapiens

hgu95b

24

GPL93

Homo sapiens

hgu95c

25

GPL94

Homo sapiens

hgu95d

26

GPL95

Homo sapiens

hgu95e

27

GPL96

Homo sapiens

hgu133a

28

GPL97

Homo sapiens

hgu133b

29

GPL98

Homo sapiens

hu35ksuba

30

GPL99

Homo sapiens

hu35ksubb

31

GPL100

Homo sapiens

hu35ksubc

32

GPL101

Homo sapiens

hu35ksubd

36

GPL201

Homo sapiens

hgfocus

37

GPL339

Mus musculus

moe430a

38

GPL340

Mus musculus

mouse4302

39

GPL341

Rattus norvegicus

rae230a

40

GPL342

Rattus norvegicus

rae230b

41

GPL570

Homo sapiens

hgu133plus2

42

GPL571

Homo sapiens

hgu133a2

43

GPL886

Homo sapiens

hgug4111a

44

GPL887

Homo sapiens

hgug4110b

45

GPL1261

Mus musculus

mouse430a2

49

GPL1352

Homo sapiens

u133x3p

50

GPL1355

Rattus norvegicus

rat2302

51

GPL1708

Homo sapiens

hgug4112a

54

GPL2891

Homo sapiens

h20kcod

55

GPL2898

Rattus norvegicus

adme16cod

60

GPL3921

Homo sapiens

hthgu133a

63

GPL4191

Homo sapiens

h10kcod

64

GPL5689

Homo sapiens

hgug4100a

65

GPL6097

Homo sapiens

illuminaHumanv1

66

GPL6102

Homo sapiens

illuminaHumanv2

67

GPL6244

Homo sapiens

hugene10sttranscriptcluster

68

GPL6947

Homo sapiens

illuminaHumanv3

69

GPL8300

Homo sapiens

hgu95av2

70

GPL8490

Homo sapiens

IlluminaHumanMethylation27k

71

GPL10558

Homo sapiens

illuminaHumanv4

72

GPL11532

Homo sapiens

hugene11sttranscriptcluster

73

GPL13497

Homo sapiens

74

GPL13534

Homo sapiens

IlluminaHumanMethylation450k

75

GPL13667

Homo sapiens

hgu219

76

GPL15380

Homo sapiens

GGHumanMethCancerPanelv1

77

GPL15396

Homo sapiens

hthgu133b

78

GPL17897

Homo sapiens

hthgu133a

这些包首先需要都下载

gpl_info=read.csv("GPL_info.csv",stringsAsFactors = F)
### first download all of the annotation packages from bioconductor
for (i in 1:nrow(gpl_info)){
  print(i)
  platform=gpl_info[i,4]
  platform=gsub('^ ',"",platform) ##主要是因为我处理包的字符串前面有空格
  #platformDB='hgu95av2.db'
  platformDB=paste(platform,".db",sep="")
  if( platformDB  %in% rownames(installed.packages()) == FALSE) {
    BiocInstaller::biocLite(platformDB)
    #source("http://bioconductor.org/biocLite.R");
    #biocLite(platformDB )
  }
}
下载完了所有的包, 就可以进行批量导出芯片探针与gene的对应关系!
for (i in 1:nrow(gpl_info)){
  print(i)
  platform=gpl_info[i,4]
  platform=gsub('^ ',"",platform)
  #platformDB='hgu95av2.db'
  platformDB=paste(platform,".db",sep="")
  if( platformDB  %in% rownames(installed.packages()) != FALSE) {
    library(platformDB,character.only = T)
    #tmp=paste('head(mappedkeys(',platform,'ENTREZID))',sep='')
    #eval(parse(text = tmp))
###重点在这里,把字符串当做命令运行
    all_probe=eval(parse(text = paste('mappedkeys(',platform,'ENTREZID)',sep='')))
    EGID <- as.numeric(lookUp(all_probe, platformDB, "ENTREZID"))
##自己把内容写出来即可
  }
}

参考:http://blog.sina.com.cn/s/blog_62b37bfe0101jbuq.html

从GEO的soft文件中提取探针序列信息

问题

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL16956 我需要注释这个平台,想要下载它的表格,可是直接表格点下面的view full table会显示在网页中,有五万多行,网页只能显示两万多行,根本不完整。R语言基因重注释后出现基因被识别成日期 r语言转换基因id_hg
思路
所以我只能下载文件了,隐约记得应该下载soft文件,可我不记得怎么读取了。
解决方案
1.搜索:
R语言基因重注释后出现基因被识别成日期 r语言转换基因id_R语言基因重注释后出现基因被识别成日期_02
找到了第一个链接:
https://warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/geo/
哦 原来是这个样子,getGEO可以。但是我的soft好大,接近三百兆,很容易下载不完整。
2.从Rstudio自带的terminal运行linux,下载soft文件到本地
用wget来下载到本地再读取,刚好Rstudio可以运行linux哦,之前发过一次:https://www.jianshu.com/p/c16c7095e4b2

wget -c ftp://ftp.ncbi.nlm.nih.gov/geo/platforms/GPL16nnn/GPL16956/soft/GPL16956_family.soft.gz
gunzip GPL16956_family.soft.gz

R语言基因重注释后出现基因被识别成日期 r语言转换基因id_hg_03


3.读取数据并处理

下载的是压缩包,解压不解压都行的。都可以读取

刚才的帖子里有用信息就是读取方法:

library(GEOquery)
x=getGEO(filename = "GPL16956_family.soft")

step1:从GPL中提取有用的部分,只需要数据框的’ID’,'SEQUENCE’两列。

y=x@dataTable@table[,c('ID','SEQUENCE')]
head(y)
##              ID
## 1 ASHGA5P000001
## 2 ASHGA5P000002
## 3 ASHGA5P000003
## 4 ASHGA5P000004
## 5 ASHGA5P000005
## 6 ASHGA5P000006
##                                                       SEQUENCE
## 1 GAGATAACTAGAACAGTGTCCCTCCCCTTTTATAACCTGTGTTTTTAGATTTCAAAAAGA
## 2 CTAACTGTCAAACAAAGATGGGCTAATTAAACGGATGCCAACTAATCGGAAAACTTCTGG
## 3 CCAGAAGCTCCAGCATGGTCCACCTACCAGAAGCTCCAGCATGATTTACCCATGAGTAGC
## 4 TAAGTACGGTAATGCAAGAAGGTGCAGCAGGCGTAGTGCATTACAGCAACACTGAAAAAC
## 5 TATGAGAGATGAAGAGATAAGGTCTAACTGTTACCTGGGCTGAACCCTTGGACTTCTAGG
## 6 CAGTAGCGAGTATTTACTAAGTACTTTCTATTTGCGAGGCCCTGATAAAAGTACTGTCCT

#断线好几次 保存一下

save(y,file = "16956.Rdata")

step2.重复值统计
几千万行肯定不对,进行去重统计,发现每个都重复了八百多次!不明白为什么。

library(tidyverse)
x=count(y,ID,sort = T)
head(x)
## # A tibble: 6 x 2
##   ID                                      n
##   <chr>                               <int>
## 1 !Sample_channel_count = 1             875
## 2 !Sample_molecule_ch1 = total RNA      875
## 3 !Sample_organism_ch1 = Homo sapiens   875
## 4 !Sample_platform_id = GPL16956        875
## 5 !sample_table_begin                   875
## 6 !sample_table_end                     875

step3.去重复,两种方法

test1=distinct(y,ID,SEQUENCE)
test2=y[!duplicated(y),]

发现有一些叹号开头的行没有被跳过,有NA。所以去除NA。与GEO网页上的行数进行对比,发现一致,说明正确了。保存一下

test3= na.omit(test2)
nrow(test3)
## [1] 58944
save(test3,file = "GPL16956.Rdata")
write.csv(test3,file = "GPL16956.csv",row.names = F)
test4=read.csv("GPL16956.csv")
head(test4)
##              ID
## 1 ASHGA5P000001
## 2 ASHGA5P000002
## 3 ASHGA5P000003
## 4 ASHGA5P000004
## 5 ASHGA5P000005
## 6 ASHGA5P000006
##                                                       SEQUENCE
## 1 GAGATAACTAGAACAGTGTCCCTCCCCTTTTATAACCTGTGTTTTTAGATTTCAAAAAGA
## 2 CTAACTGTCAAACAAAGATGGGCTAATTAAACGGATGCCAACTAATCGGAAAACTTCTGG
## 3 CCAGAAGCTCCAGCATGGTCCACCTACCAGAAGCTCCAGCATGATTTACCCATGAGTAGC
## 4 TAAGTACGGTAATGCAAGAAGGTGCAGCAGGCGTAGTGCATTACAGCAACACTGAAAAAC
## 5 TATGAGAGATGAAGAGATAAGGTCTAACTGTTACCTGGGCTGAACCCTTGGACTTCTAGG
## 6 CAGTAGCGAGTATTTACTAAGTACTTTCTATTTGCGAGGCCCTGATAAAAGTACTGTCCT

sessionInfo()

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-conda_cos6-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.5 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /pub/anaconda3/lib/R/lib/libRblas.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] knitr_1.22          forcats_0.4.0       stringr_1.4.0      
##  [4] dplyr_0.8.0.1       purrr_0.3.2         readr_1.3.1        
##  [7] tidyr_0.8.3         tibble_2.1.1        ggplot2_3.1.0      
## [10] tidyverse_1.2.1     GEOquery_2.50.5     Biobase_2.42.0     
## [13] BiocGenerics_0.28.0
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.1       cellranger_1.1.0 pillar_1.3.1     compiler_3.5.1  
##  [5] plyr_1.8.4       tools_3.5.1      evaluate_0.13    lubridate_1.7.4 
##  [9] jsonlite_1.6     nlme_3.1-137     gtable_0.3.0     lattice_0.20-38 
## [13] pkgconfig_2.0.2  rlang_0.3.3      cli_1.1.0        rstudioapi_0.10 
## [17] yaml_2.2.0       xfun_0.5         haven_2.1.0      withr_2.1.2     
## [21] httr_1.4.0       xml2_1.2.0       generics_0.0.2   hms_0.4.2       
## [25] grid_3.5.1       tidyselect_0.2.5 glue_1.3.1       R6_2.4.0        
## [29] fansi_0.4.0      readxl_1.3.1     limma_3.38.3     modelr_0.1.4    
## [33] magrittr_1.5     backports_1.1.3  scales_1.0.0     rvest_0.3.2     
## [37] assertthat_0.2.1 colorspace_1.4-1 utf8_1.1.4       stringi_1.4.3   
## [41] lazyeval_0.2.2   munsell_0.5.0    broom_0.5.1      crayon_1.3.4