Designing a panel is actually quite simple: choose the regions you want to capture according to the goal of the experiment, and all we need to do is turn those regions into a BED file.
Let's use BRCA1 and BRCA2 as an example. The BED is usually designed over the CDS regions of the genes, because introns often contain many low-complexity sequences (repeats, for instance), so capture performance in introns tends to be poor and downstream analysis is harder.
First we need a genome annotation file. I downloaded the latest GTF from NCBI (https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz).
Download it and decompress it directly:
wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz
gzip -d GRCh37_latest_genomic.gtf.gz
Then take a look at the file format:
less GRCh37_latest_genomic.gtf
It contains roughly the following kinds of lines:
#gtf-version 2.2
#!genome-build GRCh37.p13
#!genome-build-accession NCBI_Assembly:GCF_000001405.25
#!annotation-date 10/22/2020
#!annotation-source NCBI Homo sapiens Updated Annotation Release 105.20201022
NC_000017.10 BestRefSeq gene 41196312 41277381 . - . gene_id "BRCA1"; transcript_id ""; db_xref "GeneID:672"; db_xref "HGNC:HGNC:1100"; db_xref "MIM:113705"; description "BRCA1 DNA repair associated"; gbkey "Gene"; gene "BRCA1"; gene_biotype "protein_coding"; gene_synonym "BRCAI"; gene_synonym "BRCC1"; gene_synonym "BROVCA1"; gene_synonym "FANCS"; gene_synonym "IRIS"; gene_synonym "PNCA4"; gene_synonym "PPP1R53"; gene_synonym "PSCP"; gene_synonym "RNF53";
NC_000017.10 BestRefSeq exon 41277199 41277381 . - . gene_id "BRCA1"; transcript_id "NR_027676.2"; db_xref "GeneID:672"; gbkey "misc_RNA"; gene "BRCA1"; product "BRCA1 DNA repair associated, transcript variant 6"; exon_number "1";
NC_000017.10 BestRefSeq CDS 41219625 41219712 . - 0 gene_id "BRCA1"; transcript_id "NM_007298.3"; db_xref "CCDS:CCDS11454.2"; db_xref "GeneID:672"; gbkey "CDS"; gene "BRCA1"; note "isoform 4 is encoded by transcript variant 4"; product "breast cancer type 1 susceptibility protein isoform 4"; protein_id "NP_009229.2"; exon_number "15";
Lines starting with '#' are comments. Column 1 is the sequence accession (NC_000017.10 corresponds to chr17), column 4 is the start position, and column 5 is the end position. Since we are designing on the CDS, we only need to extract the rows whose third column is CDS (exon records also include the UTRs at both ends, so extract exon instead if you want the UTRs).
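Before writing the full script, it can help to eyeball a few of these records for one gene. A minimal sketch (the file name matches the download step above; stopping after 5 matches is arbitrary):
# Print the first few BRCA1 CDS records to confirm the column layout described above.
with open('GRCh37_latest_genomic.gtf') as f:
    shown = 0
    for line in f:
        if line.startswith('#'):
            continue
        cols = line.rstrip('\n').split('\t')
        # column 3 (index 2) is the feature type; column 9 (index 8) holds the attributes
        if cols[2] == 'CDS' and 'gene_id "BRCA1"' in cols[8]:
            print(cols[0], cols[3], cols[4], cols[8][:60])
            shown += 1
            if shown == 5:
                break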
Now let's extract the CDS regions of BRCA1 and BRCA2. To make it easy to extend later, the gene names go into a separate list file:
gene.list
----
BRCA1
BRCA2
Then create a new Python script and put the GRCh37 chromosome accession mapping into it:
make_bed.py
----
chr_dic = {"NC_000001.10": '1',
"NC_000002.11": "2",
"NC_000003.11": "3",
"NC_000004.11": "4",
"NC_000005.9": "5",
"NC_000006.11": "6",
"NC_000007.13": "7",
"NC_000008.10": "8",
"NC_000009.11": "9",
"NC_000010.10": "10",
"NC_000011.9": "11",
"NC_000012.11": "12",
"NC_000013.10": "13",
"NC_000014.8": "14",
"NC_000015.9": "15",
"NC_000016.9": "16",
"NC_000017.10": "17",
"NC_000018.9": "18",
"NC_000019.9": "19",
"NC_000020.10": "20",
"NC_000021.8": "21",
"NC_000022.10": "22",
"NC_000023.10": "X",
"NC_000024.9": "Y",
"NC_012920.1": "MT"}
Next, parse the GTF into a dictionary: the keys are gene names and each value is a list of [transcript, exon label, chromosome, start, end] entries for the CDS records; the gene name, transcript ID, and exon number are all pulled out of the last (attributes) column.
def phase_gtf(gtf_path, extract_type='CDS'):
    gene_dict = {}
    with open(gtf_path, 'r') as f:
        # drop the '#' comment lines and split each record into its 9 tab-separated columns
        datas = [l.rstrip().split('\t') for l in f.readlines() if not l.startswith('#')]
    # keep only the requested feature type (CDS by default) from the BestRefSeq source
    datas = [x for x in datas if x[2] == extract_type and x[1] == 'BestRefSeq']
    for _data in datas:
        # gene name, transcript ID and exon number all live in the attributes (last) column
        gene = _data[-1].split(';')[0].replace('gene_id ', '').replace('"', '')
        trans = _data[-1].split(';')[1].replace('transcript_id ', '').replace('"', '').replace(' ', '')
        exon = _data[-1].split(';')[-2].replace('exon_number', '').replace('"', '').replace(' ', '')
        start = _data[3]
        end = _data[4]
        # skip scaffolds/patches that are not in the chromosome mapping
        if _data[0] not in chr_dic:
            continue
        if gene not in gene_dict:
            gene_dict[gene] = []
        gene_dict[gene].append([trans, 'exon%s' % exon, chr_dic[_data[0]], start, end])
    return gene_dict
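Given the sample CDS line shown earlier (transcript NM_007298.3, exon 15 on chr17), the corresponding entry in the returned dictionary would look roughly like:
gene_dict['BRCA1']
# -> [..., ['NM_007298.3', 'exon15', '17', '41219625', '41219712'], ...]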
Then dump the dictionary into a file:
gene_dict = phase_gtf('GRCh37_latest_genomic.gtf')
with open('gene.list', 'r') as f, open('raw.bed', 'w+') as f_o:
    for l in f:
        l = l.rstrip()
        if l not in gene_dict:
            print(l, "gene name not found")
            continue
        gene_detail = gene_dict[l]
        for _trans, _exon, _chr, _start, _end in gene_detail:
            f_o.write('\t'.join([_chr, str(_start), str(_end), l, _trans, _exon]) + '\n')
Putting the pieces together:
make_bed.py
----
chr_dic = {"NC_000001.10": '1',
"NC_000002.11": "2",
"NC_000003.11": "3",
"NC_000004.11": "4",
"NC_000005.9": "5",
"NC_000006.11": "6",
"NC_000007.13": "7",
"NC_000008.10": "8",
"NC_000009.11": "9",
"NC_000010.10": "10",
"NC_000011.9": "11",
"NC_000012.11": "12",
"NC_000013.10": "13",
"NC_000014.8": "14",
"NC_000015.9": "15",
"NC_000016.9": "16",
"NC_000017.10": "17",
"NC_000018.9": "18",
"NC_000019.9": "19",
"NC_000020.10": "20",
"NC_000021.8": "21",
"NC_000022.10": "22",
"NC_000023.10": "X",
"NC_000024.9": "Y",
"NC_012920.1": "MT"}
def phase_gtf(gtf_path, extract_type='CDS'):
    gene_dict = {}
    with open(gtf_path, 'r') as f:
        # drop the '#' comment lines and split each record into its 9 tab-separated columns
        datas = [l.rstrip().split('\t') for l in f.readlines() if not l.startswith('#')]
    # keep only the requested feature type (CDS by default) from the BestRefSeq source
    datas = [x for x in datas if x[2] == extract_type and x[1] == 'BestRefSeq']
    for _data in datas:
        # gene name, transcript ID and exon number all live in the attributes (last) column
        gene = _data[-1].split(';')[0].replace('gene_id ', '').replace('"', '')
        trans = _data[-1].split(';')[1].replace('transcript_id ', '').replace('"', '').replace(' ', '')
        exon = _data[-1].split(';')[-2].replace('exon_number', '').replace('"', '').replace(' ', '')
        start = _data[3]
        end = _data[4]
        # skip scaffolds/patches that are not in the chromosome mapping
        if _data[0] not in chr_dic:
            continue
        if gene not in gene_dict:
            gene_dict[gene] = []
        gene_dict[gene].append([trans, 'exon%s' % exon, chr_dic[_data[0]], start, end])
    return gene_dict

if __name__ == "__main__":
    gene_dict = phase_gtf('GRCh37_latest_genomic.gtf')
    with open('gene.list', 'r') as f, open('raw.bed', 'w+') as f_o:
        for l in f:
            l = l.rstrip()
            if l not in gene_dict:
                print(l, "gene name not found")
                continue
            gene_detail = gene_dict[l]
            for _trans, _exon, _chr, _start, _end in gene_detail:
                f_o.write('\t'.join([_chr, str(_start), str(_end), l, _trans, _exon]) + '\n')
This produces a file that looks like this:
17 41258473 41258543 BRCA1 NM_007297.4 exon3
17 41256885 41256973 BRCA1 NM_007297.4 exon4
17 41256139 41256278 BRCA1 NM_007297.4 exon5
17 41251792 41251897 BRCA1 NM_007297.4 exon6
17 41249261 41249306 BRCA1 NM_007297.4 exon7
Of course this is not the final format: it does not yet follow the BED conventions, and if you look closely you will see that it contains CDS records from multiple transcripts. I prefer to merge the CDS intervals of all transcripts before designing the panel; alternatively, you could simply keep the longest transcript (a sketch of that option follows below). Here we will go ahead and merge the CDS intervals, in a new Python script.
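For reference, the "keep the longest transcript" option mentioned above could be sketched against raw.bed roughly like this (the output file name longest_transcript.bed is only a placeholder; this is not part of the pipeline we continue with):
# Sketch: for each gene, keep only the transcript with the largest total CDS length.
from collections import defaultdict

rows_by_trans = defaultdict(list)   # (gene, transcript) -> rows from raw.bed
with open('raw.bed') as f:
    for line in f:
        _chr, start, end, gene, trans, exon = line.rstrip('\n').split('\t')
        rows_by_trans[(gene, trans)].append([_chr, int(start), int(end), gene, trans, exon])

# total CDS length per (gene, transcript); GTF coordinates are 1-based inclusive
length = {k: sum(r[2] - r[1] + 1 for r in v) for k, v in rows_by_trans.items()}

# pick the longest transcript for each gene
best = {}
for (gene, trans), l in length.items():
    if gene not in best or l > length[(gene, best[gene])]:
        best[gene] = trans

with open('longest_transcript.bed', 'w') as f_o:
    for (gene, trans), rows in rows_by_trans.items():
        if trans == best[gene]:
            for r in rows:
                f_o.write('\t'.join(str(x) for x in r) + '\n')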
First, read the file we just generated:
def read_bed(bed_path):
    with open(bed_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f.readlines() if not l.startswith('#')]
    # keep chrom, start, end, gene, transcript (the exon label is dropped here)
    datas = [[x[0], int(x[1]), int(x[2]), x[3], x[4]] for x in datas]
    chr_list = list(set([x[0] for x in datas]))
    return chr_list, datas
Then we need to merge overlapping intervals and sort them afterwards; here the work is done one chromosome at a time:
def handle_chr(data, _chr):
    # keep only the intervals on this chromosome
    data = [x for x in data if x[0] == _chr]
    # repeatedly merge any pair of overlapping intervals until no overlap is left
    check_num = 0
    while check_num == 0:
        check_num = 1
        data_backup = data[:]
        for _idx, _data in enumerate(data):
            for i in range(_idx + 1, len(data)):
                # overlap test: either end of interval _idx falls inside interval i,
                # or interval _idx completely contains interval i
                if data[i][1] <= data[_idx][1] <= data[i][2] or data[i][1] <= data[_idx][2] <= data[i][2] or \
                        (data[_idx][1] <= data[i][1] and data[_idx][2] >= data[i][2]):
                    num_list = [data[i][1], data[i][2], data[_idx][1], data[_idx][2]]
                    num_list.sort()
                    # the merged interval runs from the smallest start to the largest end
                    new_line = [data[i][0], num_list[0], num_list[-1], data[i][3], data[i][4]]
                    data_backup.pop(i)      # pop the larger index first so _idx stays valid
                    data_backup.pop(_idx)
                    data_backup.append(new_line)
                    check_num = 0
                    break
            if check_num == 0:
                break
        data = data_backup[:]
    # simple bubble sort of the merged intervals by start position
    check_num = 0
    while check_num == 0:
        check_num = 1
        for i in range(0, len(data) - 1):
            if data[i][1] > data[i + 1][1]:
                tmp_data = data[i][:]
                data[i] = data[i + 1][:]
                data[i + 1] = tmp_data[:]
                check_num = 0
    return data
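A quick way to convince yourself that the merging and sorting behave as intended is to feed handle_chr a few hand-made intervals (toy values, not real coordinates):
toy = [
    ['17', 100, 200, 'GENE', 'NM_000000.0'],
    ['17', 150, 300, 'GENE', 'NM_000000.0'],
    ['17', 50, 80, 'GENE', 'NM_000000.0'],
]
print(handle_chr(toy, '17'))
# the two overlapping intervals collapse into [100, 300] and the result is sorted by start:
# [['17', 50, 80, ...], ['17', 100, 300, ...]]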
All that is left is writing the output, so the combined script is:
merge_bed.py
---
def read_bed(bed_path):
    with open(bed_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f.readlines() if not l.startswith('#')]
    # keep chrom, start, end, gene, transcript (the exon label is dropped here)
    datas = [[x[0], int(x[1]), int(x[2]), x[3], x[4]] for x in datas]
    chr_list = list(set([x[0] for x in datas]))
    return chr_list, datas

def handle_chr(data, _chr):
    # keep only the intervals on this chromosome
    data = [x for x in data if x[0] == _chr]
    # repeatedly merge any pair of overlapping intervals until no overlap is left
    check_num = 0
    while check_num == 0:
        check_num = 1
        data_backup = data[:]
        for _idx, _data in enumerate(data):
            for i in range(_idx + 1, len(data)):
                # overlap test: either end of interval _idx falls inside interval i,
                # or interval _idx completely contains interval i
                if data[i][1] <= data[_idx][1] <= data[i][2] or data[i][1] <= data[_idx][2] <= data[i][2] or \
                        (data[_idx][1] <= data[i][1] and data[_idx][2] >= data[i][2]):
                    num_list = [data[i][1], data[i][2], data[_idx][1], data[_idx][2]]
                    num_list.sort()
                    # the merged interval runs from the smallest start to the largest end
                    new_line = [data[i][0], num_list[0], num_list[-1], data[i][3], data[i][4]]
                    data_backup.pop(i)      # pop the larger index first so _idx stays valid
                    data_backup.pop(_idx)
                    data_backup.append(new_line)
                    check_num = 0
                    break
            if check_num == 0:
                break
        data = data_backup[:]
    # simple bubble sort of the merged intervals by start position
    check_num = 0
    while check_num == 0:
        check_num = 1
        for i in range(0, len(data) - 1):
            if data[i][1] > data[i + 1][1]:
                tmp_data = data[i][:]
                data[i] = data[i + 1][:]
                data[i + 1] = tmp_data[:]
                check_num = 0
    return data

if __name__ == "__main__":
    chr_list, datas = read_bed("raw.bed")
    total_data = []
    # merge and sort chromosome by chromosome: 1-22, then X, Y, MT
    for i in range(1, 23):
        chr_data = handle_chr(datas, str(i))
        total_data += chr_data
    for _chr in ["X", "Y", "MT"]:
        total_data += handle_chr(datas, _chr)
    with open('merged_cds.bed', 'w') as f:
        f.write('\n'.join(["\t".join([str(y) for y in x]) for x in total_data]) + '\n')
Running this produces merged_cds.bed. Give the merged file a quick check before moving on.
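For example, a small sketch that verifies no overlapping intervals remain within each chromosome (assuming the merged_cds.bed just written):
# Walk through merged_cds.bed and flag any interval that still overlaps the previous one on its chromosome.
prev_end = {}
with open('merged_cds.bed') as f:
    for line in f:
        _chr, start, end = line.rstrip('\n').split('\t')[:3]
        start, end = int(start), int(end)
        if _chr in prev_end and start <= prev_end[_chr]:
            print('overlap or unsorted interval:', _chr, start, end)
        prev_end[_chr] = end
If everything looks fine, reshape it into the final BED layout: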
awk '{print $1"\t"$2"\t"$3"\t+\t"$1"_"$2"_"$3}' merged_cds.bed > final.bed
The resulting file looks like this:
17 41258473 41258543 + 17_41258473_41258543
17 41256885 41256973 + 17_41256885_41256973
17 41256139 41256278 + 17_41256139_41256278
(PS: don't forget to cat a header line onto the file; some software requires one.)
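Incidentally, GTF coordinates are 1-based and inclusive while a strictly standard BED start is 0-based, so purists would shift the second column down by 1. If you would rather do this last step in Python than awk, a rough equivalent that also does that shift and writes a placeholder header (adjust the header, or drop the shift, to match whatever your downstream software expects):
# Rough Python equivalent of the awk step, with the start shifted to BED's 0-based convention.
with open('merged_cds.bed') as f_in, open('final.bed', 'w') as f_out:
    f_out.write('#chrom\tstart\tend\tstrand\tname\n')   # placeholder header
    for line in f_in:
        _chr, start, end = line.rstrip('\n').split('\t')[:3]
        new_start = str(int(start) - 1)                  # 1-based GTF start -> 0-based BED start
        name = '_'.join([_chr, new_start, end])
        f_out.write('\t'.join([_chr, new_start, end, '+', name]) + '\n')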