Panel design is actually quite simple: you pick the regions you want to capture based on the goal of the experiment, and all we really need to do is turn those target regions into a BED file.

Below we'll use the BRCA1 and BRCA2 genes as an example. A BED is usually designed over the CDS regions of the genes, because introns tend to contain many low-complexity stretches (repeat regions, for instance), so intronic capture performance is usually poor and downstream analysis is harder.

First we need a genome annotation file. I downloaded the latest RefSeq GTF for GRCh37 from NCBI (https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz).

Download it and decompress it directly:

wget https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz
gzip -d GRCh37_latest_genomic.gtf.gz

Then take a look at the file format:

less GRCh37_latest_genomic.gtf

The records look roughly like this:

#gtf-version 2.2
#!genome-build GRCh37.p13
#!genome-build-accession NCBI_Assembly:GCF_000001405.25
#!annotation-date 10/22/2020
#!annotation-source NCBI Homo sapiens Updated Annotation Release 105.20201022
NC_000017.10	BestRefSeq	gene	41196312	41277381	.	-	.	gene_id "BRCA1"; transcript_id ""; db_xref "GeneID:672"; db_xref "HGNC:HGNC:1100"; db_xref "MIM:113705"; description "BRCA1 DNA repair associated"; gbkey "Gene"; gene "BRCA1"; gene_biotype "protein_coding"; gene_synonym "BRCAI"; gene_synonym "BRCC1"; gene_synonym "BROVCA1"; gene_synonym "FANCS"; gene_synonym "IRIS"; gene_synonym "PNCA4"; gene_synonym "PPP1R53"; gene_synonym "PSCP"; gene_synonym "RNF53";
NC_000017.10	BestRefSeq	exon	41277199	41277381	.	-	.	gene_id "BRCA1"; transcript_id "NR_027676.2"; db_xref "GeneID:672"; gbkey "misc_RNA"; gene "BRCA1"; product "BRCA1 DNA repair associated, transcript variant 6"; exon_number "1";
NC_000017.10	BestRefSeq	CDS	41219625	41219712	.	-	0	gene_id "BRCA1"; transcript_id "NM_007298.3"; db_xref "CCDS:CCDS11454.2"; db_xref "GeneID:672"; gbkey "CDS"; gene "BRCA1"; note "isoform 4 is encoded by transcript variant 4"; product "breast cancer type 1 susceptibility protein isoform 4"; protein_id "NP_009229.2"; exon_number "15";

Lines starting with '#' are header comments. Column 1 is the chromosome (NC_000017.10 corresponds to chr17), column 4 is the start position, and column 5 is the end position. Since we're designing over CDS regions, we only need the rows whose third column is CDS (exon records also include the flanking UTRs, so if you want the UTRs, design by exon instead).
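
If you want to double-check that these are the rows we are after, a quick throwaway Python snippet (optional, not part of the pipeline) can print the first few BRCA1 CDS records:

with open('GRCh37_latest_genomic.gtf') as f:
    shown = 0
    for line in f:
        if line.startswith('#'):
            continue
        cols = line.rstrip('\n').split('\t')
        # column 3 is the feature type, the last column holds the attributes
        if cols[2] == 'CDS' and 'gene_id "BRCA1"' in cols[8]:
            print(cols[0], cols[3], cols[4], cols[6])  # accession, start, end, strand
            shown += 1
            if shown == 5:
                break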

Next we extract the CDS regions of BRCA1 and BRCA2. To make this easy to extend later, put the gene names in a separate list file:

gene.list
----
BRCA1
BRCA2

Then create a new Python script and put the mapping from GRCh37 RefSeq accessions to chromosome names into it:

make_bed.py
----
chr_dic = {"NC_000001.10": '1',
           "NC_000002.11": "2",
           "NC_000003.11": "3",
           "NC_000004.11": "4",
           "NC_000005.9": "5",
           "NC_000006.11": "6",
           "NC_000007.13": "7",
           "NC_000008.10": "8",
           "NC_000009.11": "9",
           "NC_000010.10": "10",
           "NC_000011.9": "11",
           "NC_000012.11": "12",
           "NC_000013.10": "13",
           "NC_000014.8": "14",
           "NC_000015.9": "15",
           "NC_000016.9": "16",
           "NC_000017.10": "17",
           "NC_000018.9": "18",
           "NC_000019.9": "19",
           "NC_000020.10": "20",
           "NC_000021.8": "21",
           "NC_000022.10": "22",
           "NC_000023.10": "X",
           "NC_000024.9": "Y",
           "NC_012920.1": "MT"}

Next, turn the GTF into a dictionary keyed by gene name, with each value holding the CDS start and end positions (together with the transcript, exon number and chromosome). The gene name, transcript and exon number are all extracted from the last column of the GTF:

def phase_gtf(gtf_path, extract_type='CDS'):
    # Parse the RefSeq GTF into {gene: [[transcript, exonN, chrom, start, end], ...]}
    gene_dict = {}
    with open(gtf_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f if not l.startswith('#')]
    # keep only BestRefSeq records of the requested feature type (CDS by default)
    datas = [x for x in datas if x[2] == extract_type and x[1] == 'BestRefSeq']
    for _data in datas:
        # gene name, transcript ID and exon number all live in the attribute column (last field)
        gene = _data[-1].split(';')[0].replace('gene_id ', '').replace('"', '')
        trans = _data[-1].split(';')[1].replace('transcript_id ', '').replace('"', '').replace(' ', '')
        exon = _data[-1].split(';')[-2].replace('exon_number', '').replace('"', '').replace(' ', '')
        start = _data[3]
        end = _data[4]
        # skip scaffolds and patches that are not primary chromosomes
        if _data[0] not in chr_dic:
            continue
        gene_dict.setdefault(gene, []).append([trans, 'exon%s' % exon, chr_dic[_data[0]], start, end])
    return gene_dict
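
To see the shape of what this returns, a quick check (the comments only describe the structure, no specific coordinates implied):

gene_dict = phase_gtf('GRCh37_latest_genomic.gtf')
print(len(gene_dict['BRCA1']))   # number of CDS records collected for BRCA1
print(gene_dict['BRCA1'][0])     # one entry: [transcript, 'exonN', chrom, start, end]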

Then just pull the entries out of the dictionary and write them to a file:

gene_dict = phase_gtf('GRCh37_latest_genomic.gtf')

with open('gene.list', 'r') as f, open('raw.bed', 'w+') as f_o:
    for l in f:
        l = l.rstrip()
        if l not in gene_dict:
            print(l, "gene name not found")
            continue
        gene_detail = gene_dict[l]
        for _trans, _exon, _chr, _start, _end in gene_detail:
            f_o.write('\t'.join([_chr, str(_start), str(_end), l, _trans, _exon])+'\n')

Putting the pieces together, the full script is:

make_bed.py
----
chr_dic = {"NC_000001.10": '1',
           "NC_000002.11": "2",
           "NC_000003.11": "3",
           "NC_000004.11": "4",
           "NC_000005.9": "5",
           "NC_000006.11": "6",
           "NC_000007.13": "7",
           "NC_000008.10": "8",
           "NC_000009.11": "9",
           "NC_000010.10": "10",
           "NC_000011.9": "11",
           "NC_000012.11": "12",
           "NC_000013.10": "13",
           "NC_000014.8": "14",
           "NC_000015.9": "15",
           "NC_000016.9": "16",
           "NC_000017.10": "17",
           "NC_000018.9": "18",
           "NC_000019.9": "19",
           "NC_000020.10": "20",
           "NC_000021.8": "21",
           "NC_000022.10": "22",
           "NC_000023.10": "X",
           "NC_000024.9": "Y",
           "NC_012920.1": "MT"}
           
def phase_gtf(gtf_path, extract_type='CDS'):
    # Parse the RefSeq GTF into {gene: [[transcript, exonN, chrom, start, end], ...]}
    gene_dict = {}
    with open(gtf_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f if not l.startswith('#')]
    # keep only BestRefSeq records of the requested feature type (CDS by default)
    datas = [x for x in datas if x[2] == extract_type and x[1] == 'BestRefSeq']
    for _data in datas:
        # gene name, transcript ID and exon number all live in the attribute column (last field)
        gene = _data[-1].split(';')[0].replace('gene_id ', '').replace('"', '')
        trans = _data[-1].split(';')[1].replace('transcript_id ', '').replace('"', '').replace(' ', '')
        exon = _data[-1].split(';')[-2].replace('exon_number', '').replace('"', '').replace(' ', '')
        start = _data[3]
        end = _data[4]
        # skip scaffolds and patches that are not primary chromosomes
        if _data[0] not in chr_dic:
            continue
        gene_dict.setdefault(gene, []).append([trans, 'exon%s' % exon, chr_dic[_data[0]], start, end])
    return gene_dict
 
if __name__ == "__main__":
    gene_dict = phase_gtf('GRCh37_latest_genomic.gtf')

    with open('gene.list', 'r') as f, open('raw.bed', 'w+') as f_o:
        for l in f:
            l = l.rstrip()
            if l not in gene_dict:
                print(l, "gene name not found")
                continue
            gene_detail = gene_dict[l]
            for _trans, _exon, _chr, _start, _end in gene_detail:
                f_o.write('\t'.join([_chr, str(_start), str(_end), l, _trans, _exon])+'\n')

This produces a raw.bed file that looks like this:

17   41258473        41258543        BRCA1   NM_007297.4     exon3
17   41256885        41256973        BRCA1   NM_007297.4     exon4
17   41256139        41256278        BRCA1   NM_007297.4     exon5
17   41251792        41251897        BRCA1   NM_007297.4     exon6
17   41249261        41249306        BRCA1   NM_007297.4     exon7

Of course this isn't the final format, since it doesn't follow the BED conventions yet, and my habit is to merge the CDS regions of all transcripts before designing. If you look closely you'll see the file contains multiple transcripts; you could also just pick the longest transcript (a sketch of that alternative follows just below), but here we'll merge the CDS intervals, in a new Python script.
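
As an aside, here is a minimal sketch of the "just take the longest transcript" alternative mentioned above. It assumes the gene_dict structure produced by phase_gtf, uses total CDS length as the definition of "longest", and the helper name longest_transcript is made up purely for illustration:

def longest_transcript(gene_detail):
    # illustrative helper, not part of the scripts in this post
    # gene_detail: list of [transcript, exonN, chrom, start, end] entries for one gene
    length_by_trans = {}
    for trans, _exon, _chr, start, end in gene_detail:
        length_by_trans[trans] = length_by_trans.get(trans, 0) + (int(end) - int(start) + 1)
    best = max(length_by_trans, key=length_by_trans.get)   # transcript with the largest total CDS length
    return [x for x in gene_detail if x[0] == best]

In this post we stick with merging, since the merged intervals cover the coding bases of every transcript.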

First, read the file we just generated:

def read_bed(bed_path):
    # Read raw.bed back in as [chrom, start, end, gene, transcript]; the exon column is
    # dropped because exon numbers lose their meaning once intervals are merged
    with open(bed_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f if not l.startswith('#')]
    datas = [[x[0], int(x[1]), int(x[2]), x[3], x[4]] for x in datas]
    chr_list = list(set([x[0] for x in datas]))
    return chr_list, datas

Then we need to merge the overlapping intervals and, after merging, sort them by position. Here it's set up to process one chromosome at a time:

def handle_chr(data, _chr):
    # keep only this chromosome's intervals
    data = [x for x in data if x[0] == _chr]
    # repeatedly merge any pair of overlapping intervals until a full pass finds no overlap
    check_num = 0
    while check_num == 0:
        check_num = 1
        data_backup = data[:]
        for _idx, _data in enumerate(data):
            for i in range(_idx+1, len(data)):
                # overlap test: an endpoint of interval _idx falls inside interval i,
                # or interval _idx completely contains interval i
                if data[i][1] <= data[_idx][1] <= data[i][2] or data[i][1] <= data[_idx][2] <= data[i][2] or \
                        (data[_idx][1] <= data[i][1] and data[_idx][2] >= data[i][2]):
                    num_list = [data[i][1], data[i][2], data[_idx][1], data[_idx][2]]
                    num_list.sort()
                    # replace the two intervals with their union (smallest start, largest end)
                    new_line = [data[i][0], num_list[0], num_list[-1], data[i][3], data[i][4]]
                    data_backup.pop(i)
                    data_backup.pop(_idx)
                    data_backup.append(new_line)
                    check_num = 0
                    break
            if check_num == 0:
                break
        data = data_backup[:]
    # bubble-sort the merged intervals by start position
    check_num = 0
    while check_num == 0:
        check_num = 1
        for i in range(0, len(data)-1):
            if data[i][1] > data[i+1][1]:
                tmp_data = data[i][:]
                data[i] = data[i+1][:]
                data[i+1] = tmp_data[:]
                check_num = 0
    return data
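
For reference, the same merge can also be written as a single sort-then-sweep pass. The sketch below (handle_chr_sweep is a name made up for illustration, and it is not used anywhere below) takes and returns the same row shape as handle_chr:

def handle_chr_sweep(data, _chr):
    # sort this chromosome's intervals by start, then sweep once, extending as we go
    rows = sorted([x for x in data if x[0] == _chr], key=lambda x: x[1])
    merged = []
    for row in rows:
        if merged and row[1] <= merged[-1][2]:
            # overlaps the previous interval: extend its end
            merged[-1][2] = max(merged[-1][2], row[2])
        else:
            merged.append(row[:])   # copy so the caller's rows are not modified
    return merged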

All that's left is writing the output file, so the full script is:

merge_bed.py
----
def read_bed(bed_path):
    # Read raw.bed back in as [chrom, start, end, gene, transcript]; the exon column is
    # dropped because exon numbers lose their meaning once intervals are merged
    with open(bed_path, 'r') as f:
        datas = [l.rstrip().split('\t') for l in f if not l.startswith('#')]
    datas = [[x[0], int(x[1]), int(x[2]), x[3], x[4]] for x in datas]
    chr_list = list(set([x[0] for x in datas]))
    return chr_list, datas
    
def handle_chr(data, _chr):
    # keep only this chromosome's intervals
    data = [x for x in data if x[0] == _chr]
    # repeatedly merge any pair of overlapping intervals until a full pass finds no overlap
    check_num = 0
    while check_num == 0:
        check_num = 1
        data_backup = data[:]
        for _idx, _data in enumerate(data):
            for i in range(_idx+1, len(data)):
                # overlap test: an endpoint of interval _idx falls inside interval i,
                # or interval _idx completely contains interval i
                if data[i][1] <= data[_idx][1] <= data[i][2] or data[i][1] <= data[_idx][2] <= data[i][2] or \
                        (data[_idx][1] <= data[i][1] and data[_idx][2] >= data[i][2]):
                    num_list = [data[i][1], data[i][2], data[_idx][1], data[_idx][2]]
                    num_list.sort()
                    # replace the two intervals with their union (smallest start, largest end)
                    new_line = [data[i][0], num_list[0], num_list[-1], data[i][3], data[i][4]]
                    data_backup.pop(i)
                    data_backup.pop(_idx)
                    data_backup.append(new_line)
                    check_num = 0
                    break
            if check_num == 0:
                break
        data = data_backup[:]
    # bubble-sort the merged intervals by start position
    check_num = 0
    while check_num == 0:
        check_num = 1
        for i in range(0, len(data)-1):
            if data[i][1] > data[i+1][1]:
                tmp_data = data[i][:]
                data[i] = data[i+1][:]
                data[i+1] = tmp_data[:]
                check_num = 0
    return data

if __name__ == "__main__":
    chr_list, datas = read_bed("raw.bed")
    total_data = []
    for i in range(1, 23):
        chr_data = handle_chr(datas, str(i))
        total_data += chr_data
    for _chr in ["X", "Y", "MT"]:
        total_data += handle_chr(datas, _chr)
    with open('merged_cds.bed', 'w') as f:
        f.write('\n'.join(["\t".join([str(y) for y in x]) for x in total_data])+'\n')

Running the script produces merged_cds.bed. Give the merged result a quick check before moving on (one way is sketched right below), and once it looks fine, reshape it into the final BED layout with the awk one-liner that follows:
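
Here is a minimal sketch of such a check, assuming the merged_cds.bed columns written above (chrom, start, end, gene, transcript). It simply asserts that, within each chromosome, intervals are sorted and do not overlap:

with open('merged_cds.bed') as f:
    rows = [l.rstrip('\n').split('\t') for l in f if l.strip()]
prev_end = {}
for chrom, start, end, gene, trans in rows:
    start, end = int(start), int(end)
    assert start <= end, (chrom, start, end)
    if chrom in prev_end:
        # each interval must start after the previous one on the same chromosome ends
        assert start > prev_end[chrom], (chrom, start, prev_end[chrom])
    prev_end[chrom] = end
print('merged_cds.bed looks consistent')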

awk '{print $1"\t"$2"\t"$3"\t+\t"$1"_"$2"_"$3}' merged_cds.bed > final.bed

The resulting final.bed looks like this:

17   41258473        41258543        +   17_41258473_41258543
17   41256885        41256973        +   17_41256885_41256973
17   41256139        41256278        +   17_41256139_41256278

(PS: don't forget to cat a header line into the file; some tools require a header.)
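
One more caveat: the coordinates above come straight from the GTF, which is 1-based and fully closed, while the BED spec uses a 0-based, half-open start. If your downstream tool is strict about that, shift the start column down by one; a minimal sketch, with final.0based.bed as a made-up output name:

with open('final.bed') as f_in, open('final.0based.bed', 'w') as f_out:
    for line in f_in:
        cols = line.rstrip('\n').split('\t')
        if cols[1].isdigit():                  # skip a header line if one has been added
            cols[1] = str(int(cols[1]) - 1)    # 1-based start -> 0-based BED start
        f_out.write('\t'.join(cols) + '\n')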