Scraping Text with a Python Web Crawler

Reposted

网络小墨舞风 2023-12-08 22:53:10


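No public corpus of sanjuban (三句半, "three and a half lines") is currently available, so this post scrapes sanjuban texts that a few websites have published openly.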

The work has two parts:

Scraping the data

Cleaning the data

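Scraping the data

Taking the sanjuban data on http://p.onegreen.net/JuBen as an example, this section walks through the scraping workflow in Python.

1. First, search the site for the keyword “三句半” to get a list of result pages. Press F12 to open the browser's developer tools, inspect the elements you need, and work out which addresses to scrape.

The address bar of the results page (boxed in red at the top of the original screenshot) shows the URL of the search-results list, i.e. the outer URL: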

url= http://p.onegreen.net/JuBen/Search.asp?ModuleName=Article&Field=Title&Keyword=%C8%FD%BE%E4%B0%EB&ClassID=0&SpecialID=0&page=1

The page parameter at the end of the URL ranges over [1, 24].
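The Keyword value in this URL is the search term 三句半 percent-encoded from its GB2312 bytes rather than from UTF-8. As a minimal sketch using only the standard library, the URL for any keyword and page could be rebuilt as follows; build_search_url and SEARCH_URL are hypothetical names introduced for illustration and are not part of the original script.

```python
from urllib.parse import quote

# Template copied from the outer URL above, with keyword and page left open
SEARCH_URL = ('http://p.onegreen.net/JuBen/Search.asp'
              '?ModuleName=Article&Field=Title&Keyword={keyword}'
              '&ClassID=0&SpecialID=0&page={page}')


def build_search_url(keyword, page):
    # Percent-encode the keyword from its GB2312 bytes, not the UTF-8 default
    return SEARCH_URL.format(keyword=quote(keyword, encoding='gb2312'), page=page)


print(build_search_url('三句半', 1))
```

Encoding with the default UTF-8 would produce a different Keyword value from the one shown in the URL above.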

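The whole list of search results sits inside the tag <div id="main2">. Each individual result is an <a class="LinkSearchResult"> tag whose relative address, for example href="/JuBen/HTML/219105.html", is the inner URL.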
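To make that markup concrete, here is a small sketch that pulls the inner URLs out of a results page and turns them into absolute addresses. It deliberately uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages; ResultLinkParser, the HTML fragment, and page_url are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResultLinkParser(HTMLParser):
    """Collect the href of every <a class="LinkSearchResult"> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and 'LinkSearchResult' in classes and attrs.get('href'):
            self.links.append(attrs['href'])


# Made-up fragment shaped like the markup described above
sample_html = '''
<div id="main2">
  <a class="LinkSearchResult" href="/JuBen/HTML/219105.html">result</a>
  <a class="Other" href="/JuBen/HTML/000000.html">not a result</a>
</div>
'''
page_url = 'http://p.onegreen.net/JuBen/Search.asp?page=1'

parser = ResultLinkParser()
parser.feed(sample_html)
# urljoin resolves each relative href against the page it came from
inner_urls = [urljoin(page_url, href) for href in parser.links]
print(inner_urls)
```

Resolving with urljoin avoids having to trim the leading slash from the href by hand.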


Store the URLs of all the search-results pages (the outer URLs) in downloadUrls:

import json

import requests
from bs4 import BeautifulSoup


def set_download_urls():
    # Outer URLs: one search-results page per value of `page`
    baseUrl = 'http://p.onegreen.net/JuBen/Search.asp?ModuleName=Article&Field=Title&Keyword=%C8%FD%BE%E4%B0%EB&ClassID=0&SpecialID=0&page='
    # The page range given in the text is [1, 24]
    return [baseUrl + str(i) for i in range(1, 25)]

Starting from the outer URLs, collect the URL of every search result (the inner URLs) in a dict:

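def get_download_url():
    # Fetch every results page and collect its <a class="LinkSearchResult"> tags
    downloadUrls = set_download_urls()
    articles = []
    for url in downloadUrls:
        req = requests.get(url, timeout=10)
        req.encoding = 'gb2312'  # decode the response as GB2312
        articles_bf = BeautifulSoup(req.text, 'html.parser')
        articles.append(articles_bf.find_all('a', class_='LinkSearchResult'))
    return articles

def read_article_info():
    # Map a running index to the absolute URL of each article (the inner URLs)
    articles = get_download_url()
    baseUrl = 'http://p.onegreen.net/'
    urls = {}
    i = 0
    for each in articles:
        for item in each:
            # href starts with '/', which baseUrl already ends with
            urls[i] = baseUrl + item.get('href')[1:]
            i += 1
    return urls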

2. Open an individual sanjuban page, press F12 to open the developer tools, and find the tag that holds the sanjuban text: <div id="main2">.


BeautifulSoup makes it quick and easy to extract the content of an HTML tag:

def save_data():
    # Download every article found by the search
    urls = read_article_info()
    for key, value in urls.items():
        get_content(key, value)

def get_content(title, url):
    print(str(title) + '---->' + url)
    req = requests.get(url, timeout=10)
    req.encoding = 'gb2312'  # decode the response as GB2312
    text_bf = BeautifulSoup(req.text, 'html.parser')
    article = text_bf.find('div', id='main2')
    if article is None:  # skip pages that lack the expected container
        return
    write_item_to_file(article.get_text())

def write_item_to_file(item):
    print('Writing data ====> ' + str(item))
    with open('sanjuban.txt', 'a', encoding='utf-8') as f:
        # One JSON string per line; newlines inside the text are escaped as \n
        f.write(json.dumps(item, ensure_ascii=False) + '\n')
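Because write_item_to_file serializes each page with json.dumps, a whole page ends up on a single line wrapped in double quotes, with its line breaks escaped. The short sketch below illustrates that round trip on a made-up sample string (not scraped data); it explains why the cleaning stage later has to remove the quotes and split on "\n".

```python
import json

page_text = '甲：第一句\n乙：第二句\n'  # made-up sample, not scraped data

# Serialize the way write_item_to_file does: one quoted JSON string
line = json.dumps(page_text, ensure_ascii=False)
print(line)

# Decoding the line gives back the original multi-line text
restored = json.loads(line)
print(restored.split('\n'))
```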

3. Those are all the functions. To run the scraper, a single call is enough:

save_data()

The scraped data contains a lot of noise and needs further cleaning before it can be used.


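PS: Different sites may keep the text under different tag names or tag types, so the BeautifulSoup calls have to be adapted to each site.

Cleaning the data

This stage relies heavily on regular expressions.

1. Strip the double quotes from both ends of each line, split the string on "\n", drop empty strings, drop strings that contain certain words or symbols, and write the rest out line by line: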

import json
import re
import string


def organizeToSanjuban(inputFile, outputFile):
    with open(inputFile, 'r', encoding='UTF-8') as fin:
        txt_originlist = fin.readlines()

    with open(outputFile, 'w', encoding='UTF-8') as fout:
        for txt_origin in txt_originlist:
            # Each line is a JSON string written by json.dumps; decoding it
            # removes the surrounding quotes and restores the real newlines
            txt_origin = json.loads(txt_origin)
            for splits in txt_origin.split("\n"):
                splits = splits.strip()
                if not splits:
                    continue
                # Skip lines containing boilerplate words or decorative symbols
                if any(s in splits for s in ("（载入中...）", "台词", "大全", "★", "【")):
                    continue
                fout.write(splits + '\n')
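The filtering rule in step 1 can be tried on a single string without touching any files. The sketch below is a simplified illustration: keep_lines and NOISE_MARKERS are hypothetical names, and the sample text is invented rather than taken from the scraped data.

```python
# Words and symbols that mark a line as noise (same list as in step 1)
NOISE_MARKERS = ("（载入中...）", "台词", "大全", "★", "【")


def keep_lines(page_text):
    """Return the non-empty lines that contain none of the noise markers."""
    kept = []
    for line in page_text.split("\n"):
        line = line.strip()
        if line and not any(marker in line for marker in NOISE_MARKERS):
            kept.append(line)
    return kept


# Invented sample: a title line, a blank line, two real lines and an ad line
sample = "三句半台词大全\n\n  甲：锣鼓一敲响连天  \n★ 推荐阅读\n乙：我们四个来表演\n"
print(keep_lines(sample))
```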

The result is a sanjuban dataset whose formatting is still fairly messy.


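2. Building on the previous step, strip leading upper- and lower-case letters and digits from each line, strip punctuation from both ends, and remove the speaker labels 甲, 乙, 丙 and 丁, the brackets （）, () and 《》 together with their contents, and any redundant spaces.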
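As a rough illustration of these clean-up rules, the sketch below applies a reduced set of them to one invented line. clean_line is a hypothetical helper that covers only bracket removal, speaker labels, edge punctuation and whitespace; it is not the full function from this post.

```python
import re


def clean_line(line):
    # Remove ASCII and full-width brackets together with their contents
    line = re.sub(r"\(.*?\)|（.*?）|《.*?》", "", line)
    # Replace the speaker labels 甲/乙/丙/丁 with a space
    line = re.sub(r"[甲乙丙丁]", " ", line)
    # Trim punctuation, underscores and whitespace from both ends
    line = re.sub(r"^[\W_]+|[\W_]+$", "", line)
    # Collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", line)


print(clean_line("甲：（上前一步）今天大家  来相聚！\n"))
```

Note that the speaker-label rule removes 甲, 乙, 丙 and 丁 wherever they occur, so a line that legitimately contains one of those characters loses it too.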

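After this processing, a line may be: 1) several sanjuban, 2) a single line of a sanjuban, or 3) one complete sanjuban.

A length threshold on the string decides which case applies:

· If the string is a single line of a sanjuban, delete lines that consist only of letters or only of digits, then append a punctuation mark according to the line's position;

· If the string holds one or several sanjuban, split it into short phrases at punctuation marks, drop the empty phrases, and combine every four phrases into one sanjuban.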
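The second case above, splitting at punctuation and regrouping every four phrases, can be sketched in isolation like this. group_into_sanjuban is a hypothetical name, the delimiter set is shortened for readability, and the sample verse is invented.

```python
import re


def group_into_sanjuban(text):
    """Split text on punctuation and join every four phrases into one line."""
    pieces = [p.strip() for p in re.split(r"[ ,.!?;:，。！？；：]", text)]
    pieces = [p for p in pieces if p]  # drop empty phrases
    if len(pieces) % 4 != 0:           # incomplete group: discard the line
        return []
    groups = []
    for start in range(0, len(pieces), 4):
        a, b, c, d = pieces[start:start + 4]
        # Odd phrases end with a comma, even phrases with a full stop
        groups.append(f"{a}，{b}。{c}，{d}。")
    return groups


print(group_into_sanjuban("锣鼓一敲响连天 我们四个来表演 说得不好别见怪 包涵"))
```

Discarding lines whose phrase count is not a multiple of four loses some data, but it keeps misaligned groups out of the output.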

def normalizeToSanjuban(inputFile, outputFile):
    with open(inputFile, 'r', encoding='UTF-8') as fin:
        txt_originlist = fin.readlines()

    # Characters to keep: ASCII letters and digits, common punctuation,
    # whitespace and CJK ideographs. `+-;` is a range covering + , - . / 0-9 : ;
    rule = re.compile(r'[^a-zA-Z0-9.,;？！“”‘’@#￥%…&×—+-;；，。～、|\s:：\u4e00-\u9fa5]+')
    prefix = re.compile(r'^[ABCDabcd]*[0-9]*')  # leading speaker letter and/or number

    with open(outputFile, 'w', encoding='UTF-8') as fout:
        ii = 1        # position (1-4) of the next short line within the current group
        txt_out = ""  # group being assembled from consecutive short lines
        for txt_origin in txt_originlist:
            # Strip leading letters and digits
            txt_process = txt_origin.lstrip(string.ascii_letters + string.digits)
            # Remove brackets together with the text inside them
            txt_process = re.sub(r"\(.*?\)", "", txt_process)  # (xx)
            txt_process = re.sub(r"（.*?）", "", txt_process)    # （xx）, e.g. （鞠躬）
            txt_process = re.sub(r"《.*?》", "", txt_process)    # 《xx》
            txt_process = re.sub(r"〈.*?〉", "", txt_process)    # 〈xx〉
            # Replace the speaker labels 甲乙丙丁齐 and horizontal bars with a space
            txt_process = re.sub('[甲乙丙丁齐―]', ' ', txt_process).strip()
            # Strip punctuation from both ends and drop U+FFFD replacement characters
            txt_process = re.sub(r"^[\W_]+|\ufffd|[\W_]+$", '', txt_process)
            txt_process = rule.sub('', txt_process).strip()

            length = len(txt_process)
            if 0 < length < 15:  # a single line of a sanjuban
                # Drop a leading speaker letter or number; skip lines with nothing left
                txt_process = prefix.sub('', txt_process)
                if not txt_process:
                    continue
                # Odd lines end with a comma, even lines with a full stop
                txt_out += txt_process + ("，" if ii % 2 == 1 else "。")
                ii += 1
                if ii > 4:  # four lines make one sanjuban
                    fout.write(txt_out + '\n')
                    ii = 1
                    txt_out = ""
            else:  # one sanjuban, or several, on a single line
                # Split into short phrases at punctuation marks
                txt_split = re.split(r'[ !,\-.:;?。！（），：；？~—]', txt_process)
                txt_split = [prefix.sub('', s).strip() for s in txt_split]
                txt_split = [s for s in txt_split if s]
                # Keep the line only if it splits into complete groups of four
                if len(txt_split) % 4 == 0:
                    for start in range(0, len(txt_split), 4):
                        a, b, c, d = txt_split[start:start + 4]
                        fout.write(a + "，" + b + "。" + c + "，" + d + "。" + '\n')

The result is a set of neatly formatted sanjuban.


3. Remove duplicates and prefix each line with “三句半名：” to produce the final training-data format.

def remove_duplication(inputFile, outputFile):
    with open(inputFile, 'r', encoding='UTF-8') as fin:
        txt_originlist = fin.readlines()

    # dict.fromkeys removes duplicates while keeping the first-seen order
    txt_processlist = list(dict.fromkeys(txt_originlist))

    with open(outputFile, 'w', encoding='UTF-8') as fout:
        for txt in txt_processlist:
            # Drop the first 13 characters of each line before adding the prefix
            txt = txt[13:]
            fout.write('三句半名:' + txt)

This produces the dataset in its final format.


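4. Proofread the dataset from the previous step by hand and remove abnormal entries to obtain the final dataset.

5. Those are all the functions. In the main block, just fill in each function's arguments (the txt file names):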

if __name__ == '__main__':
    txt_path_in = "sanjuban_1.txt"
    txt_path_out = "sanjuban_2.txt"
    organizeToSanjuban(txt_path_in, txt_path_out)

    txt_path_in = "sanjuban_2.txt"
    txt_path_out = "sanjuban_3.txt"
    normalizeToSanjuban(txt_path_in, txt_path_out)


    # Note: trainToSanjuban is not defined anywhere in this article
    txt_path_in = "sanjuban_3.txt"
    txt_path_out = "sanjuban_train.txt"
    trainToSanjuban(txt_path_in, txt_path_out)

    txt_path_in = "sanjuban_train.txt"
    txt_path_out = "sanjuban_out.txt"
    remove_duplication(txt_path_in, txt_path_out)

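PS: The cleaning I did for the sanjuban data is fairly basic. It still takes a certain amount of manual work, removing distracting entries by hand, to end up with a reasonably clean dataset.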
