1. urllib and BeautifulSoup

1.1 Basic usage of urllib

urllib is a collection of modules for working with URLs that ships with Python 3.x; it makes it easy to simulate a user visiting a web page with a browser.

Usage steps:

  • Import the request module from the urllib package: from urllib import request
  • Request a URL, e.g.: resp = request.urlopen('http://www.baidu.com')
  • Read data from the response object, e.g.: print(resp.read().decode("utf-8"))

Example:

from urllib import request

resp = request.urlopen("http://www.baidu.com")
print(resp.read().decode("utf-8"))

1.1.1 Simulating a real browser

  • Send a User-Agent header
# Build a request object with Request(url)
req = request.Request(url)
# Add a request header with add_header(key, value)
req.add_header(key, value)
# Send the request with urlopen
resp = request.urlopen(req)
# Decode the response body as UTF-8
print(resp.read().decode("utf-8"))

Example:

from urllib import request
# Build a request object with Request(url)
req = request.Request("http://www.baidu.com")
# Add a request header with add_header(key, value)
req.add_header("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36")
# Send the request with urlopen
resp = request.urlopen(req)
# Decode the response body as UTF-8
print(resp.read().decode("utf-8"))
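
Alternatively, the headers can be passed to the Request constructor as a dict via its headers parameter; a minimal sketch:

from urllib import request

# Request(url, headers=...) sets all headers in one call instead of add_header()
headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36"}
req = request.Request("http://www.baidu.com", headers=headers)
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))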

1.1.2 Sending a POST request

  • Import parse from urllib: from urllib import parse
  • Use urlencode to build the POST data
postData = parse.urlencode([
    (key1, val1),
    (key2, val2),
    (keyn, valn)
])
  • Send the POST request with the encoded data:
request.urlopen(req, data=postData.encode('utf-8'))
  • Get the response status code: resp.status
  • Get the response reason phrase (e.g. "OK"): resp.reason (a short sketch follows the example output below)

Example:

from urllib.request import urlopen
from urllib.request import Request
from urllib import parse

req = Request("https://m.xbiquge.la/register.php")
# Build the POST data with parse.urlencode()
postData = parse.urlencode([
    ("SignupForm[username]", "admin"),
    ("SignupForm[password]", "123456"),
    ("SignupForm[email]", ""),
    ("register", "确认注册")
])
req.add_header("Origin", "https://m.xbiquge.la")
req.add_header("User Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36")
resp = urlopen(req, data=postData.encode('utf-8'))
print(resp.read().decode("utf-8"))

Output:

<!doctype html>
<html>
<head>
<title>出现错误!</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="MobileOptimized" content="240"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0,  minimum-scale=1.0, maximum-scale=1.0" />
<style>
    .content{margin:50px 7px 10px 7px;}
    .content a{color:#0080C0;}
</style>
</head>
<body>
<div class="content">
    <div class="c1">
        <h1>出现错误!</h1>
        <strong>错误原因:</strong>
        <ul>
<li>用户名已存在.<li/><li>email不能为空!<li/>        <br /><br /><br />
        请 <a href="javascript:history.back(1)">返 回</a> 并修正<br /><br />
</div>
</body>
</html>
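
The resp.status and resp.reason attributes mentioned above can be checked on any response object; a minimal sketch with a plain GET request:

from urllib.request import urlopen

resp = urlopen("http://www.baidu.com")
# status is the HTTP status code, reason is the accompanying reason phrase
print(resp.status)  # e.g. 200
print(resp.reason)  # e.g. OK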

1.2 BeautifulSoup

1.2.1 Installing BeautifulSoup4

  • Linux
sudo apt-get install python3-bs4
  • Mac
sudo easy_install pip
pip install beautifulsoup4
  • Windows
pip install beautifulsoup4
pip3 install beautifulsoup4
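
To quickly verify the install, a minimal sketch that parses a one-line document:

from bs4 import BeautifulSoup

# If this prints the paragraph text, the install works
soup = BeautifulSoup("<p>Hello, BeautifulSoup</p>", "html.parser")
print(soup.p.string)  # Hello, BeautifulSoup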

1.2.2 Using BeautifulSoup

  • Parse HTML with BeautifulSoup(html, 'html.parser')
  • Find a single node: soup.find(id='imooc')
  • Find multiple nodes: soup.findAll('a')
  • Match with a regular expression: soup.findAll('a', href=reObj)
from bs4 import BeautifulSoup as bs


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = bs(html_doc, 'html.parser')
# Pretty-print the parsed document
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
Another example, this time extracting specific nodes and searching with regular expressions:

from bs4 import BeautifulSoup as bs
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = bs(html_doc, 'html.parser')
# Pretty-print the parsed document
#print(soup.prettify())

# Get the title
print(soup.title.string) # The Dormouse's story
# Get the first a tag
print(soup.a) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Get the tag whose id is link2
print(soup.find(id="link2")) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# Get the text of the tag whose id is link2
print(soup.find(id="link2").string) # Lacie
print(soup.find(id="link2").get_text()) # Lacie
# Get all a tags
print(soup.findAll("a"))
'''
[
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
]
'''
# Print the text of every a tag
for link in soup.findAll("a"):
    print(link.string)
'''
Elsie
Lacie
Tillie
'''
print(soup.find('p', {"class":"story"}))
print(soup.find('p', {"class":"story"}).get_text())

# Search with regular expressions
# Find all tags whose names start with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

data = soup.findAll("a", href=re.compile(r"^http://example\.com/"))
print(data)
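
Tag attributes can be read like dictionary keys, or with get(), which returns None when the attribute is missing; a minimal sketch:

from bs4 import BeautifulSoup as bs

link = bs('<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>', 'html.parser').a
# Read attributes by key, or with get() when the attribute may be absent
print(link['href'])       # http://example.com/elsie
print(link.get('id'))     # link1
print(link.get('title'))  # None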

1.3 Fetching Baidu Baike entry information

# Import the required packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


# Clean up whitespace and blank lines in the HTML
def bs_preprocess(html):
    """remove distracting whitespaces and newline characters"""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)  # remove leading and trailing whitespaces
    html = re.sub('\n', ' ', html)  # convert newlines to spaces
    # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html)  # remove whitespaces before opening tags
    html = re.sub(r'>[\s]+', '>', html)  # remove whitespaces after closing tags
    return html

# Request the URL and decode the result as UTF-8
resp = urlopen("https://baike.baidu.com/").read().decode("utf-8")
resp = bs_preprocess(resp)
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Pretty-print the parsed document
# print(soup.prettify())
# Walk every tag in the document tree
for tag in soup.find_all():
    # If the tag is an em or img tag
    if tag.name in ["em", "img"]:
        # remove it from the tree
        tag.decompose()
    if tag.name in ['span']:
        # remove span tags whose parent is not a div
        if tag.parent.name != "div":
            tag.decompose()
# Remove divs whose class starts with category_ or content_cnt
for div in soup.findAll("div", {"class": re.compile("^(category_|content_cnt)")}):
    div.decompose()

# Get all a tags whose href starts with https://baike.baidu.com/item/
listUrls = soup.findAll("a", href=re.compile("^https://baike.baidu.com/item/"))
#print(listUrls)
# Print the name and URL of every entry
for url in listUrls:
    # Skip entries whose text or href is empty
    if url.get_text().strip() != "" and url['href'].strip() != "":
        # Print the link text and the corresponding URL
        # .string returns only a single string; get_text() returns all the text under the tag
        print(url.get_text().strip(), "<--->", url['href'].strip())

The output lists every entry name together with its URL (screenshot omitted).

1.4 Storing data in MySQL

1.4.1 Installing and uninstalling

  • Install pymysql with pip
pip install pymysql

  • Or install from the source package
python setup.py install
  • Uninstall
pip uninstall pymysql

1.4.2 Using pymysql

# Import the package
import pymysql.cursors

# Get a database connection
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikeurl',
                             charset='utf8mb4')

# Get a cursor
cursor = connection.cursor()

# Execute a SQL statement
cursor.execute(sql, (param1, paramN))

# Commit the transaction
connection.commit()

# Close the connection
connection.close()
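
A minimal end-to-end sketch of the calls above, assuming a local MySQL server with these placeholder credentials and an existing baikeurl database:

import pymysql.cursors

# Connect to the local MySQL server (host, user, password and db are placeholders)
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikeurl',
                             charset='utf8mb4')
try:
    # Get a cursor and run a trivial statement to confirm the connection works
    cursor = connection.cursor()
    cursor.execute("select version()")
    print(cursor.fetchone())  # e.g. ('8.0.33',)
finally:
    connection.close()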

Create a database named baikeurl (screenshot omitted), then create the urls table with the following SQL:

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for urls
-- ----------------------------
DROP TABLE IF EXISTS `urls`;
CREATE TABLE `urls`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `urlname` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  `urlhref` varchar(1000) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;
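
The same table can also be created from Python with pymysql; a minimal sketch, assuming the baikeurl database and the placeholder credentials used above:

import pymysql.cursors

create_sql = """
CREATE TABLE IF NOT EXISTS `urls` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `urlname` varchar(255) NOT NULL,
  `urlhref` varchar(1000) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
"""

connection = pymysql.connect(host='localhost', user='root', password='123456',
                             db='baikeurl', charset='utf8mb4')
try:
    # DDL statements are executed the same way as any other SQL
    with connection.cursor() as cursor:
        cursor.execute(create_sql)
    connection.commit()
finally:
    connection.close()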

1.4.3 Storing the data in MySQL

  • Get a database connection:
connection = pymysql.connect(host='localhost',
                            user='root',
                            password='123456',
                            db='db',
                            charset='utf8mb4')
  • Get a cursor with connection.cursor()
  • Execute SQL with cursor.execute(sql, (param1, paramN))
  • Commit with connection.commit()
  • Close the connection with connection.close()
  • The return value of cursor.execute() is the number of rows the query matched
  • Use cursor.fetchone() to fetch the next row
  • Use cursor.fetchmany(size=10) to fetch the given number of rows
  • Use cursor.fetchall() to fetch all remaining rows
# Import the required packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pymysql.cursors


# Clean up whitespace and blank lines in the HTML
def bs_preprocess(html):
    """remove distracting whitespaces and newline characters"""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)  # remove leading and trailing whitespaces
    html = re.sub('\n', ' ', html)  # convert newlines to spaces
    # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html)  # remove whitespaces before opening tags
    html = re.sub(r'>[\s]+', '>', html)  # remove whitespaces after closing tags
    return html

# Request the URL and decode the result as UTF-8
resp = urlopen("https://baike.baidu.com/").read().decode("utf-8")
resp = bs_preprocess(resp)
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Pretty-print the parsed document
# print(soup.prettify())
# Walk every tag in the document tree
for tag in soup.find_all():
    # If the tag is an em or img tag
    if tag.name in ["em", "img"]:
        # remove it from the tree
        tag.decompose()
    if tag.name in ['span']:
        # remove span tags whose parent is not a div
        if tag.parent.name != "div":
            tag.decompose()
# Remove divs whose class starts with category_ or content_cnt
for div in soup.findAll("div", {"class": re.compile("^(category_|content_cnt)")}):
    div.decompose()

# Get all a tags whose href starts with https://baike.baidu.com/item/
listUrls = soup.findAll("a", href=re.compile("^https://baike.baidu.com/item/"))
#print(listUrls)
# Print the name and URL of every entry
for url in listUrls:
    # Skip entries whose text or href is empty
    if url.get_text().strip() != "" and url['href'].strip() != "":
        # Print the link text and the corresponding URL
        # .string returns only a single string; get_text() returns all the text under the tag
        print(url.get_text().strip(), "<--->", url['href'].strip())

        # Get a database connection
        # (a new connection is opened for every URL; fine for a demo,
        #  but it could also be opened once outside the loop)
        connection = pymysql.connect(host='localhost',
                                     user='root',
                                     password='123456',
                                     db='baikeurl',
                                     charset='utf8mb4')
        try:
            # Get a cursor
            with connection.cursor() as cursor:
                # Build the SQL statement
                sql = "insert into `urls` (`urlname`, `urlhref`) values (%s, %s)"
                # Execute it with the entry name and URL as parameters
                cursor.execute(sql, (url.get_text(), url['href']))
                # Commit the transaction
                connection.commit()
        finally:
            connection.close()

Result: the urls table is populated with one row per entry, holding the entry name and URL (screenshot omitted).

1.4.4 Reading (querying) MySQL data

# Get the number of rows matched by the query (the return value of execute)
count = cursor.execute(sql)

# Fetch the next row
cursor.fetchone()

# Fetch a given number of rows
cursor.fetchmany(size=None)

# Fetch all remaining rows
cursor.fetchall()

# Close the connection
connection.close()
# Import the package
import pymysql.cursors

# Get a connection
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikeurl',
                             charset='utf8mb4')

try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Query statement
        sql = "select `urlname`, `urlhref` from `urls` where `id` is not null"
        count = cursor.execute(sql)
        print(count)

        # Fetch the rows
        #result = cursor.fetchall()
        #print(result)

        result2 = cursor.fetchmany(size=3)
        print(result2)
finally:
    connection.close()

Output: the number of matched rows, followed by the three rows returned by fetchmany (screenshot omitted).
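
fetchone(), listed above but not used in the example, returns one row at a time and None once the result set is exhausted; a minimal sketch against the same table:

import pymysql.cursors

connection = pymysql.connect(host='localhost', user='root', password='123456',
                             db='baikeurl', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute("select `urlname`, `urlhref` from `urls`")
        # Loop row by row until fetchone() returns None
        row = cursor.fetchone()
        while row is not None:
            print(row[0], "<--->", row[1])
            row = cursor.fetchone()
finally:
    connection.close()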


2. Reading common document formats (TXT, PDF)

2.1 Reading TXT documents with Python

  • Read TXT documents: urlopen()
  • Read PDF documents: pdfminer3k
from urllib.request import urlopen

html = urlopen("")  # URL of a plain-text resource (e.g. a site's robots.txt) goes here
print(html.read().decode("utf-8"))

Output:

User-agent: * 
Disallow: /scripts 
Disallow: /public 
Disallow: /css/ 
Disallow: /images/ 
Disallow: /content/ 
Disallow: /ui/ 
Disallow: /js/ 
Disallow: /scripts/ 
Disallow: /article_preview.html* 
Disallow: /tag/
Disallow: /*?*
Disallow: /link/

Sitemap: 
Sitemap:
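
The example above reads a remote text resource over HTTP; a local TXT file can be read with the built-in open(). A minimal sketch, assuming a UTF-8 file named example.txt (a placeholder file name):

# Read a local text file line by line
with open("example.txt", "r", encoding="utf-8") as f:
    for line in f:
        print(line.rstrip())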

2.2 Installing pdfminer3k

pip install pdfminer3k

python setup.py install

2.3 Reading PDF documents with Python

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

'''
File open modes for open():
w: open for writing;
a: open for appending (starts at EOF, creates the file if necessary);
r+: open for reading and writing;
w+: open for reading and writing (see w);
a+: open for reading and writing (see a);
rb: open for reading in binary mode;
wb: open for writing in binary mode (see w);
ab: open for appending in binary mode (see a);
rb+: open for reading and writing in binary mode (see r+);
wb+: open for reading and writing in binary mode (see w+);
ab+: open for reading and writing in binary mode (see a+)
'''
# Open the document object
# in binary read mode
fp = open("test.pdf", "rb")

# Create a parser tied to the file object
parser = PDFParser(fp)

# The PDF document object
doc = PDFDocument()

# Connect the parser and the document object
parser.set_document(doc)
doc.set_parser(parser)

# Initialize the document
doc.initialize("")  # empty string = no password

# Create a PDF resource manager
resource = PDFResourceManager()

# Layout analysis parameters
laparam = LAParams()

# Create an aggregator that collects the layout of each page
device = PDFPageAggregator(resource, laparams=laparam)

# Create a PDF page interpreter
interpreter = PDFPageInterpreter(resource, device)

# Iterate over the pages of the document
for page in doc.get_pages():
    # Process the page with the interpreter
    interpreter.process_page(page)

    # Get the aggregated layout result for the page
    layout = device.get_result()

    for out in layout:
        # Only layout objects that contain text (e.g. LTTextBox) have get_text()
        if hasattr(out, "get_text"):
            print(out.get_text())

test.pdf contains a short passage of prose, a small table of monthly sales figures, and a flowchart (screenshot omitted):


Output:

古之学者必有师。师者,所以传道受业解惑也。人非生而知之者,孰能无惑?
惑而不从师,其为惑也,终不解矣。生乎吾前,其闻道也固先乎吾,吾从而师之;
生乎吾后,其闻道也亦先乎吾,吾从而师之。吾师道也,夫庸知其年之先后生于吾
乎?是故无贵无贱,无长无少,道之所存,师之所存也。

月份
一月份
二月份
三月份

预期销售额

700
500
800

实际销售额

650
600
600

开始

主页

指南查询

输入检索关
键字

否

是

是否检索到相
关记录

是

结果分页显
示

是否继续

否

结束