1. urllib and BeautifulSoup

1.1 Basic usage of urllib

urllib is a collection of modules for working with URLs that ships with Python 3.x; it makes it easy to simulate a user visiting a web page with a browser.

Usage steps:

  • Import the request module from the urllib package: from urllib import request
  • Request a URL, e.g.: resp = request.urlopen('http://www.baidu.com')
  • Read data from the response object, e.g.: print(resp.read().decode("utf-8"))

Example:

from urllib import request

resp = request.urlopen("http://www.baidu.com")
print(resp.read().decode("utf-8"))

1.1.1 Simulating a real browser

  • Send a User-Agent header
# Build a request object with Request(url)
req = request.Request(url)
# Add a request header with add_header(key, value)
req.add_header(key, value)
# Send the request with urlopen
resp = request.urlopen(req)
# Decode the response body as UTF-8
print(resp.read().decode("utf-8"))

Example:

from urllib import request
# Build a request object with Request(url)
req = request.Request("http://www.baidu.com")
# Add a request header with add_header(key, value)
req.add_header("User-Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36")
# Send the request with urlopen
resp = request.urlopen(req)
# Decode the response body as UTF-8
print(resp.read().decode("utf-8"))
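
Alternatively, the headers can be passed to the Request constructor as a dict via its headers parameter; a minimal sketch:

from urllib import request

# Request(url, headers=...) sets all headers in one call instead of add_header()
headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36"}
req = request.Request("http://www.baidu.com", headers=headers)
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))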

1.1.2 Sending a POST request

  • Import parse from urllib: from urllib import parse
  • Use urlencode to build the POST data
postData = parse.urlencode([
    (key1, val1),
    (key2, val2),
    (keyn, valn)
])
  • Send the POST request with the encoded data:
request.urlopen(req, data=postData.encode('utf-8'))
  • Get the response status code: resp.status
  • Get the response reason phrase (e.g. "OK"): resp.reason (a short sketch follows the example output below)

Example:

from urllib.request import urlopen
from urllib.request import Request
from urllib import parse

req = Request("https://m.xbiquge.la/register.php")
# Build the POST data with parse.urlencode()
postData = parse.urlencode([
    ("SignupForm[username]", "admin"),
    ("SignupForm[password]", "123456"),
    ("SignupForm[email]", ""),
    ("register", "确认注册")
])
req.add_header("Origin", "https://m.xbiquge.la")
req.add_header("User Agent", "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Mobile Safari/537.36")
resp = urlopen(req, data=postData.encode('utf-8'))
print(resp.read().decode("utf-8"))

Output:

<!doctype html>
<html>
<head>
<title>出现错误!</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="MobileOptimized" content="240"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0,  minimum-scale=1.0, maximum-scale=1.0" />
<style>
    .content{margin:50px 7px 10px 7px;}
    .content a{color:#0080C0;}
</style>
</head>
<body>
<div class="content">
    <div class="c1">
        <h1>出现错误!</h1>
        <strong>错误原因:</strong>
        <ul>
<li>用户名已存在.<li/><li>email不能为空!<li/>        <br /><br /><br />
        请 <a href="javascript:history.back(1)">返 回</a> 并修正<br /><br />
</div>
</body>
</html>
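
The resp.status and resp.reason attributes mentioned above can be checked on any response object; a minimal sketch with a plain GET request:

from urllib.request import urlopen

resp = urlopen("http://www.baidu.com")
# status is the HTTP status code, reason is the accompanying reason phrase
print(resp.status)  # e.g. 200
print(resp.reason)  # e.g. OK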

1.2 BeautifulSoup

1.2.1 Installing BeautifulSoup4

  • Linux
sudo apt-get install python3-bs4
  • Mac
sudo easy_install pip
pip install beautifulsoup4
  • Windows
pip install beautifulsoup4
pip3 install beautifulsoup4
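
To quickly verify the install, a minimal sketch that parses a one-line document:

from bs4 import BeautifulSoup

# If this prints the paragraph text, the install works
soup = BeautifulSoup("<p>Hello, BeautifulSoup</p>", "html.parser")
print(soup.p.string)  # Hello, BeautifulSoup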

1.2.2 Using BeautifulSoup

  • Parse HTML with BeautifulSoup(html, 'html.parser')
  • Find a single node: soup.find(id='imooc')
  • Find multiple nodes: soup.findAll('a')
  • Match with a regular expression: soup.findAll('a', href=reObj)
from bs4 import BeautifulSoup as bs


html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = bs(html_doc, 'html.parser')
# Pretty-print the parsed document
print(soup.prettify())

Output:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
Another example, this time extracting specific nodes and searching with regular expressions:

from bs4 import BeautifulSoup as bs
import re

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = bs(html_doc, 'html.parser')
# Pretty-print the parsed document
#print(soup.prettify())

# Get the title
print(soup.title.string) # The Dormouse's story
# Get the first a tag
print(soup.a) # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# Get the tag whose id is link2
print(soup.find(id="link2")) # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# Get the text of the tag whose id is link2
print(soup.find(id="link2").string) # Lacie
print(soup.find(id="link2").get_text()) # Lacie
# Get all a tags
print(soup.findAll("a"))
'''
[
    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
    <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 
    <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
]
'''
# Print the text of every a tag
for link in soup.findAll("a"):
    print(link.string)
'''
Elsie
Lacie
Tillie
'''
print(soup.find('p', {"class":"story"}))
print(soup.find('p', {"class":"story"}).get_text())

# Search with regular expressions
# Find all tags whose names start with b
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

data = soup.findAll("a", href=re.compile(r"^http://example\.com/"))
print(data)
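
Tag attributes can be read like dictionary keys, or with get(), which returns None when the attribute is missing; a minimal sketch:

from bs4 import BeautifulSoup as bs

link = bs('<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>', 'html.parser').a
# Read attributes by key, or with get() when the attribute may be absent
print(link['href'])       # http://example.com/elsie
print(link.get('id'))     # link1
print(link.get('title'))  # None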

1.3 Fetching Baidu Baike entry information

# Import the required packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


# Clean up whitespace and blank lines in the HTML
def bs_preprocess(html):
    """remove distracting whitespaces and newline characters"""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)  # remove leading and trailing whitespaces
    html = re.sub('\n', ' ', html)  # convert newlines to spaces
    # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html)  # remove whitespaces before opening tags
    html = re.sub(r'>[\s]+', '>', html)  # remove whitespaces after closing tags
    return html

# Request the URL and decode the result as UTF-8
resp = urlopen("https://baike.baidu.com/").read().decode("utf-8")
resp = bs_preprocess(resp)
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Pretty-print the parsed document
# print(soup.prettify())
# Walk every tag in the document tree
for tag in soup.find_all():
    # If the tag is an em or img tag
    if tag.name in ["em", "img"]:
        # remove it from the tree
        tag.decompose()
    if tag.name in ['span']:
        # remove span tags whose parent is not a div
        if tag.parent.name != "div":
            tag.decompose()
# Remove divs whose class starts with category_ or content_cnt
for div in soup.findAll("div", {"class": re.compile("^(category_|content_cnt)")}):
    div.decompose()

# Get all a tags whose href starts with https://baike.baidu.com/item/
listUrls = soup.findAll("a", href=re.compile("^https://baike.baidu.com/item/"))
#print(listUrls)
# Print the name and URL of every entry
for url in listUrls:
    # Skip entries whose text or href is empty
    if url.get_text().strip() != "" and url['href'].strip() != "":
        # Print the link text and the corresponding URL
        # .string returns only a single string; get_text() returns all the text under the tag
        print(url.get_text().strip(), "<--->", url['href'].strip())

The output lists every entry name together with its URL (screenshot omitted).

1.4 Storing data in MySQL

1.4.1 Installing and uninstalling

  • Install pymysql with pip
pip install pymysql

  • Or install from the source package
python setup.py install
  • Uninstall
pip uninstall pymysql

1.4.2 Using pymysql

# Import the package
import pymysql.cursors

# Get a database connection
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikeurl',
                             charset='utf8mb4')

# Get a cursor
cursor = connection.cursor()

# Execute a SQL statement
cursor.execute(sql, (param1, paramN))

# Commit the transaction
connection.commit()

# Close the connection
connection.close()
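
A minimal end-to-end sketch of the calls above, assuming a local MySQL server with these placeholder credentials and an existing baikeurl database:

import pymysql.cursors

# Connect to the local MySQL server (host, user, password and db are placeholders)
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikeurl',
                             charset='utf8mb4')
try:
    # Get a cursor and run a trivial statement to confirm the connection works
    cursor = connection.cursor()
    cursor.execute("select version()")
    print(cursor.fetchone())  # e.g. ('8.0.33',)
finally:
    connection.close()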

Create a database named baikeurl (screenshot omitted), then create the urls table with the following SQL:

SET NAMES utf8mb4;
SET FOREIGN_KEY_CHECKS = 0;

-- ----------------------------
-- Table structure for urls
-- ----------------------------
DROP TABLE IF EXISTS `urls`;
CREATE TABLE `urls`  (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `urlname` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  `urlhref` varchar(1000) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  PRIMARY KEY (`id`) USING BTREE
) ENGINE = InnoDB AUTO_INCREMENT = 1 CHARACTER SET = utf8mb4 COLLATE = utf8mb4_general_ci ROW_FORMAT = Dynamic;

SET FOREIGN_KEY_CHECKS = 1;
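
The same table can also be created from Python with pymysql; a minimal sketch, assuming the baikeurl database and the placeholder credentials used above:

import pymysql.cursors

create_sql = """
CREATE TABLE IF NOT EXISTS `urls` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `urlname` varchar(255) NOT NULL,
  `urlhref` varchar(1000) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
"""

connection = pymysql.connect(host='localhost', user='root', password='123456',
                             db='baikeurl', charset='utf8mb4')
try:
    # DDL statements are executed the same way as any other SQL
    with connection.cursor() as cursor:
        cursor.execute(create_sql)
    connection.commit()
finally:
    connection.close()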

1.4.3 Storing the data in MySQL

  • Get a database connection:
connection = pymysql.connect(host='localhost',
                            user='root',
                            password='123456',
                            db='db',
                            charset='utf8mb4')
  • Get a cursor with connection.cursor()
  • Execute SQL with cursor.execute(sql, (param1, paramN))
  • Commit with connection.commit()
  • Close the connection with connection.close()
  • The return value of cursor.execute() is the number of rows the query matched
  • Use cursor.fetchone() to fetch the next row
  • Use cursor.fetchmany(size=10) to fetch the given number of rows
  • Use cursor.fetchall() to fetch all remaining rows
# Import the required packages
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import pymysql.cursors


# Clean up whitespace and blank lines in the HTML
def bs_preprocess(html):
    """remove distracting whitespaces and newline characters"""
    pat = re.compile(r'(^[\s]+)|([\s]+$)', re.MULTILINE)
    html = re.sub(pat, '', html)  # remove leading and trailing whitespaces
    html = re.sub('\n', ' ', html)  # convert newlines to spaces
    # this preserves newline delimiters
    html = re.sub(r'[\s]+<', '<', html)  # remove whitespaces before opening tags
    html = re.sub(r'>[\s]+', '>', html)  # remove whitespaces after closing tags
    return html

# Request the URL and decode the result as UTF-8
resp = urlopen("https://baike.baidu.com/").read().decode("utf-8")
resp = bs_preprocess(resp)
# Parse with BeautifulSoup
soup = BeautifulSoup(resp, "html.parser")
# Pretty-print the parsed document
# print(soup.prettify())
# Walk every tag in the document tree
for tag in soup.find_all():
    # If the tag is an em or img tag
    if tag.name in ["em", "img"]:
        # remove it from the tree
        tag.decompose()
    if tag.name in ['span']:
        # remove span tags whose parent is not a div
        if tag.parent.name != "div":
            tag.decompose()
# Remove divs whose class starts with category_ or content_cnt
for div in soup.findAll("div", {"class": re.compile("^(category_|content_cnt)")}):
    div.decompose()

# Get all a tags whose href starts with https://baike.baidu.com/item/
listUrls = soup.findAll("a", href=re.compile("^https://baike.baidu.com/item/"))
#print(listUrls)
# Print the name and URL of every entry
for url in listUrls:
    # Skip entries whose text or href is empty
    if url.get_text().strip() != "" and url['href'].strip() != "":
        # Print the link text and the corresponding URL
        # .string returns only a single string; get_text() returns all the text under the tag
        print(url.get_text().strip(), "<--->", url['href'].strip())

        # Get a database connection
        # (a new connection is opened for every URL; fine for a demo,
        #  but it could also be opened once outside the loop)
        connection = pymysql.connect(host='localhost',
                                     user='root',
                                     password='123456',
                                     db='baikeurl',
                                     charset='utf8mb4')
        try:
            # Get a cursor
            with connection.cursor() as cursor:
                # Build the SQL statement
                sql = "insert into `urls` (`urlname`, `urlhref`) values (%s, %s)"
                # Execute it with the entry name and URL as parameters
                cursor.execute(sql, (url.get_text(), url['href']))
                # Commit the transaction
                connection.commit()
        finally:
            connection.close()

Result: the urls table is populated with one row per entry, holding the entry name and URL (screenshot omitted).

1.4.4 Reading (querying) MySQL data

# Get the number of rows matched by the query (the return value of execute)
count = cursor.execute(sql)

# Fetch the next row
cursor.fetchone()

# Fetch a given number of rows
cursor.fetchmany(size=None)

# Fetch all remaining rows
cursor.fetchall()

# Close the connection
connection.close()
# Import the package
import pymysql.cursors

# Get a connection
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='123456',
                             db='baikeurl',
                             charset='utf8mb4')

try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Query statement
        sql = "select `urlname`, `urlhref` from `urls` where `id` is not null"
        count = cursor.execute(sql)
        print(count)

        # Fetch the rows
        #result = cursor.fetchall()
        #print(result)

        result2 = cursor.fetchmany(size=3)
        print(result2)
finally:
    connection.close()

Output: the number of matched rows, followed by the three rows returned by fetchmany (screenshot omitted).
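
fetchone(), listed above but not used in the example, returns one row at a time and None once the result set is exhausted; a minimal sketch against the same table:

import pymysql.cursors

connection = pymysql.connect(host='localhost', user='root', password='123456',
                             db='baikeurl', charset='utf8mb4')
try:
    with connection.cursor() as cursor:
        cursor.execute("select `urlname`, `urlhref` from `urls`")
        # Loop row by row until fetchone() returns None
        row = cursor.fetchone()
        while row is not None:
            print(row[0], "<--->", row[1])
            row = cursor.fetchone()
finally:
    connection.close()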


2. Reading common document formats (TXT, PDF)

2.1 Reading TXT documents with Python

  • Read TXT documents: urlopen()
  • Read PDF documents: pdfminer3k
from urllib.request import urlopen

html = urlopen("")  # URL of a plain-text resource (e.g. a site's robots.txt) goes here
print(html.read().decode("utf-8"))

Output:

User-agent: * 
Disallow: /scripts 
Disallow: /public 
Disallow: /css/ 
Disallow: /images/ 
Disallow: /content/ 
Disallow: /ui/ 
Disallow: /js/ 
Disallow: /scripts/ 
Disallow: /article_preview.html* 
Disallow: /tag/
Disallow: /*?*
Disallow: /link/

Sitemap: 
Sitemap:
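
The example above reads a remote text resource over HTTP; a local TXT file can be read with the built-in open(). A minimal sketch, assuming a UTF-8 file named example.txt (a placeholder file name):

# Read a local text file line by line
with open("example.txt", "r", encoding="utf-8") as f:
    for line in f:
        print(line.rstrip())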

2.2 Installing pdfminer3k

pip install pdfminer3k

python setup.py install

2.3 Reading PDF documents with Python

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFParser, PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

'''
File open modes for open():
w: open for writing;
a: open for appending (starts at EOF, creates the file if necessary);
r+: open for reading and writing;
w+: open for reading and writing (see w);
a+: open for reading and writing (see a);
rb: open for reading in binary mode;
wb: open for writing in binary mode (see w);
ab: open for appending in binary mode (see a);
rb+: open for reading and writing in binary mode (see r+);
wb+: open for reading and writing in binary mode (see w+);
ab+: open for reading and writing in binary mode (see a+)
'''
# Open the document object
# in binary read mode
fp = open("test.pdf", "rb")

# Create a parser tied to the file object
parser = PDFParser(fp)

# The PDF document object
doc = PDFDocument()

# Connect the parser and the document object
parser.set_document(doc)
doc.set_parser(parser)

# Initialize the document
doc.initialize("")  # empty string = no password

# Create a PDF resource manager
resource = PDFResourceManager()

# Layout analysis parameters
laparam = LAParams()

# Create an aggregator that collects the layout of each page
device = PDFPageAggregator(resource, laparams=laparam)

# Create a PDF page interpreter
interpreter = PDFPageInterpreter(resource, device)

# Iterate over the pages of the document
for page in doc.get_pages():
    # Process the page with the interpreter
    interpreter.process_page(page)

    # Get the aggregated layout result for the page
    layout = device.get_result()

    for out in layout:
        # Only layout objects that contain text (e.g. LTTextBox) have get_text()
        if hasattr(out, "get_text"):
            print(out.get_text())

test.pdf contains a short passage of prose, a small table of monthly sales figures, and a flowchart (screenshot omitted):


Output:

古之学者必有师。师者,所以传道受业解惑也。人非生而知之者,孰能无惑?
惑而不从师,其为惑也,终不解矣。生乎吾前,其闻道也固先乎吾,吾从而师之;
生乎吾后,其闻道也亦先乎吾,吾从而师之。吾师道也,夫庸知其年之先后生于吾
乎?是故无贵无贱,无长无少,道之所存,师之所存也。

月份
一月份
二月份
三月份

预期销售额

700
500
800

实际销售额

650
600
600

开始

主页

指南查询

输入检索关
键字

否

是

是否检索到相
关记录

是

结果分页显
示

是否继续

否

结束