Fetching and Parsing Web Pages with Python

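Original post

wx63086371c7e9c 2022-08-26 14:51:57 Category: Python

© Copyright belongs to the author. This is an original work by 51CTO blog author wx63086371c7e9c; contact the author for permission to repost. Unauthorized reposting may be pursued legally.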

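(1) Install the third-party library httplib2
First, download the httplib2 package for Python from http://code.google.com/p/httplib2/downloads/list. Next, open a DOS (command prompt) window, change into the directory where httplib2 was unpacked, and run: python setup.py install. That completes the installation. Where pip is available, running pip install httplib2 is an alternative. Finally, add the library in PyDev: windows->preferences->PyDev->Editor->Interpreter-Python->Libraries->New Folder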
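
A quick way to confirm that the installation worked is to ask Python whether it can locate the module. The sketch below uses only the standard library; the helper name is_installed is made up for this illustration and is not part of httplib2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate a top-level module."""
    # find_spec returns None when a top-level module cannot be found
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Expected to report True once the installation step has succeeded
    print("httplib2 installed:", is_installed("httplib2"))
```

If this reports False inside PyDev but True on the command line, the interpreter configured in PyDev is probably missing the library folder added in the last step.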

The page http://docs.python.org/library/index.html documents the modules of the Python standard library.

(2) The example below fetches the page at http://guangzhou.8684.cn/x_24f5dad9 and then uses regular expressions to extract the bus route and bus stop information from it.

#!/usr/bin/python
# -*- coding: utf-8 -*-
'''
Created on 2013-8-26

@author: chenll
'''

'''
[Tool requirement] Scrape the route data and stop data for Guangzhou bus route 1.

'''
import re

import httplib2


# Fetch the HTML page content
def getContent():
    h = httplib2.Http(".cache")
    resp, content = h.request("http://guangzhou.8684.cn/x_24f5dad9",
                              headers={'cache-control': 'no-cache'})
    # The response body is bytes; the page is GBK-encoded, so decode it to text
    return content.decode('gbk')


def fetch():
    content = getContent()
    # The route block starts at: <div class="hc_d3" id="show1">
    # and ends just before:      <h2 class="hc_re">
    startIndex = content.index('<div class="hc_d3" id="show1">')
    endIndex = content.index('<h2 class="hc_re">')
    subContent = content[startIndex:endIndex]
    reg = r'[\s\S]*<h2\s*class="hc_p6">([\s\S]*)<span\s*id="ad581"></span></h2>\s*<p\s*class="hc_p7"><span>([\S]*)</span>\s*<span>([\s\S]*)</span>\s*<span>([\s\S]*)</span>\s*<a href="[\s\S]*">[\s\S]*</a>\s*</p>\s*<p\s*class="hc_p8">([\s\S]*)'
    match = re.match(reg, subContent)
    if match:
        # Route name
        lineName = match.group(1)
        # Route type
        lineType = match.group(2)
        # First departure time
        lineTime = match.group(3)
        # Fare
        ticket = match.group(4)
        # Stop information
        stationInfo = match.group(5)
        # "去程：" marks the outbound direction, "回程：" the return direction
        reg = r'\s*<i>去程：</i>([\s\S]*)<i>回程：</i>([\s\S]*)'
        match1 = re.match(reg, stationInfo)
        if match1:
            # Each stop is a link; stops are separated by '-'
            stationReg = r'\s*<a\s*href="[\s\S]*">([\s\S]*)</a>\s*'
            # Outbound direction
            qc = match1.group(1)
            qcArray = qc.split('-')
            for each in qcArray:
                match2 = re.match(stationReg, each)
                if match2:
                    # Outbound stop
                    print(match2.group(1))
            # Return direction
            hc = match1.group(2)
            hcArray = hc.split('-')
            for each in hcArray:
                match2 = re.match(stationReg, each)
                if match2:
                    # Return stop
                    print(match2.group(1))


# Entry-point function
def main():
    fetch()


if __name__ == '__main__':
    main()
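
The stop-extraction step can be tried offline, without touching the network. The sketch below is only an illustration: the SAMPLE fragment is invented to imitate the markup the script expects (an outbound marker, a return marker, and links separated by hyphens), so the real page may differ. It also swaps the split-then-match loop for a single non-greedy re.findall, and the names parse_stops and STOP_RE are hypothetical.

```python
import re

# Hypothetical fragment imitating the expected markup; not copied from the real page
SAMPLE = (
    '<i>去程：</i>'
    '<a href="/z_1">Stop A</a>-<a href="/z_2">Stop B</a>-<a href="/z_3">Stop C</a>'
    '<i>回程：</i>'
    '<a href="/z_3">Stop C</a>-<a href="/z_1">Stop A</a>'
)

# Non-greedy capture of the text inside each <a href="...">...</a>
STOP_RE = re.compile(r'<a\s+href="[^"]*">(.*?)</a>', re.S)


def parse_stops(station_info):
    """Split station markup into (outbound, return) lists of stop names."""
    m = re.match(r'\s*<i>去程：</i>(.*?)<i>回程：</i>(.*)', station_info, re.S)
    if not m:
        # Markers not found: report no stops rather than raising
        return [], []
    outbound = [s.strip() for s in STOP_RE.findall(m.group(1))]
    inbound = [s.strip() for s in STOP_RE.findall(m.group(2))]
    return outbound, inbound


if __name__ == "__main__":
    out, back = parse_stops(SAMPLE)
    print(out)
    print(back)
```

Non-greedy patterns such as (.*?) stop at the first closing tag, which avoids a greedy [\s\S]* swallowing several links at once if a stop name ever contains a hyphen and the split breaks a link apart.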