Getting Tmall data with Python: scraping Tmall products

Reposted

mob6454cc63af5e 2023-08-25 17:31:28


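Scraping Tmall product information with Python
The fields collected are: product name, price, monthly sales, number of reviews, popularity value, and shop rating.

Smartphones are used as the example.
First, work out the pattern in the URLs: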

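[Screenshot: URL of the second results page]

The URL of the second page is shown above.

[Screenshot: URL of the third results page]

The URL of the third page is shown above.

Note the number near the middle of the URL (the s parameter): it is 60 on the second page and 120 on the third.

A reasonable guess is that pagination is controlled by this number.

Testing confirms that this is the case.

So a loop that changes this number is enough to visit the next page.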
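The pattern above can be sketched as a small helper. This is only an illustration: build_search_url and PAGE_SIZE are hypothetical names, the URL template is copied from the article's own URL, and treating the q value as the GBK percent-encoding of the keyword is an assumption based on how that value looks.

```python
from urllib.parse import quote

PAGE_SIZE = 60  # offset step seen in the URLs: page 2 -> 60, page 3 -> 120

# Template taken from the article's URL, with the offset and keyword left open
TEMPLATE = ('https://list.tmall.com/search_product.htm'
            '?spm=a220m.1000858.0.0.5c1b4227WVFbrJ&s={offset}&q={q}'
            '&sort=s&style=g&from=.list.pc_1_suggest&suggest=0_3&type=pc#J_Filter')


def build_search_url(keyword, page):
    """Return the search URL for a 1-based page number (hypothetical helper)."""
    offset = (page - 1) * PAGE_SIZE
    # Assumption: the site expects the keyword percent-encoded as GBK
    q = quote(keyword, encoding='gbk')
    return TEMPLATE.format(offset=offset, q=q)


for page in (1, 2, 3):
    print(build_search_url('智能手机', page))
```

With this numbering, page 1 corresponds to an offset of 0, so an offset of i*60 is page i+1.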

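Code:

headers = {
    'User-Agent': '',  # paste your browser's User-Agent string here
    'Cookie': '',      # paste your browser's Cookie string here
}

This is the headers dictionary. To get the User-Agent and Cookie values, open the page in a browser, right-click and choose "Inspect", open the Network tab, select a request, and copy the two values from its request headers.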
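Code:

import requests
from requests.exceptions import RequestException

def gethtml(url, headers):
    """Return the page text, or None if the request fails or the status is not 200."""
    try:
        response = requests.get(url, headers=headers)
        # response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None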

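The function above is generic and can be reused to fetch many other pages.
Code:

from pprint import pprint

# The s parameter is the result offset (60 products per page), so i*60 is page i+1;
# range(20, 81) therefore covers pages 21 through 81
for i in range(20, 81):
    # Result lists for the current page
    price = []
    name = []
    sales = []
    comments = []
    link = []
    url = ('https://list.tmall.com/search_product.htm?spm=a220m.1000858.0.0.5c1b4227WVFbrJ&s='
           + str(i * 60)
           + '&q=%D6%C7%C4%DC%CA%D6%BB%FA&sort=s&style=g&from=.list.pc_1_suggest&suggest=0_3&type=pc#J_Filter')
    pprint(url)
    response = gethtml(url, headers)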

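Next, use XPath to extract the required information.
Code:

parser = etree.HTML(response, etree.HTMLParser())

Remember to import etree from lxml (from lxml import etree).

How to obtain an XPath:

Right-click the page and choose "Inspect", locate the element that holds the information you need, then right-click it -> Copy -> Copy XPath.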

[Screenshot: copying an element's XPath in the browser developer tools]

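If the information is stored in an attribute of the element:
extract it with /@name, where name is the attribute's name, for example title, class, or href.
If the information is the text inside the element:
extract it with /text().
XPath can extract most information, but not content that is loaded dynamically by JavaScript or asynchronously, because that content is not in the downloaded HTML. There are ways to get at it, which are covered further on.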
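To make the two rules above concrete, here is a minimal sketch run against a made-up HTML fragment. The fragment, its values, and the detail.example link are hypothetical and only loosely shaped like one product entry; the XPath expressions follow the /@name and /text() forms described above.

```python
from lxml import etree

# Hypothetical fragment, loosely shaped like one product entry
html = '''
<div id="J_ItemList">
  <div><div>
    <p><em>1999.00</em></p>
    <p><a href="//detail.example/item?id=1" title="Example Phone A">Example Phone A 128G</a></p>
  </div></div>
</div>
'''

parser = etree.HTML(html)

# /text() returns the text inside the element
price = parser.xpath('//*[@id="J_ItemList"]/div[1]/div/p[1]/em/text()')
# /@title and /@href return attribute values
title = parser.xpath('//*[@id="J_ItemList"]/div[1]/div/p[2]/a/@title')
href = parser.xpath('//*[@id="J_ItemList"]/div[1]/div/p[2]/a/@href')
# An element that is not in the HTML yields an empty list, not an error
missing = parser.xpath('//*[@id="J_CollectCount"]/text()')

print(price, title, href, missing)
```

Every xpath() call returns a list, which is why the article's code indexes the result with [0].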
Code:

    # Still inside the page loop: each page lists 60 products
    for j in range(1, 61):
        # Container of the j-th product, e.g. //*[@id="J_ItemList"]/div[1]/div
        item = '//*[@id="J_ItemList"]/div[' + str(j) + ']/div'
        p_price = parser.xpath(item + '/p[1]/em/text()')[0]
        p_name = parser.xpath(item + '/p[2]/a/@title')[0]
        p_sales = parser.xpath(item + '/p[3]/span[1]/em/text()')[0]
        # /text() yields the review count as a string rather than the <a> element
        p_comments = parser.xpath(item + '/p[3]/span[2]/a/text()')[0]
        p_link = parser.xpath(item + '/div[1]/a/@href')[0]
        # Store this product's fields in the per-page lists
        price.append(p_price)
        name.append(p_name)
        sales.append(p_sales)
        comments.append(p_comments)
        link.append(p_link)

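A page contains many products, and the sales, price, review count, and name of each product sit at fixed relative positions. Only the index in the XPath changes from one product to the next, so a loop over that index collects these fields for every product on the page.
The popularity value, the shop rating, and the user reviews are only available on the product page itself, so the loop also collects each product's URL for use in the next step.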

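If the program raises:

list index out of range

then the scrape has failed. This means the anti-scraping mechanism detected the request, usually by serving a CAPTCHA page instead of the results. I have not found a way around this; all I can suggest is changing the values in headers and trying again, and if that still fails, waiting a while before the next attempt.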
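The "list index out of range" error comes from applying [0] to an empty XPath result. A minimal sketch of a more defensive variant follows. The names first and parse_items are hypothetical, the relative XPaths mirror the ones in the article, and treating a missing J_ItemList as a sign of a blocked request is an assumption rather than something the site documents.

```python
from lxml import etree


def first(results, default=None):
    """Return the first XPath result, or default when the list is empty."""
    return results[0] if results else default


def parse_items(html):
    """Return one dict per product, or None when no product list is found."""
    if not html or not html.strip():
        return None
    tree = etree.HTML(html)
    if tree is None:
        return None
    items = tree.xpath('//*[@id="J_ItemList"]/div')
    if not items:
        # Assumption: no product list means a verification page was served
        return None
    rows = []
    for item in items:
        # Relative XPaths are evaluated from each product node
        rows.append({
            'price': first(item.xpath('./div/p[1]/em/text()')),
            'name': first(item.xpath('./div/p[2]/a/@title')),
            'link': first(item.xpath('./div/div[1]/a/@href')),
        })
    return rows
```

A caller can then check for None and back off before retrying, while a single missing field shows up as None in the row instead of aborting the whole page.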

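[Screenshot: the fields scraped so far]

So far we have scraped the information shown above.

Next, we scrape the information on the product pages themselves.

XPath works here too and retrieves most of the information, such as the shop rating (located at the top of the page).

The popularity value and the product reviews are a different matter: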

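[Screenshot: popularity value and product reviews on the product page]

Extracting them with XPath returns an empty list.

This is because they are loaded asynchronously. Right-click -> Inspect -> Network,

and find the following request:

[Screenshot: the counter request in the Network panel]

The last number in the request's Preview content is the popularity value we need.

Checking several products confirms this.

So the approach is:

Build the URL of this request for each product, download its content, and use a regular expression to match the required number: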

import re
import urllib.request

# Prefix shared by every product's counter URL; the product ID is appended to it
base = 'https://count.taobao.com/counter3?_ksTS=1621334983260_268&callback=jsonp269&keys=SM_368_dsr-901409638,ICCP_1_'
reg = re.compile(r"\d*[ ]")

for item in link:  # product page URLs collected from the listing pages
    print(item)
    response = gethtml(item, headers)
    parser = etree.HTML(response, etree.HTMLParser())

    # The data-aldurl attribute contains the product ID
    item_id = parser.xpath('//*[@id="J_AddFavorite"]/@data-aldurl')[0]
    # The pattern matches the digits plus the space after them, so strip the space
    target_id = reg.search(item_id).group().strip()
    target_url = base + target_id
    print(target_url)

    a = urllib.request.urlopen(target_url)  # open the counter URL
    po_html = a.read()                      # read the raw response bytes
    po_html = po_html.decode("utf-8")       # decode the bytes to str
    print(po_html)                          # print the response text

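Here I used the urlopen function from the urllib package to fetch the response.
I also tried requests.get, but the content it returned differed from what the browser shows: everything else matched, except that the popularity value was 0.
With urlopen the problem does not occur.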
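The code above only prints the raw response. A minimal sketch of pulling the number out of it follows. The sample string is made up: the article's URL asks for a callback named jsonp269, so the sketch assumes the response is a JSONP wrapper around a JSON object whose ICCP_1_ key holds the popularity value. parse_jsonp is a hypothetical helper name.

```python
import json
import re


def parse_jsonp(text):
    """Strip the callback wrapper from a JSONP response and decode the JSON payload."""
    match = re.search(r'\((.*)\)', text, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))


# Hypothetical response text; the real payload may be shaped differently
sample = 'jsonp269({"SM_368_dsr-901409638":4.8,"ICCP_1_123456789":4321});'

data = parse_jsonp(sample)
# Take the value stored under the key that starts with ICCP_1_
popularity = next((v for k, v in data.items() if k.startswith('ICCP_1_')), None)
print(popularity)
```

Decoding the payload avoids relying on the position of the number in the text, at the cost of failing if the payload turns out not to be valid JSON.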
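Now for the pattern in this request's URL,
which is already reflected in the code:
base is the part of the popularity URL that is common to all product pages.
The product ID follows ICCP_1_.
The ID comes from item_id, which is first extracted from the page with XPath and then isolated with a regular expression.
The popularity URL is then base + target_id.
This pattern is easy to spot after comparing the popularity URLs of a few products.
When extracting the popularity value, note the following: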
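The rule above can be sketched as a small function. BASE is copied from the article; counter_url is a hypothetical name, and the sketch assumes the product ID has already been isolated as a string of digits.

```python
BASE = ('https://count.taobao.com/counter3?_ksTS=1621334983260_268'
        '&callback=jsonp269&keys=SM_368_dsr-901409638,ICCP_1_')


def counter_url(item_id):
    """Return the counter URL for a numeric product ID (hypothetical helper)."""
    item_id = item_id.strip()  # drop stray whitespace so it does not end up in the URL
    if not item_id.isdigit():
        raise ValueError('expected a numeric product ID, got %r' % item_id)
    return BASE + item_id


print(counter_url('123456789'))
```

Validating the ID before building the URL turns a bad regular-expression match into an immediate error instead of a malformed request.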

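def gethtml(url, headers):
    try:
        response = requests.get(url, headers=headers)
        response.encoding = 'utf-8'  # decode the response body as UTF-8
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

Note the encoding: in this version response.encoding is set to 'utf-8'.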
