Following this process, the crawl splits into two steps:
1. Search for questions by keyword (e.g. Java), giving url=https://www.zhihu.com/search?type=content&q=java; crawl that page for every question and its question id.
2. For each question id from step 1, build the question url, e.g. url=https://www.zhihu.com/question/31437847, and crawl all of the answers on that page.
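In step 1 the question id is pulled out of each search result's href with a regular expression. As a standalone illustration of just that step (the href value here is made up for the example):

import re

# A hypothetical href as it appears in a result's <link itemprop="url"> tag.
href = '/question/31437847/answer/123456789'
pattern = re.compile('/question/(.*?)/answer/(.*?)$', re.S)
question_id = re.findall(pattern, href)[0][0]
print('https://www.zhihu.com/question/' + question_id)  # the url used in step 2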
The full code is below (also at https://github.com/tianyunzqs/crawler/tree/master/zhihu):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
from urllib import request, parse, error
from bs4 import BeautifulSoup

keyword_list = ['svm', '支持向量机', 'libsvm']
fout = open("E:/python_file/zhihu.txt", "w", encoding="utf-8")
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) ' \
             'Chrome/39.0.2171.95 Safari/537.36'
headers = {'User-Agent': user_agent}

# Step 1: search each keyword and collect {question title: question url}.
# Note: this dict must be initialised outside the loop, otherwise each
# keyword would overwrite the results of the previous ones.
keyword_question_url_list = {}
for keyword in keyword_list:
    print(keyword)
    url = 'https://www.zhihu.com/search?type=content&q=' + parse.quote(keyword)
    try:
        req = request.Request(url, headers=headers)
        response = request.urlopen(req, timeout=5)
        content = response.read().decode('utf-8')
        soup = BeautifulSoup(content, 'html.parser')
        all_div = soup.find_all('li', attrs={'class': re.compile('item clearfix.*?')})
        question_url_list = {}
        for e_div in all_div:
            # Question-type result: the title link points at /question/<id>.
            title = e_div.find_all('a', attrs={'class': 'js-title-link',
                                               'target': '_blank',
                                               'href': re.compile('/question/[0-9]+')})
            if title:
                title = title[0].text
                # The <link itemprop="url"> tag carries both the question and answer ids.
                _id = e_div.find_all('link', attrs={'itemprop': 'url',
                                                    'href': re.compile('/question/[0-9]+/answer/[0-9]+')})
                href = _id[0].attrs.get('href')
                pattern = re.compile('/question/(.*?)/answer/(.*?)$', re.S)
                items = re.findall(pattern, href)
                question_id = items[0][0]
                question_url_list[title] = 'https://www.zhihu.com/question/' + question_id
            else:
                # Column-article result: the href is already a full zhuanlan url.
                title_id = e_div.find_all('a', attrs={'class': 'js-title-link',
                                                      'target': '_blank',
                                                      'href': re.compile('https://zhuanlan.zhihu.com/p/[0-9]+')})
                if title_id:
                    title = title_id[0].text
                    href = title_id[0].attrs.get('href')
                    question_url_list[title] = href
                else:
                    continue
        keyword_question_url_list[keyword] = question_url_list
        # for q, d in question_url_list.items():
        #     print(q, d)
    except Exception:
        continue

# Step 2: fetch every question page and write out the answers found
# in the static HTML.
for keyword, question_url_list in keyword_question_url_list.items():
    for question, url in question_url_list.items():
        fout.write(question + "\n")
        try:
            req = request.Request(url, headers=headers)
            with request.urlopen(req, timeout=5) as response:
                content = response.read().decode('utf-8')
                soup = BeautifulSoup(content, 'html.parser')
                all_div = soup.find_all('div', attrs={'class': 'List-item'})
                for e_div in all_div:
                    answer = e_div.find_all('span', attrs={'class': 'RichText CopyrightRichText-richText',
                                                           'itemprop': 'text'})
                    if answer:  # skip list items that are not answers
                        fout.write(answer[0].text + "\n")
        except error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
fout.close()
Known issues:
The program above handles step 1 well, but for step 2 it only captures the first two answers of each question. This is presumably because the remaining answers are loaded dynamically as the user scrolls, so they never appear in the static HTML that urllib fetches. From what I have read, this should be solvable with Selenium + PhantomJS; I will try that later.
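As a rough, untested sketch of that approach: drive a browser, keep scrolling until no new answers are appended, then parse the fully rendered page. The scrolling loop is an assumption about how Zhihu lazy-loads answers, and the selectors are carried over from the code above rather than re-verified; note also that recent Selenium releases have dropped PhantomJS support in favour of headless Chrome/Firefox.

# Sketch only: the lazy-loading behaviour and the selectors are assumptions.
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()  # or a headless Chrome/Firefox driver
driver.get('https://www.zhihu.com/question/31437847')

# Keep scrolling until the document height stops growing,
# i.e. no more answers are being appended.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the dynamically loaded answers time to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

# Parse the fully rendered DOM with the same selectors as above.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for span in soup.find_all('span', attrs={'class': 'RichText CopyrightRichText-richText',
                                         'itemprop': 'text'}):
    print(span.text)
driver.quit()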