Web Crawlers --- 04. Proxy Operations

Reposted

mob604756e97f09 2021-11-01 11:29:00


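Proxy operations
- In web crawling, a "proxy" means a proxy server.
- It forwards requests and responses between the crawler and the target server.
- When a crawler sends requests to a server at high frequency, the server can detect this abnormal behavior and restrict the client, so that further requests fail.
- Once an IP is banned, requests can be forwarded through a proxy server instead, which gets around the IP-ban anti-crawling mechanism.
- Proxy server categories (by anonymity)
  - Transparent proxy: the server knows you are using a proxy and also knows your real IP.
  - Anonymous proxy: the server knows you are using a proxy but does not know your real IP.
  - High-anonymity proxy: the server knows neither that you are using a proxy nor your real IP.
- Proxy types (by protocol)
  - https: the proxy can only forward requests made over HTTPS.
  - http: the proxy forwards requests made over HTTP.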
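
The two classifications above can be sketched in code. This is a minimal illustration under stated assumptions, not a definitive test: `classify_anonymity` and `pick_proxy` are hypothetical helper names, and the header-based rule (looking at `Via` and `X-Forwarded-For` as received by the target server) is a commonly used heuristic assumed here, not something the original states. `pick_proxy` illustrates the idea that a proxies mapping is keyed by the scheme of the requested URL.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def classify_anonymity(seen_headers: Dict[str, str], real_ip: str) -> str:
    """Guess the proxy category from the headers the target server received.

    Assumed heuristic: a transparent proxy leaks the real IP in
    X-Forwarded-For, an anonymous proxy reveals itself (Via or
    X-Forwarded-For present) without the real IP, and a high-anonymity
    proxy adds neither header.
    """
    headers = {k.lower(): v for k, v in seen_headers.items()}
    forwarded = headers.get("x-forwarded-for", "")
    if real_ip in [part.strip() for part in forwarded.split(",")]:
        return "transparent"
    if "via" in headers or forwarded:
        return "anonymous"
    return "high-anonymity"


def pick_proxy(url: str, proxies: Dict[str, str]) -> Optional[str]:
    """Return the proxy registered for the URL's scheme, or None if there is none."""
    scheme = urlsplit(url).scheme.lower()
    return proxies.get(scheme)
```

With a mapping that only has an `'http'` entry, `pick_proxy` returns that entry for an `http://` URL and `None` for an `https://` URL, which is why a proxy of the wrong type would simply not be used for the request.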

Example code

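    import random

    import requests
    from lxml import etree

    # Placeholder request headers (assumed; the original never defines `headers`)
    headers = {'User-Agent': 'Mozilla/5.0'}

    # Step 1: extract proxy IP addresses from a proxy-list page
    url = ""  # URL of the proxy-list page (left blank in the original)
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    # The class name is left blank in the original; use the one from the target page
    proxy_lst = tree.xpath("//div[@class='']//text()")
    http_proxy = []
    for proxy in proxy_lst:
        dic = {
            'http': proxy  # expected form: 'ip:port'
        }
        http_proxy.append(dic)
    print(http_proxy)

    # Step 2: crawl pages 1 to 10 of the target site through a proxy
    url = ""  # URL template with a %d page-number placeholder (left blank in the original)
    ips = []
    for page in range(1, 11):
        new_url = url % page
        # Pick a random proxy per request; a fixed one would be proxies={'http': 'ip:port'}
        proxies = random.choice(http_proxy)
        page_text = requests.get(url=new_url, headers=headers, proxies=proxies).text
        tree = etree.HTML(page_text)
        # The tbody tag must not appear in the XPath expression
        tr_list = tree.xpath('//*[@id="ip_list"]//tr')
        for tr in tr_list:
            # The XPath for the IP cell is left blank in the original
            ip = tr.xpath("")
            ips.append(ip)
    print(len(ips))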
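
The example above picks a proxy at random but does not filter the scraped text nodes or retry when a proxy fails. The following is a minimal sketch of both ideas, using only the standard library so that it can be run without network access. `build_proxy_pool`, `fetch_with_rotation`, and the `fetch` callback are hypothetical names introduced for illustration, and the `ip:port` pattern is an assumption about what the proxy-list page returns.

```python
import random
import re
from typing import Callable, Dict, List

# Loose check for strings such as "1.2.3.4:8080" (octets are not range-validated).
ADDRESS_RE = re.compile(r"^\d{1,3}(?:\.\d{1,3}){3}:\d{1,5}$")


def build_proxy_pool(raw_texts: List[str]) -> List[Dict[str, str]]:
    """Keep only text nodes that look like ip:port and wrap each in a proxies dict."""
    pool = []
    for text in raw_texts:
        address = text.strip()
        if ADDRESS_RE.match(address):
            pool.append({"http": "http://" + address})
    return pool


def fetch_with_rotation(
    url: str,
    pool: List[Dict[str, str]],
    fetch: Callable[[str, Dict[str, str]], str],
    max_tries: int = 3,
) -> str:
    """Try up to max_tries randomly chosen proxies; return the first successful body."""
    if not pool:
        raise ValueError("proxy pool is empty")
    last_error = None
    for _ in range(max_tries):
        proxies = random.choice(pool)
        try:
            return fetch(url, proxies)
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
    raise RuntimeError("all proxy attempts failed") from last_error
```

With `requests`, the `fetch` argument could be something like `lambda u, p: requests.get(u, headers=headers, proxies=p, timeout=5).text`. Passing the fetch step in as a callback keeps the rotation logic independent of the HTTP library and easy to test with a fake function.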