One day I received an exciting piece of fake news, which sent me off to a public-information site to look up the locations of designated pharmacies. The attempt ultimately failed, but the process was fun, and writing it up gets me another easy article. The original post follows:


The page's built-in search was limited, so I wrote a crawler. The approach took two steps: first collect each pharmacy's name and numeric ID from the listing pages, then query the pharmacy's detailed address by that ID.
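The two-step pipeline can be sketched as follows. This is a minimal Python stand-in: `fetch_list_page` and `fetch_address` are hypothetical stubs returning canned HTML in place of the real POST/GET calls, so only the pipeline structure is shown, not the actual endpoints.

```python
import re

# Hypothetical stub for step 1: the real version would POST to the
# paginated list endpoint and return the HTML of one result page.
def fetch_list_page(page):
    return ('<tr><td>1</td><td>Pharmacy A</td></tr>'
            '<tr><td>2</td><td>Pharmacy B</td></tr>')

# Hypothetical stub for step 2: the real version would GET the detail
# page for one shop ID and return its HTML.
def fetch_address(shop_id):
    return {1: '<tr><td>1 Main St</td></tr>',
            2: '<tr><td>2 High St</td></tr>'}[shop_id]

def strip_tags(row):
    # Replace every HTML tag with a space, then collapse whitespace.
    return ' '.join(re.sub(r'<.*?>', ' ', row).split())

def crawl(pages):
    results = []
    for page in range(1, pages + 1):
        # Step 1: pull the <tr> rows and read the shop ID from each.
        rows = re.findall(r'<tr>.*?</tr>', fetch_list_page(page))
        for row in rows:
            shop_id = int(strip_tags(row).split()[0])
            # Step 2: fetch the detail rows for that ID and flatten
            # them into a single address string.
            addr_rows = re.findall(r'<tr>.*?</tr>', fetch_address(shop_id))
            address = ' '.join(strip_tags(r) for r in addr_rows)
            results.append((shop_id, address))
    return results
```

Swapping the two stubs for real HTTP calls gives the same shape as the Groovy code below: list pages feed IDs into per-shop detail lookups.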

The site holds only 773 pharmacy records in total, so I knocked out a quick crawler in ten-odd minutes. The code is rough, but it worked well enough. Sharing it below:

package com.fun

import com.fun.frame.Save
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import com.fun.utils.WriteRead

import net.sf.json.JSONObject
import org.apache.http.client.methods.HttpGet
import org.apache.http.client.methods.HttpPost

class sd extends FanLibrary {

    static names = []

    public static void main(String[] args) {
        test2(1)
        // Step 1: crawl all 52 list pages to collect shop names and IDs.
//        52.times {
//            test(it + 1)
//        }
//        Save.saveStringList(names, "hospital")

        // Step 2: read the saved list back and look up each shop's address by ID.
        def line = WriteRead.readTxtFileByLine(LONG_Path + "hospital")
        line.each {
            def s = it.split(" ", 3)[1]    // the second field is the shop ID
            output(s)
            it += test2(changeStringToInt(s))
            names << it
        }
        Save.saveStringList(names, "address.txt")
        testOver()
    }

    // Fetch one page of the shop list and collect its <tr> rows as plain text.
    static def test(int i) {
        String url = "http://ybj.***./ddyy/ddyy2/list"
        HttpPost httpPost = getHttpPost(url, getJson("page=" + i))
        JSONObject response = getHttpResponse(httpPost)
        def all = response.getString("content").replaceAll("\\s", EMPTY)
        def infos = Regex.regexAll(all, "<tr>.*?</tr>")
        infos.remove(0)    // drop the header row
        output(infos)
        output(infos.size())
        infos.each { x ->
            names << x.substring(4).replaceAll("<.*?>", SPACE_1)
        }
        output(names.size())
    }

    // Fetch the detail page for one shop ID and return its address text.
    static def test2(int i) {
        String url = "http://ybj.***./ddyy/ddyy2/findByName?id=" + i
        HttpGet httpGet = getHttpGet(url)
        JSONObject response = getHttpResponse(httpGet)
        def all = response.getString("content").replaceAll("\\s", EMPTY)
        def infos = Regex.regexAll(all, "<tr>.*?</tr>")
        output(infos)
        def address = EMPTY
        infos.each { x ->
            address += x.substring(4).replaceAll("<.*?>", SPACE_1)
        }
        output(address)
        return address
    }
}

Here are the page structures behind the two crawl requests:

[Screenshots: page structure of the two crawl targets]