One day I received an exciting piece of fake news, which sent me off to a public-information site to look up the locations of designated pharmacies. The attempt ultimately failed, but the process was fun, and writing it up gets me another easy article. The original post follows:


The page's built-in search was limited, so I wrote a crawler. The approach took two steps: first collect each pharmacy's name and numeric ID from the listing pages, then query the pharmacy's detailed address by that ID.
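The two-step pipeline can be sketched as follows. This is a minimal Python stand-in: `fetch_list_page` and `fetch_address` are hypothetical stubs returning canned HTML in place of the real POST/GET calls, so only the pipeline structure is shown, not the actual endpoints.

```python
import re

# Hypothetical stub for step 1: the real version would POST to the
# paginated list endpoint and return the HTML of one result page.
def fetch_list_page(page):
    return ('<tr><td>1</td><td>Pharmacy A</td></tr>'
            '<tr><td>2</td><td>Pharmacy B</td></tr>')

# Hypothetical stub for step 2: the real version would GET the detail
# page for one shop ID and return its HTML.
def fetch_address(shop_id):
    return {1: '<tr><td>1 Main St</td></tr>',
            2: '<tr><td>2 High St</td></tr>'}[shop_id]

def strip_tags(row):
    # Replace every HTML tag with a space, then collapse whitespace.
    return ' '.join(re.sub(r'<.*?>', ' ', row).split())

def crawl(pages):
    results = []
    for page in range(1, pages + 1):
        # Step 1: pull the <tr> rows and read the shop ID from each.
        rows = re.findall(r'<tr>.*?</tr>', fetch_list_page(page))
        for row in rows:
            shop_id = int(strip_tags(row).split()[0])
            # Step 2: fetch the detail rows for that ID and flatten
            # them into a single address string.
            addr_rows = re.findall(r'<tr>.*?</tr>', fetch_address(shop_id))
            address = ' '.join(strip_tags(r) for r in addr_rows)
            results.append((shop_id, address))
    return results
```

Swapping the two stubs for real HTTP calls gives the same shape as the Groovy code below: list pages feed IDs into per-shop detail lookups.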

The site holds only 773 pharmacy records in total, so I knocked out a quick crawler in ten-odd minutes. The code is rough, but it worked well enough. Sharing it below:

package com.fun

import com.fun.frame.Save
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import com.fun.utils.WriteRead

import net.sf.json.JSONObject
import org.apache.http.client.methods.HttpGet
import org.apache.http.client.methods.HttpPost

class sd extends FanLibrary {

    static names = []

    public static void main(String[] args) {
        test2(1)
        // Step 1: crawl all 52 list pages to collect shop names and IDs.
//        52.times {
//            test(it + 1)
//        }
//        Save.saveStringList(names, "hospital")

        // Step 2: read the saved list back and look up each shop's address by ID.
        def line = WriteRead.readTxtFileByLine(LONG_Path + "hospital")
        line.each {
            def s = it.split(" ", 3)[1]    // the second field is the shop ID
            output(s)
            it += test2(changeStringToInt(s))
            names << it
        }
        Save.saveStringList(names, "address.txt")
        testOver()
    }

    // Fetch one page of the shop list and collect its <tr> rows as plain text.
    static def test(int i) {
        String url = "http://ybj.***./ddyy/ddyy2/list"
        HttpPost httpPost = getHttpPost(url, getJson("page=" + i))
        JSONObject response = getHttpResponse(httpPost)
        def all = response.getString("content").replaceAll("\\s", EMPTY)
        def infos = Regex.regexAll(all, "<tr>.*?</tr>")
        infos.remove(0)    // drop the header row
        output(infos)
        output(infos.size())
        infos.each { x ->
            names << x.substring(4).replaceAll("<.*?>", SPACE_1)
        }
        output(names.size())
    }

    // Fetch the detail page for one shop ID and return its address text.
    static def test2(int i) {
        String url = "http://ybj.***./ddyy/ddyy2/findByName?id=" + i
        HttpGet httpGet = getHttpGet(url)
        JSONObject response = getHttpResponse(httpGet)
        def all = response.getString("content").replaceAll("\\s", EMPTY)
        def infos = Regex.regexAll(all, "<tr>.*?</tr>")
        output(infos)
        def address = EMPTY
        infos.each { x ->
            address += x.substring(4).replaceAll("<.*?>", SPACE_1)
        }
        output(address)
        return address
    }
}

Here are the page structures behind the two crawl requests:

[Screenshots: page structure of the two crawl targets]