Introduction to HttpClient and HtmlParser

Reposted

有一只柴犬 2023-05-04 14:34:06


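I've recently wanted to build a web crawler and read up on the relevant material. Here is a brief introduction to two important open-source components, HttpClient and HtmlParser:
1. HttpClient is an Apache open-source project: a feature-rich, efficient toolkit for client-side HTTP programming. It supports HTTP methods including GET, POST, PUT, and HEAD, follows redirects automatically, supports HTTPS (HTTP over an encrypted connection, and therefore more secure than plain HTTP), and supports proxies (you can specify a proxy server through which requests are sent and responses received), along with other commonly used features.
2. HtmlParser, i.e. an HTML parser, analyzes HTML efficiently and accurately. With HttpClient, fetching page content is no longer a problem, but fetching is not the goal; what matters is processing the data once you have it. Extracting the title, analyzing the body text, pulling out keywords and so on all build on processing the HTML tags, and that is what HtmlParser does.
I won't go into how to download them. Below are two simple demos: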

(1) Downloading a page's source with HttpClient

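 import java.io.BufferedReader;
 import java.io.IOException;
 import java.io.InputStream;
 import java.io.InputStreamReader;
 import org.apache.http.HttpEntity;
 import org.apache.http.HttpResponse;
 import org.apache.http.client.ClientProtocolException;
 import org.apache.http.client.HttpClient;
 import org.apache.http.client.methods.HttpGet;
 import org.apache.http.impl.client.DefaultHttpClient;
 import org.apache.http.util.EntityUtils;

 public void testGetSource() {
   HttpClient client = new DefaultHttpClient(); // the default client (deprecated since HttpClient 4.3)
   HttpGet httpGet = new HttpGet("http://www.baidu.com/"); // an HTTP GET request
   try {
     // Send the GET request; equivalent to typing http://www.baidu.com/ into the browser's address bar.
     // The response holds everything the server returned: status line, headers, and body.
     HttpResponse response = client.execute(httpGet);
     HttpEntity entity = response.getEntity(); // the body of the response
     if (entity == null) {
       return; // no body to print
     }
     // EntityUtils reports the charset declared by the page. Decoding with the wrong
     // charset easily produces garbled text, so fall back to UTF-8 when none is declared.
     String pageCharSet = EntityUtils.getContentCharSet(entity);
     if (pageCharSet == null) {
       pageCharSet = "UTF-8";
     }
     InputStream ins = entity.getContent(); // input stream over the body
     // Wrap the stream in a buffered reader so it can be read line by line
     BufferedReader br = new BufferedReader(new InputStreamReader(ins, pageCharSet));
     try {
       String lineString;
       while ((lineString = br.readLine()) != null) {
         System.out.println(lineString); // print to the console
       }
     } finally {
       br.close(); // closing the reader also closes the underlying stream
     }
   } catch (ClientProtocolException e) {
     e.printStackTrace(); // HTTP protocol error
   } catch (IOException e) {
     e.printStackTrace(); // connection problem, or an unsupported charset name
   }
 }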
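
The charset step above matters because the encoding normally arrives inside the Content-Type header (for example `text/html; charset=ISO-8859-1`). As a rough illustration of what a helper like EntityUtils has to do, here is a minimal sketch in plain Java. The class name `ContentTypeCharset` and the method `charsetOf` are hypothetical names made up for this example, and this is not HttpClient's actual implementation.

```java
import java.nio.charset.Charset;
import java.util.Locale;

public class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type header value,
     * or the fallback when none is declared or the name is unusable.
     * (Hypothetical helper, for illustration only.)
     */
    public static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (!p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                continue;
            }
            String name = p.substring("charset=".length()).trim();
            // The value may be quoted: charset="utf-8"
            if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                name = name.substring(1, name.length() - 1);
            }
            try {
                if (!name.isEmpty() && Charset.isSupported(name)) {
                    return name;
                }
            } catch (IllegalArgumentException e) {
                // Illegal charset name: ignore it and keep looking.
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", "UTF-8")); // ISO-8859-1
        System.out.println(charsetOf("text/html", "UTF-8"));                     // UTF-8
    }
}
```

Returning a fallback instead of null keeps the caller from passing a null charset name to InputStreamReader, which would throw.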

 (2) Getting all the links in an HTML page with HtmlParser

         The first, simple approach:

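 import java.io.IOException;
 import java.net.URL;
 import org.htmlparser.beans.LinkBean;

 public void testGetByLinkBeans() throws IOException {
   LinkBean bean = new LinkBean(); // a helper class dedicated to extracting links
   bean.setURL("http://www.baidu.com/"); // use Baidu's address as the page to process
   URL[] urls = bean.getLinks(); // the links found on the Baidu page
   for (int i = 0; i < urls.length; i++) {
     System.out.println("toString is : " + urls[i].toString()); // list each link
   }
 }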
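
LinkBean hands back `URL` objects, whereas the `href` values in a page are often relative (`today.html`, `/about`, `../img/logo.png`). The following is a minimal sketch of how relative links can be turned into absolute ones with the JDK's `java.net.URI`. It is an illustration only, not HtmlParser's own code; the class name `LinkResolver`, the method `toAbsolute`, and the sample paths are all made up for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LinkResolver {

    /**
     * Resolves each href against the address of the page it was found on.
     * Malformed hrefs are skipped. (Hypothetical helper, for illustration only.)
     */
    public static List<String> toAbsolute(String pageUrl, List<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            try {
                // resolve() leaves absolute links alone and rebases relative ones.
                result.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // Not a valid URI reference (e.g. contains a space): skip it.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "today.html", "/about", "../img/logo.png", "http://example.com/x");
        for (String link : toAbsolute("http://www.baidu.com/news/index.html", hrefs)) {
            System.out.println(link);
        }
    }
}
```

The base address here deliberately includes a path; resolving against a base with an empty path is a known rough edge of `URI.resolve`, so a real crawler may want to normalize the base first.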

   The second approach uses a filter:

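 import org.htmlparser.NodeFilter;
 import org.htmlparser.Parser;
 import org.htmlparser.filters.TagNameFilter;
 import org.htmlparser.tags.LinkTag;
 import org.htmlparser.util.NodeList;
 import org.htmlparser.util.ParserException;
 import org.htmlparser.visitors.HtmlPage;

 public void testGetLinksAndText() {
   String url = "http://www.baidu.com/"; // assumed: the original leaves url undefined; the other demos use Baidu
   // SimpleClient is the author's own helper (not shown); think of it as the HttpClient from the first demo
   SimpleClient client = new SimpleClient(url);
   String htmlString = client.getHtmlString(); // fetch the page source
   // Initialize a Parser from the source and the page's character encoding
   Parser parser = Parser.createParser(htmlString, client.getPageCharSet());
   HtmlPage page = new HtmlPage(parser); // HtmlPage collects the result of the parse
   try {
     parser.visitAllNodesWith(page); // actually walk the tags in htmlString
   } catch (ParserException e) {
     e.printStackTrace();
     return; // parsing failed, so there is no body to inspect
   }
   NodeList nodeList = page.getBody(); // the contents of the HTML body
   NodeFilter filter = new TagNameFilter("A"); // a tag-name filter that keeps only "A" (hyperlink) tags
   // Apply the filter. Passing true searches nested nodes as well;
   // the one-argument form only checks the top-level nodes of the list.
   nodeList = nodeList.extractAllNodesThatMatch(filter, true);
   for (int i = 0; i < nodeList.size(); i++) { // iterate over all the "A" tags found
     LinkTag link = (LinkTag) nodeList.elementAt(i); // each one is a link
     // print the href and the link text
     System.out.println("link-" + (i + 1) + ":" + link.getLink() + " : " + link.getLinkText());
   }
 }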
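
What a tag-name filter does can be pictured as a walk over the parsed node tree that keeps only the nodes whose tag matches. The sketch below illustrates that idea in plain Java with a made-up `Node` type; `TagFilterSketch`, `Node`, and `extract` are hypothetical names and this is not HtmlParser's implementation. It also shows why searching recursively matters: links nested inside other tags are missed by a top-level-only pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class TagFilterSketch {

    /** Hypothetical minimal node: a tag name plus its child nodes. */
    public static final class Node {
        public final String tag;
        public final List<Node> children;

        public Node(String tag, Node... children) {
            this.tag = tag;
            this.children = Arrays.asList(children);
        }
    }

    /**
     * Collects the nodes accepted by the filter, in document order.
     * With recursive=false only the given nodes are checked;
     * with recursive=true their descendants are checked too.
     */
    public static List<Node> extract(List<Node> nodes, Predicate<Node> filter, boolean recursive) {
        List<Node> matches = new ArrayList<>();
        for (Node node : nodes) {
            if (filter.test(node)) {
                matches.add(node);
            }
            if (recursive) {
                matches.addAll(extract(node.children, filter, true));
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Body contents: <a/> <div> <a/> <p> <a/> </p> </div>
        List<Node> body = Arrays.asList(
                new Node("A"),
                new Node("DIV", new Node("A"), new Node("P", new Node("A"))));
        Predicate<Node> isAnchor = n -> n.tag.equalsIgnoreCase("A");

        System.out.println(extract(body, isAnchor, false).size()); // 1 (top level only)
        System.out.println(extract(body, isAnchor, true).size());  // 3 (nested links included)
    }
}
```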

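Clearly the second approach takes more work than the first, but the two serve different purposes, so the method differs accordingly. That's all for this introduction.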