This small vertical crawler is implemented as follows.

The idea is simple. Starting from the main method, here is a quick walk-through of the whole flow. Step 1: collect the URLs to crawl. For the container I chose ConcurrentLinkedQueue, a non-blocking queue implemented on top of Unsafe under the hood; what I am after is its thread safety.

The main-method code is as follows:

static String url = "http://www.qlu.edu.cn/38/list.htm";

// Build the url task list: list1.htm ... list19.htm
public static ConcurrentLinkedQueue<String> add(ConcurrentLinkedQueue<String> queue) {
    for (int i = 1; i <= 19; i++) {
        String subString = StringUtils.substringBefore(url, ".htm");
        queue.add(subString + i + ".htm");
    }
    return queue;
}

public static void main(String[] args) throws IOException {
    ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
    queue.add(url);
    // add() fills and returns the same queue instance
    ConcurrentLinkedQueue<String> newQueue = add(queue);
    // Download and parse with the thread pool
    TPoolForDownLoadRootUrl.downLoadRootTaskPool(queue);
}
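Since this same queue is later polled from several pool threads at once, the thread safety mentioned in step 1 is exactly what matters. As a quick aside (not part of the crawler itself), a minimal sketch of concurrent draining: multiple threads can call poll() on one ConcurrentLinkedQueue without any external locking, and poll() simply returns null once the queue is empty.

import java.util.concurrent.ConcurrentLinkedQueue;

public class QueueDemo {
    public static void main(String[] args) throws InterruptedException {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        for (int i = 1; i <= 10; i++) {
            queue.add("task-" + i);
        }
        // Two workers drain the same queue concurrently; the CAS-based
        // offer/poll means no explicit synchronization is needed.
        Runnable worker = () -> {
            String url;
            while ((url = queue.poll()) != null) {
                System.out.println(Thread.currentThread().getName() + " got " + url);
            }
        };
        Thread t1 = new Thread(worker);
        Thread t2 = new Thread(worker);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
    }
}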

Step 2: hand the URL list to a thread pool.

The pool I use is newCachedThreadPool, which allocates worker threads dynamically according to the number of submitted tasks.
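As a quick illustration (separate from the crawler code further down), a cached pool creates a new worker whenever no idle thread is available and reuses threads that have finished, while a fixed pool caps the worker count; a minimal sketch:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolDemo {
    public static void main(String[] args) {
        // Cached pool: grows with the number of submitted tasks,
        // reuses idle workers, and retires threads idle for 60 seconds.
        ExecutorService cached = Executors.newCachedThreadPool();
        for (int i = 1; i <= 5; i++) {
            final int taskId = i;
            cached.execute(() ->
                    System.out.println(Thread.currentThread().getName() + " runs task " + taskId));
        }
        cached.shutdown();
    }
}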

Each task in the pool does a few things. The first is downloading the source HTML:

/**
 * Business logic for downloading HTML
 * @Author: Changwu
 * @Date: 2019/3/24 11:13
 */
public class downLoadHtml {
    public static Logger logger = Logger.getLogger(downLoadHtml.class);

    /**
     * Download the page source for the given url
     * @param url
     * @return the raw html
     */
    public static String downLoadHtmlByUrl(String url) throws IOException {
        CloseableHttpClient httpClient = HttpClients.createDefault();
        HttpGet httpGet = new HttpGet(url);
        // Set a browser-like User-Agent request header
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36");

        CloseableHttpResponse response = httpClient.execute(httpGet);
        logger.info("Request " + url + " returned status code " + response.getStatusLine().getStatusCode());
        HttpEntity entity = response.getEntity();
        String result = EntityUtils.toString(entity, "utf-8");
        // Release the connection resources before returning
        response.close();
        httpClient.close();
        return result;
    }

Next, parse the root page. The goal is to get the URL of each news article page, because that is where the article body lives; while parsing, each entry is wrapped into a RootBean as we go:

/**
 * Parse the source html, wrap every news entry into a first-level bean, and return the list
 *
 * @param sourceHtml
 * @return
 */
public static List<RootBean> getRootBeanList(String sourceHtml) {
    LinkedList<RootBean> rootBeanList = new LinkedList<>();
    Document doc = Jsoup.parse(sourceHtml);
    // Each news entry is a <li> inside the #wp_news_w6 list
    Elements elements = doc.select("#wp_news_w6 ul li");
    String rootUrl = "http://www.qlu.edu.cn";

    for (Element element : elements) {
        RootBean rootBean = new RootBean();
        // Relative link to the article page, to be joined with the site root
        String href = element.child(0).child(0).attr("href");
        // Title text (may also contain the post date, separated by whitespace)
        String title = element.text();
        String[] split = title.split("\\s+");

        // If a date is present, cut the post time out of the news_meta span
        if (split.length >= 2) {
            String s = element.outerHtml();
            String regex = "class=\"news_meta\">.*";
            Pattern compile = Pattern.compile(regex);
            Matcher matcher = compile.matcher(s);
            if (matcher.find()) {
                String group = matcher.group(0);
                String ss = StringUtils.substring(group, 18);
                ss = StringUtils.substringBefore(ss, "</span> </li>");
                rootBean.setPostTime(ss);
            }
        }

        rootBean.setTitle(split[0]);
        rootBean.setUrl(rootUrl + href);

        rootBeanList.add(rootBean);
    }
    return rootBeanList;
}

The second-level tasks are handled in a similar way, except that regular expressions come into play. I never learned them properly, so I was completely lost when I needed them today, but I slowly pieced it together. The key here is to inspect the source HTML, then use the selectors Jsoup provides to pick out elements, cut and splice the content we want, and wrap it up.
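To make the regex part easier to follow, here is a tiny standalone example of the capture-group pattern used below, run against a made-up metadata string (the real text comes from the .arti_metas element; the crawler itself uses StringUtils.substringBefore, which does the same thing as the plain substring calls here):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // Made-up sample of what the article metadata text roughly looks like
        String text = "作者:张三 出处:新闻网 责任编辑:李四";
        // Group 1 captures everything from "作者:" up to "出处"
        Pattern p = Pattern.compile("(作者:.*出处)");
        Matcher m = p.matcher(text);
        if (m.find()) {
            String author = m.group(1);   // "作者:张三 出处"
            author = author.substring(3); // drop the "作者:" prefix
            author = author.substring(0, author.indexOf("出处")).trim(); // drop the trailing "出处"
            System.out.println(author);   // 张三
        }
    }
}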

Why do I call it a vertical crawler? Because it only works for my school's news site. Look at the code below: there is no way around cutting and splicing by hand. The most painful part is that out of 100 news items, 99 put the title in one tag, and there is always that one item that puts it in a different one; when that happens, the rules I just wrote have to be changed.

/**
 * Parse a second-level (article) page and wrap it into pojo beans
 *
 * @param htmlSouce
 * @return
 */
public static List<PojoBean> getPojoBeanByHtmlSource(String htmlSouce, RootBean bean) {

    LinkedList<PojoBean> list = new LinkedList<>();

    // Parse the article page
    Document doc = Jsoup.parse(htmlSouce);

    // The metadata block holds author, source and editor
    Elements elements1 = doc.select(".arti_metas");

    for (Element element : elements1) {
        // One bean per metadata block (created inside the loop so beans are not reused)
        PojoBean pojoBean = new PojoBean();

        String text = element.text();

        // Editor (责任编辑)
        String regex = "(责任编辑:.*)";
        Pattern compile = Pattern.compile(regex);
        Matcher matcher = compile.matcher(text);
        String editor = null;
        if (matcher.find()) {
            editor = matcher.group(1);
            editor = StringUtils.substring(editor, 5);
        }

        // Author (作者)
        regex = "(作者:.*出处)";
        compile = Pattern.compile(regex);
        matcher = compile.matcher(text);
        String author = null;
        if (matcher.find()) {
            author = matcher.group(1);
            author = StringUtils.substring(author, 3);
            author = StringUtils.substringBefore(author, "出处");
        }

        // Source (出处)
        regex = "(出处:.*责任编辑)";
        compile = Pattern.compile(regex);
        matcher = compile.matcher(text);
        String source = null;
        if (matcher.find()) {
            source = matcher.group(1);
            source = StringUtils.substring(source, 3);
            source = StringUtils.substringBefore(source, "责任编辑");
        }

        // Article body
        Elements EBody = doc.select(".wp_articlecontent");
        String body = EBody.first().text();

        // Fill the bean, partly from this page and partly from the first-level bean
        pojoBean.setAuthor(author);
        pojoBean.setBody(body);
        pojoBean.setEditor(editor);
        pojoBean.setSource(source);
        pojoBean.setUrl(bean.getUrl());
        pojoBean.setPostTime(bean.getPostTime());
        pojoBean.setTitle(bean.getTitle());
        list.add(pojoBean);
    }
    return list;
}
}
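For reference, RootBean and PojoBean are plain data holders. Their source is not shown in this post, so the sketch below only lists the fields implied by the getters and setters used above (each class in its own file, getters/setters omitted):

// First-level bean: one entry on the news list page
public class RootBean {
    private String title;    // news title
    private String url;      // absolute URL of the article page
    private String postTime; // publish time cut out of the news_meta span
    // getters and setters omitted
}

// Second-level bean: the fully parsed article
public class PojoBean {
    private String title;
    private String url;
    private String postTime;
    private String author;
    private String source;   // 出处
    private String editor;   // 责任编辑
    private String body;     // article text
    // getters and setters omitted
}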

Persistence uses plain low-level JDBC:

/**
 * Persist a single pojo
 * @param pojo
 */
public static void insertOnePojo(PojoBean pojo) throws ClassNotFoundException, SQLException {
    // Register the driver
    Class.forName("com.mysql.jdbc.Driver");
    // Open the connection
    Connection connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/spider", "root", "root");
    String sql = "insert into qluspider (title,url,post_time,insert_time,author,source,editor,body) values (?,?,?,?,?,?,?,?)";
    PreparedStatement ps = connection.prepareStatement(sql);
    // Fill in the placeholders
    ps.setString(1, pojo.getTitle());
    ps.setString(2, pojo.getUrl());
    // Convert the post-time string into a timestamp
    ps.setTimestamp(3, new java.sql.Timestamp(SpiderUtil.stringToDate(pojo.getPostTime()).getTime()));
    ps.setTimestamp(4, new java.sql.Timestamp(new Date().getTime()));
    ps.setString(5, pojo.getAuthor());
    ps.setString(6, pojo.getSource());
    ps.setString(7, pojo.getEditor());
    ps.setString(8, pojo.getBody());

    ps.execute();

    // Release JDBC resources
    ps.close();
    connection.close();
}
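SpiderUtil.stringToDate is not shown in the post. A minimal sketch of what it might look like, assuming the post time taken from the news_meta span is formatted like 2019-03-24 (the yyyy-MM-dd pattern is my assumption; adjust it to whatever the page actually uses):

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class SpiderUtil {
    /**
     * Convert a post-time string such as "2019-03-24" into a java.util.Date.
     * The "yyyy-MM-dd" pattern is an assumption about the page's date format.
     */
    public static Date stringToDate(String s) {
        try {
            return new SimpleDateFormat("yyyy-MM-dd").parse(s);
        } catch (ParseException e) {
            // Fall back to "now" so a malformed date does not break the insert
            return new Date();
        }
    }
}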

The new URLs obtained from the root pages are the second-level tasks. The thread pool that downloads and parses everything looks like this:

public static Logger logger = Logger.getLogger(TPoolForDownLoadRootUrl.class);

/**
 * Thread pool that downloads and parses the root urls
 */
public static void downLoadRootTaskPool(ConcurrentLinkedQueue<String> queue) {
    ExecutorService executor = Executors.newCachedThreadPool();
    //ExecutorService executor = Executors.newFixedThreadPool(5);

    // Capture the size up front: the workers below poll() concurrently,
    // so queue.size() would shrink while tasks are still being submitted.
    int taskCount = queue.size();
    for (int i = 1; i <= taskCount; i++) {
        executor.execute(new Runnable() {
            @Override
            public void run() {
                try {
                    logger.info("Pool #1 worker started, about to download and parse a root task");
                    // Take one root url from the queue
                    String url = queue.poll();

                    logger.info("Root URL == " + url);
                    if (StringUtils.isNotBlank(url)) {
                        // Download the root html for this url
                        String sourceHtml = downLoadHtml.downLoadHtmlByUrl(url);
                        // Parse every RootBean out of the root html
                        List<RootBean> rootBeanList = parseHtmlByJsoup.getRootBeanList(sourceHtml);
                        // Start the second-level tasks
                        for (RootBean rootBean : rootBeanList) {
                            logger.info(this + " entering second-level task");
                            String subUrl = rootBean.getUrl();
                            // Download the second-level (article) html
                            String htmlSouce = downLoadHtml.downLoadHtmlByUrl(subUrl);
                            // Parse and wrap into pojo beans
                            List<PojoBean> pojoList = parseHtmlByJsoup.getPojoBeanByHtmlSource(htmlSouce, rootBean);
                            // Persist
                            logger.info(this + " persisting second-level tasks from " + subUrl);
                            Persistence.insertPojoListToDB(pojoList);
                            logger.info("Persistence finished.......");
                        }
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });
    }
    // Stop accepting new tasks; already-submitted tasks keep running
    executor.shutdown();
}
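The pool calls Persistence.insertPojoListToDB, which is not shown above either. Presumably it just walks the list and delegates to insertOnePojo; a rough sketch of such a method, added to the same Persistence class:

/**
 * Persist a whole list of pojos by delegating to insertOnePojo (sketch).
 */
public static void insertPojoListToDB(List<PojoBean> pojoList) {
    for (PojoBean pojo : pojoList) {
        try {
            insertOnePojo(pojo);
        } catch (ClassNotFoundException | SQLException e) {
            // Log and keep going so one bad record does not abort the whole batch
            e.printStackTrace();
        }
    }
}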