Resources

WebMagic's architecture is modeled on Scrapy.
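
Like Scrapy, WebMagic splits a crawler into four cooperating parts: a Downloader, a PageProcessor, a Scheduler, and a Pipeline. The sketch below illustrates that division of labor in plain Java. It does not use WebMagic's real interfaces: the class name ToyCrawler, the in-memory FAKE_WEB map standing in for the network, and the regex-based extraction are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy crawler; not part of WebMagic.
class ToyCrawler {

    // Stand-in for the network: URL -> HTML body.
    static final Map<String, String> FAKE_WEB = new HashMap<>();
    static {
        FAKE_WEB.put("http://example.test/a",
                "<title>Page A</title><a href=\"http://example.test/b\">b</a>");
        FAKE_WEB.put("http://example.test/b",
                "<title>Page B</title><a href=\"http://example.test/a\">back</a>");
    }

    static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
    static final Pattern LINK = Pattern.compile("href=\"(.*?)\"");

    // Downloader role: turn a URL into page content (null if unknown).
    static String download(String url) {
        return FAKE_WEB.get(url);
    }

    static List<String> crawl(String startUrl) {
        Queue<String> scheduler = new ArrayDeque<>(); // Scheduler role: URLs waiting to be fetched
        Set<String> seen = new HashSet<>();           // de-duplication of scheduled URLs
        List<String> results = new ArrayList<>();     // Pipeline role: where extracted data ends up

        scheduler.add(startUrl);
        seen.add(startUrl);

        while (!scheduler.isEmpty()) {
            String url = scheduler.poll();
            String html = download(url);
            if (html == null) {
                continue;
            }
            // PageProcessor role: extract a field and discover new links.
            Matcher title = TITLE.matcher(html);
            if (title.find()) {
                results.add(url + " -> " + title.group(1));
            }
            Matcher link = LINK.matcher(html);
            while (link.find()) {
                if (seen.add(link.group(1))) {
                    scheduler.add(link.group(1));
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        for (String line : crawl("http://example.test/a")) {
            System.out.println(line);
        }
    }
}
```

In WebMagic itself, only the PageProcessor normally has to be written by hand; the other three roles have default implementations that can be swapped out.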

Project home page: http://webmagic.io/
GitHub repository: https://github.com/code4craft/webmagic
Documentation (Chinese): http://webmagic.io/docs/zh/


Environment setup

Create a new Maven project in IntelliJ IDEA.

1. Dependency configuration
WebMagicSpider/pom.xml

<dependencies>
    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-core</artifactId>
        <version>0.7.3</version>
    </dependency>

    <dependency>
        <groupId>us.codecraft</groupId>
        <artifactId>webmagic-extension</artifactId>
        <version>0.7.3</version>
        <!-- Exclude the slf4j-log4j12 binding that webmagic-extension pulls in -->
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-log4j12</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>

2. Logging configuration
WebMagicSpider/src/main/resources/log4j.properties

log4j.rootLogger=WARN, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n

Building the project

1. Writing the crawler
WebMagicSpider/src/main/java/BaiduPageProcessor.java

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.pipeline.JsonFilePipeline;
import us.codecraft.webmagic.processor.PageProcessor;

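public class BaiduPageProcessor implements PageProcessor {

    // Crawl settings: retry a failed download once, wait 1000 ms between pages, decode as UTF-8
    private final Site site = Site.me()
            .setRetryTimes(1)
            .setSleepTime(1000)
            .setCharset("utf-8");

    // Called for every downloaded page: store the text of the <title> element under the key "title"
    @Override
    public void process(Page page) {
        page.putField("title", page.getHtml().css("title", "text").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new BaiduPageProcessor())
                .addUrl("http://www.baidu.com/")
                // Print the extracted fields to the console
                .addPipeline(new ConsolePipeline())
                // Also write them as JSON files under this directory (adjust the path for your machine)
                .addPipeline(new JsonFilePipeline("/Users/qmp/myproject/WebMagicSpider"))
                .thread(1) // single worker thread
                .run();    // runs the crawl in the current thread
    }
}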
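
The Site settings above ask for one retry on a failed download and a 1000 ms pause between pages. As a rough illustration of what such a policy means, here is a minimal plain-Java sketch. It is not how WebMagic implements these settings internally; the class PoliteFetcher and its fetch method are hypothetical names, and the download is simulated with a lambda rather than a real HTTP request.

```java
import java.util.concurrent.Callable;

// Hypothetical helper; not part of WebMagic.
class PoliteFetcher {
    private final int retryTimes;   // extra attempts allowed after a failed download
    private final long sleepMillis; // pause after each page, successful or not

    PoliteFetcher(int retryTimes, long sleepMillis) {
        this.retryTimes = retryTimes;
        this.sleepMillis = sleepMillis;
    }

    // Tries the download up to 1 + retryTimes times, then pauses.
    // Returns null if every attempt failed.
    <T> T fetch(Callable<T> download) {
        T result = null;
        for (int attempt = 0; attempt <= retryTimes; attempt++) {
            try {
                result = download.call();
                break; // success: stop retrying
            } catch (Exception e) {
                // failed attempt: fall through to the next one, if any remain
            }
        }
        try {
            Thread.sleep(sleepMillis); // politeness delay before the next page
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return result;
    }

    public static void main(String[] args) {
        PoliteFetcher fetcher = new PoliteFetcher(1, 1000);
        int[] calls = {0};
        // Simulated download: the first call fails, the second succeeds.
        String body = fetcher.fetch(() -> {
            if (calls[0]++ == 0) {
                throw new java.io.IOException("simulated timeout");
            }
            return "<title>ok</title>";
        });
        System.out.println(body + " after " + calls[0] + " attempts");
    }
}
```

With retryTimes set to 1, a page that fails twice in a row is given up on, which is why the sketch returns null in that case instead of throwing.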

2. Running the program

Console output

get page: http://www.baidu.com/
title:	百度一下,你就知道

File output

{"title":"百度一下,你就知道"}