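(1) GeccoEngine->run()
1. By default, the proxy set is loaded from the proxys file.
2. Set the scheduler: in loop mode, scheduler = new StartScheduler();
otherwise scheduler = new NoLoopStartScheduler();
3. If spiderBeanFactory is null, initialize it.
4. Set the count of cdl (the countdown latch) to the number of threads.
5. Convert the starts.json file into a List and add it to startRequests.
6. Iterate over startRequests and call scheduler.into(startRequest) for each one.
7. Initialize spiders = new ArrayList(threadCount).
8. Instantiate each Spider and start its thread.
9. Record the start time.
10. Monitor basic crawler information and export it through JMX.
11. In non-loop mode, wait for all threads to finish, then shut down.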
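The startup steps above can be sketched roughly as follows. This is a minimal illustration, not Gecco's actual code: MiniEngine, its in-memory queue, and the String requests are hypothetical stand-ins for the engine, the scheduler, and HttpRequest. It only shows how a latch sized to the thread count lets run() block until every worker has finished in non-loop mode.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the engine; not Gecco's actual class.
final class MiniEngine {
    // Plays the role of the scheduler: a thread-safe queue of pending requests.
    private final Queue<String> scheduler = new ConcurrentLinkedQueue<>();
    private final AtomicInteger processed = new AtomicInteger();

    // Seeds the queue, starts the workers, waits on the latch,
    // and returns how many requests were processed.
    int run(List<String> startRequests, int threadCount) {
        CountDownLatch cdl = new CountDownLatch(threadCount);  // step 4
        for (String startRequest : startRequests) {            // step 6
            scheduler.add(startRequest);
        }
        List<Thread> spiders = new ArrayList<>(threadCount);   // step 7
        for (int i = 0; i < threadCount; i++) {                // step 8
            Thread spider = new Thread(() -> {
                try {
                    // Non-loop mode: a worker stops once the queue is empty.
                    while (scheduler.poll() != null) {
                        processed.incrementAndGet();
                    }
                } finally {
                    cdl.countDown();
                }
            }, "spider-" + i);
            spiders.add(spider);
            spider.start();
        }
        try {
            cdl.await();                                       // step 11
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }
}
```

Calling new MiniEngine().run(Arrays.asList("a", "b", "c"), 2) processes all three requests on two threads and returns only after both threads have counted down.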
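(2) Walkthrough of currSpiderBeanClass = engine.getSpiderBeanFactory().matchSpider(request);
1. String url = request.getUrl();
2. Iterate over spiderBeans. spiderBeans comes from the factory created in step 3 of the run method above: spiderBeanFactory = new SpiderBeanFactory(classpath, pipelineFactory);
3. Match url against each key in spiderBeans. On a match, call request.setParam(param), where param is the match result, so the request carries the matched parameters. Each key in spiderBeans is the value of the matchUrl attribute of a @Gecco annotation.
4. Return the spider bean class mapped to the matching key.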
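To make the matching step concrete, here is a small sketch of how a matchUrl pattern with {name} placeholders could be matched against a URL. It is an assumption-laden illustration, not Gecco's own matcher: UrlPatternSketch, its regex-based approach, and the plain Map that stands in for request.setParam(param) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; Gecco's real matching code may work differently.
final class UrlPatternSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Returns the extracted parameters, or null when url does not match matchUrl.
    static Map<String, String> match(String url, String matchUrl) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(matchUrl);
        int last = 0;
        while (m.find()) {
            // Literal text is quoted; each placeholder becomes one capture group.
            regex.append(Pattern.quote(matchUrl.substring(last, m.start())));
            regex.append("([^/?#]+)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(matchUrl.substring(last)));
        Matcher um = Pattern.compile(regex.toString()).matcher(url);
        if (!um.matches()) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), um.group(i + 1));
        }
        return params;
    }

    // Steps 2-4: scan the registered patterns and return the first matching class.
    static Class<?> matchSpider(String url, Map<String, Class<?>> spiderBeans,
                                Map<String, String> requestParams) {
        for (Map.Entry<String, Class<?>> entry : spiderBeans.entrySet()) {
            Map<String, String> params = match(url, entry.getKey());
            if (params != null) {
                requestParams.putAll(params); // stands in for request.setParam(param)
                return entry.getValue();
            }
        }
        return null;
    }
}
```

With the pattern https://github.com/{user}/{project}, the URL https://github.com/xtuhcy/gecco yields user=xtuhcy and project=gecco, while a URL on another host yields null.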
(3) Walkthrough of HttpResponse response = currDownloader.download(request, timeout);
1. At startup, MonitorDownloaderFactory uses cglib to generate a proxy class for HttpClientDownloader. After the download method of HttpClientDownloader has been called, the proxy executes DownloadMonitor.incrSuccess(request.getUrl());
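The same interception idea can be shown without cglib. The sketch below swaps in the JDK's built-in dynamic proxy (java.lang.reflect.Proxy), which works on interfaces rather than on concrete classes as cglib does. PageDownloader, SuccessCounter, and MonitoredDownloaders are hypothetical names chosen for this example; they are not part of Gecco.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical downloader interface used only for this sketch.
interface PageDownloader {
    String download(String url);
}

// Hypothetical counter playing the role of a download monitor.
final class SuccessCounter {
    private static final Map<String, AtomicLong> SUCCESS = new ConcurrentHashMap<>();

    static void incrSuccess(String url) {
        SUCCESS.computeIfAbsent(url, k -> new AtomicLong()).incrementAndGet();
    }

    static long successCount(String url) {
        AtomicLong n = SUCCESS.get(url);
        return n == null ? 0 : n.get();
    }
}

final class MonitoredDownloaders {
    // Wraps target so that every download that returns normally is counted.
    static PageDownloader monitor(PageDownloader target) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object result;
            try {
                result = method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the downloader's own exception
            }
            if ("download".equals(method.getName())) {
                SuccessCounter.incrSuccess((String) args[0]);
            }
            return result;
        };
        return (PageDownloader) Proxy.newProxyInstance(
                PageDownloader.class.getClassLoader(),
                new Class<?>[] {PageDownloader.class},
                handler);
    }
}
```

Callers keep using the PageDownloader interface and never see the counting logic, which is the point of putting the monitor in a proxy instead of in the downloader itself.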
(4) Walkthrough of spiderBean = render.inject(currSpiderBeanClass, request, response);
Core method:
// Needs java.lang.reflect.Array, java.lang.reflect.Field, java.lang.reflect.Type and java.util.List.
private Object injectHtmlField(HttpRequest request, HttpResponse response, Field field, Class<? extends SpiderBean> clazz) {
    HtmlField htmlField = field.getAnnotation(HtmlField.class);
    String content = response.getContent();
    HtmlParser parser = new HtmlParser(request.getUrl(), content);
    // parser.setLogClass(clazz);
    String cssPath = htmlField.cssPath();
    Class<?> type = field.getType(); // declared type of the field
    boolean isArray = type.isArray(); // whether the field is an array
    boolean isList = ReflectUtils.haveSuperType(type, List.class); // whether the field is a List
    if (isList) {
        Type genericType = field.getGenericType(); // type including generic information
        Class genericClass = ReflectUtils.getGenericClass(genericType, 0); // element type of the List
        if (ReflectUtils.haveSuperType(genericClass, SpiderBean.class)) {
            // List<SpiderBean>
            return parser.$beanList(cssPath, request, genericClass);
        } else {
            // List of basic values
            try {
                return parser.$basicList(cssPath, field);
            } catch (Exception ex) {
                //throw new FieldRenderException(field, content, ex);
                FieldRenderException.log(field, content, ex);
            }
        }
    } else if (isArray) {
        Class genericClass = type.getComponentType();
        if (ReflectUtils.haveSuperType(genericClass, SpiderBean.class)) {
            // SpiderBean[]: render as a list, then copy into a typed array
            List<SpiderBean> list = parser.$beanList(cssPath, request, genericClass);
            SpiderBean[] a = (SpiderBean[]) Array.newInstance(genericClass, list.size());
            return list.toArray(a);
        } else {
            // array of basic values
            try {
                return parser.$basicList(cssPath, field).toArray();
            } catch (Exception ex) {
                //throw new FieldRenderException(field, content, ex);
                FieldRenderException.log(field, content, ex);
            }
        }
    } else {
        if (ReflectUtils.haveSuperType(type, SpiderBean.class)) {
            // SpiderBean
            return parser.$bean(cssPath, request, (Class<? extends SpiderBean>) type);
        } else {
            // single basic value
            try {
                return parser.$basic(cssPath, field);
            } catch (Exception ex) {
                //throw new FieldRenderException(field, content, ex);
                FieldRenderException.log(field, content, ex);
            }
        }
    }
    return null;
}
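The three-way branch in this method (List, array, single value) depends on plain Java reflection. The sketch below reproduces only that type inspection with JDK calls. FieldKindSketch and its Demo bean are hypothetical, and List.class.isAssignableFrom is used here on the assumption that it plays a role similar to ReflectUtils.haveSuperType.

```java
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

final class FieldKindSketch {
    // Hypothetical bean used only for demonstration.
    static class Demo {
        List<String> titles;
        int[] scores;
        String name;
    }

    // Classifies a field the same way the method branches: List, array, or single value.
    static String describe(Field field) {
        Class<?> type = field.getType();
        if (List.class.isAssignableFrom(type)) {
            // getGenericType keeps the type argument that getType erases.
            Type generic = field.getGenericType();
            if (generic instanceof ParameterizedType) {
                Type arg = ((ParameterizedType) generic).getActualTypeArguments()[0];
                if (arg instanceof Class) {
                    return "list of " + ((Class<?>) arg).getSimpleName();
                }
            }
            return "list of unknown";
        } else if (type.isArray()) {
            return "array of " + type.getComponentType().getSimpleName();
        }
        return "single " + type.getSimpleName();
    }

    // Convenience lookup on the Demo bean by field name.
    static String describe(String fieldName) {
        try {
            return describe(Demo.class.getDeclaredField(fieldName));
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("no such field: " + fieldName, e);
        }
    }
}
```

For the Demo bean, describe("titles") reports a list of String, describe("scores") an array of int, and describe("name") a single String.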
To tell whether a request is an ajax request, check whether its headers contain X-Requested-With: XMLHttpRequest.
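A minimal sketch of that check, assuming the headers are available as a plain Map (AjaxCheck is a hypothetical helper, not a Gecco class). Header names are compared case-insensitively. Note that this header is a convention set by libraries such as jQuery, so a request without it may still have been issued by a script.

```java
import java.util.Map;

final class AjaxCheck {
    // Returns true when the headers carry X-Requested-With: XMLHttpRequest.
    static boolean isAjax(Map<String, String> headers) {
        for (Map.Entry<String, String> header : headers.entrySet()) {
            if ("X-Requested-With".equalsIgnoreCase(header.getKey())
                    && "XMLHttpRequest".equalsIgnoreCase(header.getValue())) {
                return true;
            }
        }
        return false;
    }
}
```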