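Here are some key topics for writing a crawler in Java:

HTTP protocol: A Java crawler needs to understand the HTTP protocol, including request methods, request headers, and response status codes.

HTML parsing: A Java crawler needs to parse HTML pages to extract the data it wants. Commonly used HTML parsing libraries include Jsoup and HtmlUnit.

HTTP request libraries: A Java crawler needs a library to send HTTP requests. Common choices include HttpURLConnection and OkHttp.

Data storage: A Java crawler needs to store the data it collects. Common storage options include files, databases, and caches.

Anti-crawling: To keep crawlers out, a site may use countermeasures such as CAPTCHAs, IP blocking, and User-Agent checks. A Java crawler has to cope with these measures.

Multithreading: A Java crawler can use multiple threads to increase crawl speed. Common tools include ThreadPoolExecutor and ForkJoinPool.

Scheduled tasks: A Java crawler often needs to crawl data on a schedule. Common tools include Quartz and ScheduledExecutorService.

Proxies: A Java crawler may send requests through a proxy to hide its own IP address and avoid being blocked. HTTP clients such as HttpClient and OkHttp support proxy configuration.

Those are some of the key topics for Java crawlers; hopefully they are helpful.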
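
As an illustration of the multithreading point, here is a minimal sketch of fetching several pages on a fixed-size thread pool. The class name `ConcurrentFetchSketch`, the method `fetchAll`, and the stub fetcher are assumptions made for this example; a real crawler would pass in a function that performs the HTTP request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not taken from any crawler framework.
class ConcurrentFetchSketch {

    /** Applies fetcher to every URL on a fixed-size pool; results keep the input order. */
    static List<String> fetchAll(List<String> urls, int threads, Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until all tasks finish and returns futures in task order
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://example.com/a", "http://example.com/b");
        // Stub fetcher (assumption): stands in for a real HTTP call
        List<String> pages = fetchAll(urls, 2, url -> "body of " + url);
        pages.forEach(System.out::println);
    }
}
```

The pool size caps how many requests run at once, which also helps keep the load on the target site under control.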

Java crawler knowledge gaps: a summary

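First, the difference between HttpClient and a browser

When a request is sent from a browser, the browser handles redirects, caching, and similar details for you. That is why, after submitting a form via POST in a browser, you receive the data the server finally returns no matter how the server redirects.

With HttpClient, however, the same request comes back with a 302, because by default HttpClient does not follow redirects for requests submitted via POST. What can be done about it?

Method 1: (handle it manually)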

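import org.apache.http.Header;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

CloseableHttpClient httpClient = HttpClients.createDefault();

// Placeholder URL: substitute the real host, port, and path
HttpPost httpPost = new HttpPost("http://ip:port/xxx");

CloseableHttpResponse response = httpClient.execute(httpPost);

int statusCode = response.getStatusLine().getStatusCode();
System.out.println("statusCode==" + statusCode); // status code, 302 for the redirect

// Redirect target; the header is null when the response is not a redirect
Header header = response.getFirstHeader("Location");
String location = header.getValue();
System.out.println(location);
response.close(); // release the connection before sending the next request

// Then send a new request to the redirect location
HttpGet httpGet = new HttpGet(location);
CloseableHttpResponse response2 = httpClient.execute(httpGet);
System.out.println("Response body: " + EntityUtils.toString(response2.getEntity(), "UTF-8"));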
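
One detail the manual approach has to watch for: the Location header may hold a relative reference such as `/home` rather than a full URL, in which case it cannot be passed to a new request as-is. The following is a minimal sketch of resolving it against the original request URL using only the JDK's `java.net.URI`. The class name `RedirectLocation` and the example URLs are made up for illustration.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of HttpClient.
class RedirectLocation {

    /** Resolves a Location header value against the URL of the request that produced it. */
    static String resolve(String requestUrl, String locationHeader) {
        // URI.resolve returns absolute locations unchanged and
        // completes relative ones using the request URL as the base
        return URI.create(requestUrl).resolve(locationHeader).toString();
    }

    public static void main(String[] args) {
        String requestUrl = "http://example.com/app/login";
        System.out.println(resolve(requestUrl, "/home"));                     // http://example.com/home
        System.out.println(resolve(requestUrl, "welcome?x=1"));               // http://example.com/app/welcome?x=1
        System.out.println(resolve(requestUrl, "https://other.example.org/")); // https://other.example.org/
    }
}
```

The resolved string can then be used as the target of the follow-up GET request.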

Method 2: (use an existing utility class)

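import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.client.LaxRedirectStrategy;
import org.apache.http.util.EntityUtils;

HttpClientBuilder builder = HttpClients.custom()
        .disableAutomaticRetries() // disables automatic retries of failed requests (this does not affect redirects)
        .setRedirectStrategy(new LaxRedirectStrategy()); // LaxRedirectStrategy also follows redirects for POST

CloseableHttpClient client = builder.build();

// Placeholder URL: substitute the real host, port, and path
HttpPost httpPost = new HttpPost("http://ip:port/xxx");

CloseableHttpResponse response = client.execute(httpPost);

int statusCode = response.getStatusLine().getStatusCode();
System.out.println("statusCode==" + statusCode); // status code of the final response

System.out.println("Response body: " + EntityUtils.toString(response.getEntity(), "UTF-8"));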

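Two ways to get cookies with HttpClient

1. Getting cookies with the old HttpClient API

P.S. This approach is no longer recommended officially; DefaultHttpClient is deprecated.

Instantiate the httpClient object with the DefaultHttpClient class: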

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.http.HttpResponse;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.message.BasicNameValuePair;

public static String doPost_deprecated(String url, Map<String, String> map, String charset) {
    String result = null;
    try {
        DefaultHttpClient httpClient = new DefaultHttpClient();
        HttpPost httpPost = new HttpPost(url);
        // Build the form parameters
        List<NameValuePair> list = new ArrayList<NameValuePair>();
        for (Map.Entry<String, String> elem : map.entrySet()) {
            list.add(new BasicNameValuePair(elem.getKey(), elem.getValue()));
        }
        if (list.size() > 0) {
            httpPost.setEntity(new UrlEncodedFormEntity(list, charset));
        }
        HttpResponse response = httpClient.execute(httpPost);
        System.out.println(response.getStatusLine().getStatusCode());
        String jsessionId = null;
        String cookieUser = null;
        // Get the cookies the client stored while handling the response
        CookieStore cookieStore = httpClient.getCookieStore();
        for (Cookie cookie : cookieStore.getCookies()) {
            // Print every cookie
            System.out.println(cookie);
            System.out.println("cookiename==" + cookie.getName());
            System.out.println("cookieValue==" + cookie.getValue());
            System.out.println("Domain==" + cookie.getDomain());
            System.out.println("Path==" + cookie.getPath());
            System.out.println("Version==" + cookie.getVersion());

            if ("JSESSIONID".equals(cookie.getName())) {
                jsessionId = cookie.getValue();
            }
            if ("cookie_user".equals(cookie.getName())) {
                cookieUser = cookie.getValue();
            }
        }
        // Return the session id only if the cookie_user cookie was also set
        if (cookieUser != null) {
            result = jsessionId;
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return result;
}
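
The parameter loop above turns a map into a URL-encoded form body. As a rough JDK-only illustration of what that body looks like, here is a sketch using `java.net.URLEncoder`. It assumes UTF-8 and Java 10 or later, and the class name `FormEncodingSketch` and the sample parameters are invented for this example; it is not how `UrlEncodedFormEntity` is implemented internally.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only.
class FormEncodingSketch {

    /** Joins the entries as key=value pairs separated by '&', URL-encoding each part. */
    static String encode(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is predictable
        Map<String, String> params = new LinkedHashMap<>();
        params.put("username", "alice");
        params.put("password", "a b&c");
        System.out.println(encode(params)); // username=alice&password=a+b%26c
    }
}
```

Characters that would break the form syntax, such as `&` inside a value, are escaped, which is why the parameters should not be concatenated by hand.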

2. Getting cookies with the new HttpClient API

Instantiate the httpClient object with the CloseableHttpClient class:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;

public static String doPost(Map<String, String> map, String charset) {
    String result = null;
    // The client records every cookie it receives in this store
    CookieStore cookieStore = new BasicCookieStore();
    try (CloseableHttpClient httpClient =
            HttpClients.custom().setDefaultCookieStore(cookieStore).build()) {
        HttpPost httpPost = new HttpPost("http://localhost:8080/testtoolmanagement/LoginServlet");
        // Build the form parameters
        List<NameValuePair> list = new ArrayList<NameValuePair>();
        for (Map.Entry<String, String> elem : map.entrySet()) {
            list.add(new BasicNameValuePair(elem.getKey(), elem.getValue()));
        }
        if (list.size() > 0) {
            httpPost.setEntity(new UrlEncodedFormEntity(list, charset));
        }
        try (CloseableHttpResponse response = httpClient.execute(httpPost)) {
            // Nothing to read here: the cookies are already in cookieStore;
            // closing the response releases the connection
        }
        String jsessionId = null;
        String cookieUser = null;
        for (Cookie cookie : cookieStore.getCookies()) {
            if ("JSESSIONID".equals(cookie.getName())) {
                jsessionId = cookie.getValue();
            }
            if ("cookie_user".equals(cookie.getName())) {
                cookieUser = cookie.getValue();
            }
        }
        // Return the session id only if the cookie_user cookie was also set
        if (cookieUser != null) {
            result = jsessionId;
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return result;
}
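
Both versions end with the same step: scanning the cookies for a given name. For comparison, here is a minimal JDK-only sketch of that lookup that works directly on raw `Set-Cookie` header values via `java.net.HttpCookie`. It is a swapped-in technique rather than part of HttpClient, and the class name `CookieLookupSketch` and the sample header values are assumptions for this example.

```java
import java.net.HttpCookie;
import java.util.List;
import java.util.Optional;

// Hypothetical helper for illustration; HttpClient's CookieStore does this parsing for you.
class CookieLookupSketch {

    /** Returns the value of the first cookie with the given name, if any header sets it. */
    static Optional<String> findCookieValue(List<String> setCookieHeaders, String name) {
        for (String header : setCookieHeaders) {
            // HttpCookie.parse throws IllegalArgumentException on a malformed header
            for (HttpCookie cookie : HttpCookie.parse(header)) {
                if (cookie.getName().equals(name)) {
                    return Optional.of(cookie.getValue());
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Sample header values (made up), as a login servlet might send them
        List<String> headers = List.of(
                "JSESSIONID=abc123; Path=/; HttpOnly",
                "cookie_user=alice; Path=/");
        System.out.println(findCookieValue(headers, "JSESSIONID").orElse("not set"));
        System.out.println(findCookieValue(headers, "cookie_user").orElse("not set"));
    }
}
```

Returning an `Optional` makes the "cookie not present" case explicit instead of relying on a null check.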