1. Converting a blog web page at a URL to a PDF file in Java


2. The blog page URL used for testing:


3. Test results:



4. Project code structure:



5. Selected code:

// Requires jsoup on the classpath and: import org.jsoup.Jsoup;
public static String[] extractBlogInfo(String blogURL) throws Exception {
		// info[0] = title, info[1] = category, info[2] = post date, info[3] = article content as XHTML
		String[] info = new String[4];
		// A bare request fails with:
		// Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=javascript:void(0)/
//		org.jsoup.nodes.Document doc = Jsoup.connect(blogURL).get();
		// Crawling a site too fast gets the client blocked, so the request imitates a person browsing;
		// expect to fetch roughly one page every few seconds.
		// Reference: http://blog.sina.com.cn/s/blog_664fdc7e0102vesz.html
		org.jsoup.nodes.Document doc = Jsoup.connect(blogURL).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31").timeout(10000).get();

		// Title
		org.jsoup.nodes.Element e_title = doc.select("span.link_title").first();
		info[0] = e_title.text();

		// Category: strip the "作者同类文章X" widget text that the page renders inside the category block
		org.jsoup.nodes.Element category_r = doc.select("div.category_r").first();
		info[1] = category_r.text().replace("作者同类文章X", "");

		// Post date
		org.jsoup.nodes.Element e_date = doc.select("span.link_postdate").first();
		info[2] = e_date.text();

		// Article body, wrapped in an XHTML document that declares UTF-8 and the SimSun font
		// so that Chinese text shows up in the PDF
		org.jsoup.nodes.Element entry = doc.select("div.article_content").first();
		info[3] = formatContentTag(entry);
		info[3] = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
				+ "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">"
				+ "<html xmlns=\"http://www.w3.org/1999/xhtml\">  "
				+ "<head>  "
				+ "<style>  "
				+ "body{  "
				+ "font-family:SimSun;  "
				+ "font-size:14px;  "
				+ "}  "
				+ "</style>  "
				+ "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"></meta></head><body>" + info[3] + "</body></html>";

		System.out.println("info.toString():" + info[0] + ",\n" + info[1] + ",\n" + info[2] + ",\n" + info[3] + ",\n");
		return info;
	}
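
The method above returns its four results by position in a String[4] (title, category, post date, content). As a small optional sketch, a typed holder can make those positions explicit for callers. The class name `BlogInfo` and its field names are hypothetical and not part of the original project.

```java
/** Hypothetical typed view over the String[4] returned by extractBlogInfo. */
final class BlogInfo {
    final String title;        // info[0]
    final String category;     // info[1]
    final String postDate;     // info[2]
    final String contentXhtml; // info[3]

    private BlogInfo(String title, String category, String postDate, String contentXhtml) {
        this.title = title;
        this.category = category;
        this.postDate = postDate;
        this.contentXhtml = contentXhtml;
    }

    /** Copies the four positional fields into named fields; rejects arrays of any other length. */
    static BlogInfo fromArray(String[] info) {
        if (info == null || info.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: title, category, date, content");
        }
        return new BlogInfo(info[0], info[1], info[2], info[3]);
    }
}
```

A caller could then write `BlogInfo.fromArray(extractBlogInfo(url)).title` instead of indexing into the array.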




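6. Do not use a bare org.jsoup.nodes.Document doc = Jsoup.connect(blogURL).get();. Crawling a site too fast gets the client blocked (the bare call fails with HTTP 403), so the crawler has to imitate a person browsing: send a browser User-Agent and expect to fetch roughly one page every few seconds.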
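
The "one page every few seconds" pacing can be sketched with a small helper that enforces a minimum pause between requests. This is an illustration only: the class `RequestThrottle` and its method `awaitTurn` are hypothetical names, and the interval to use is an assumption that depends on the target site.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: enforces a minimum pause between consecutive requests. */
final class RequestThrottle {
    private final long minIntervalNanos;
    private long lastRequestNanos;
    private boolean firstCall = true;

    RequestThrottle(long minIntervalMillis) {
        if (minIntervalMillis < 0) {
            throw new IllegalArgumentException("interval must not be negative");
        }
        this.minIntervalNanos = TimeUnit.MILLISECONDS.toNanos(minIntervalMillis);
    }

    /** Returns immediately on the first call; later calls sleep until the interval has elapsed. */
    synchronized void awaitTurn() throws InterruptedException {
        if (!firstCall) {
            long remaining = minIntervalNanos - (System.nanoTime() - lastRequestNanos);
            if (remaining > 0) {
                TimeUnit.NANOSECONDS.sleep(remaining);
            }
        }
        firstCall = false;
        lastRequestNanos = System.nanoTime();
    }
}
```

In a crawl loop, calling `throttle.awaitTurn()` right before each `Jsoup.connect(url).userAgent(...).timeout(10000).get()` would space the requests out; a `new RequestThrottle(3000)` corresponds to roughly one page every three seconds.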


7. The page content must be wrapped in the following markup; otherwise Chinese characters are not displayed in the PDF.


<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<style>
body{
font-family:SimSun;
font-size:14px;
}
</style>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8"></meta></head><body><!-- article content (info[3]) goes here --></body></html>
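
The wrapper above can be kept out of the extraction method by building it in one place. The following is a minimal sketch; the class `XhtmlPage` and its method `wrap` are hypothetical names. It assumes the fragment passed in is already well-formed XHTML and does no escaping or tidying itself. Whichever PDF renderer is used will most likely also need access to the SimSun font file, which is not shown here.

```java
/** Hypothetical helper: wraps an XHTML fragment in the full document shown above. */
final class XhtmlPage {
    private XhtmlPage() {}

    /** Returns a complete XHTML document that declares UTF-8 and uses SimSun as the body font. */
    static String wrap(String bodyXhtml) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "<head>\n"
            + "<style>body{font-family:SimSun;font-size:14px;}</style>\n"
            + "<meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"></meta>\n"
            + "</head>\n"
            + "<body>" + bodyXhtml + "</body>\n"
            + "</html>";
    }
}
```

With this in place, the assignment in extractBlogInfo could shrink to `info[3] = XhtmlPage.wrap(formatContentTag(entry));`.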