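I had some free time, and a friend happened to ask me about web crawlers. He brought up CBA game data from the last couple of seasons, with the idea of running some analysis and maybe some "big data" work on it. That caught my interest, so I went ahead and built one. Below is the approach I took.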
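1. Choosing a data source
I don't actually follow the CBA. The data source I picked is the CBA section of a major Chinese portal site. The link is given below for anyone who wants to take a look.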
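2. Analyzing the data
Inspecting the page elements shows that the page is rendered on the server, so the data cannot be fetched directly from an API. The next step is to look at the HTML itself: all of the data sits inside tables, which makes things much simpler.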
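3. Settling on an approach
The approach is fairly simple: use a regular expression to extract every table row, strip out the markup that carries no data, and what remains is the data we want. Here I replaced the column delimiter in each row (the closing </td> and </th> tags) with "," so the rows can be saved as CSV.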
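The steps above can be sketched in plain Groovy, without the helper library used in the full listing. This is only an illustration under assumptions: the sample markup and the helper name `rowsToCsv` are hypothetical, and the real page's HTML may differ.

```groovy
// Hypothetical sample markup; the real page's HTML is not shown in the post.
def html = '''
<table>
  <tr><th>球队</th><th>第一节</th><th>总比分</th></tr>
  <tr class="odd">
    <td><a href="#">广州</a></td><td>33</td><td>133</td>
  </tr>
</table>
'''

// Hypothetical helper name; it mirrors the steps described above.
List<String> rowsToCsv(String page) {
    // Remove all whitespace so a row spread over several lines can be
    // matched by '.', which does not cross line breaks by default.
    String compact = page.replaceAll(/\s/, '')
    // Reluctant match: each hit is one complete <tr ...>...</tr> element.
    compact.findAll(/<tr.*?<\/tr>/).collect { String row ->
        String line = row.replaceAll(/<\/?tr.*?>/, '')   // drop the row tags
        line = line.replaceAll(/<\/t[dh]>/, ',')          // a closing cell tag ends a column
        line = line.replaceAll(/<.*?>/, '')               // strip any remaining tags
        // The last cell leaves a trailing comma behind; trim it.
        line.endsWith(',') ? line.substring(0, line.length() - 1) : line
    }
}

rowsToCsv(html).each { println it }
```

One side effect worth knowing: removing all whitespace also removes spaces inside the cell text, which is presumably why the date and time in the crawled output run together.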
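After filtering, the data looks like this. The crawler keeps the site's Chinese labels. The score table columns are team (球队), quarters one to four (第一节 to 第四节), and final score (总比分). The player table columns are starter (首发), player (球员), minutes played (出场时间), two-pointers (两分球), three-pointers (三分球), free throws (罚球), offensive (进攻), rebounds (篮板), assists (助攻), turnovers (失误), steals (抢断), fouls (犯规), blocks (盖帽), and points (得分), and each player table ends with a totals row (总计):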
球队,第一节,第二节,第三节,第四节,总比分
广州,33,37,36,27,133
北控,23,18,17,34,92
2019-01-1619:35:00轮次:31场序309开始比赛 比赛已结束
首发,球员,出场时间,两分球,三分球,罚球,进攻,篮板,助攻,失误,抢断,犯规,盖帽,得分
,张永鹏,25.8,7-9,0-0,1-1,4,8,3,0,0,1,0,15
,鞠明欣,19.1,2-4,1-2,0-0,2,5,2,2,0,1,0,7
,西热力江,25.5,1-1,4-8,0-0,1,2,4,1,3,1,0,14
,郭凯,15.5,2-2,0-0,0-0,2,3,0,2,0,2,0,4
,凯尔·弗格,38.1,5-9,5-9,11-11,0,10,12,2,2,4,0,36
,姚天一,12.3,0-1,1-4,0-0,0,1,5,0,0,0,0,3
,科里·杰弗森,24.0,4-4,2-4,3-4,0,6,0,1,0,1,1,17
,陈盈骏,22.6,1-1,2-7,1-1,0,2,4,2,1,2,0,9
,司坤,19.0,2-2,0-2,0-0,0,5,1,0,1,4,0,4
,孙鸣阳,20.6,2-3,0-0,3-3,1,4,1,2,3,4,0,7
,谷玥灼,7.4,1-1,1-2,0-0,0,0,2,0,0,0,0,5
,郑准,10.1,3-4,2-3,0-0,0,2,0,0,0,1,0,12
,总计,240.0,30-41(73.2%),18-41(43.9%),19-20(95.0%),10,48,34,12,10,21,1,133
首发,球员,出场时间,两分球,三分球,罚球,进攻,篮板,助攻,失误,抢断,犯规,盖帽,得分
,于梁,20.8,1-3,0-1,0-0,0,0,2,0,1,5,0,2
,于澍龙,17.9,0-1,1-3,0-0,0,2,1,2,0,1,0,3
,许梦君,46.2,1-3,5-12,0-0,1,6,2,1,0,3,0,17
,托马斯·罗宾逊,43.4,9-20,0-2,9-14,3,11,5,2,1,3,1,27
,杨敬敏,16.0,3-4,0-3,0-0,0,0,0,2,0,1,0,6
,孙贺男,2.8,0-0,0-0,0-0,0,0,0,1,0,1,0,0
,刘大鹏,28.0,1-1,3-5,0-0,1,4,3,2,2,3,0,11
,张铭浩,8.5,0-0,0-0,1-2,0,0,0,0,1,1,0,1
,张帆,27.5,5-7,1-3,0-0,0,1,6,4,1,2,0,13
,王征,23.3,3-3,0-0,6-8,0,2,0,0,1,1,1,12
,常亚松,5.6,0-0,0-1,0-0,0,1,0,1,2,0,0,0
,总计,240.0,23-42(54.8%),10-30(33.3%),16-24(66.7%),5,27,19,15,9,21,2,92
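For the analysis step, a row of this output can be turned back into named fields. The snippet below is a minimal sketch rather than part of the original crawler: the helper names `toRecord` and `madeAttempted` are hypothetical, and it assumes shooting cells such as "7-9" mean made-attempted, which matches the percentages in the totals rows (30-41 is 73.2%).

```groovy
// Column names copied from the header row of the sample output.
def header = '首发,球员,出场时间,两分球,三分球,罚球,进攻,篮板,助攻,失误,抢断,犯规,盖帽,得分'.split(',') as List
def row = ',张永鹏,25.8,7-9,0-0,1-1,4,8,3,0,0,1,0,15'

// Hypothetical helper: pair each column name with the value in the same position.
Map<String, String> toRecord(List<String> names, String line) {
    // A limit of -1 keeps empty fields, such as the blank first column.
    def values = line.split(',', -1) as List
    [names, values].transpose().collectEntries { k, v -> [(k): v] }
}

// Hypothetical helper: read "made-attempted" from cells like "7-9" or "30-41(73.2%)".
List<Integer> madeAttempted(String cell) {
    def m = (cell =~ /^(\d+)-(\d+)/)
    m.find() ? [m.group(1).toInteger(), m.group(2).toInteger()] : [0, 0]
}

def record = toRecord(header, row)
println "${record['球员']}: ${record['得分']} points, two-pointers ${madeAttempted(record['两分球'])}"
```

Keeping every value as a string at this stage avoids guessing at types; numeric conversion can happen per column once it is clear which cells carry percentages or other extra text.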
Here is my code:
package com.fun

import com.fun.frame.Save
import com.fun.frame.httpclient.FanLibrary
import com.fun.utils.Regex
import com.fun.utils.WriteRead

class sd extends FanLibrary {

    public static void main(String[] args) {
        def total = []
        // Crawl each game page in the range and collect all of its rows.
        range(300, 381).forEach {x ->
            total.addAll test(x)
        }
        Save.saveStringList(total, "total4.csv")
        testOver()
    }

    static def test(int i) {
        // If this game was already crawled, read the saved CSV instead of requesting the page again.
        if (new File(LONG_Path + "${i}.csv").exists()) return WriteRead.readTxtFileByLine(LONG_Path + "${i}.csv")
        String url = "http:///game/content/2017/${i}"
        def get = getHttpGet(url)
        def response = getHttpResponse(get)
        // Remove all whitespace so each <tr>...</tr> can be matched without crossing line breaks.
        def string = response.getString("content").replaceAll("\\s", EMPTY)
//        output(string)
        def all = Regex.regexAll(string, "<tr.*?<\\/tr>")
        def list = []
        all.forEach {x ->
            // Drop the <tr> tags, then turn every closing </td> or </th> into a comma.
            def info = x.replaceAll("</*?tr.*?>", EMPTY).replaceAll("</t(d|h)>", ",")
            // Strip whatever tags remain.
            info = info.replaceAll("<.*?>", EMPTY)
            // Trim the trailing comma left by the last cell; endsWith is also safe for empty rows.
            info = info.endsWith(",") ? info.substring(0, info.length() - 1) : info
            // Prepend a comma to the totals row so it lines up with the player rows.
            if (info.startsWith("总计")) info = "," + info
            list << info
            output(info)
        }
        Save.saveStringList(list, "${i}.csv")
        return list
    }

}