The timetable query feature of our school's academic affairs system is clunky, the server goes down frequently, and the pages are not usable on phones. So I wrote a Python crawler to scrape all the timetables and host them on my own server. This article is for learning purposes only.

Features

If the program is interrupted and rerun, it will not scrape duplicate timetables.

Timetable data is stored in a MySQL database, which makes it easy to manage.

Notes

Do not copy and paste the code from this article, because Python is extremely strict about indentation.

Please download the source code from my GitHub (the class timetable database file is included):

Web-based timetable query

Environment:

Windows 10 Pro

Python 3.5.1

WampServer 2.5

Installing WampServer

Storing the timetables requires a MySQL database, so download WampServer.

Go to the website and scroll down to the download section shown in the screenshot.

Choose the 32-bit or 64-bit installer according to your system; if you are not sure which one you need, download the 32-bit version.

After installation, run it.

Importing the Database Tables

After WampServer is running, left-click the WampServer icon in the system tray at the bottom right of the screen, then click phpMyAdmin to open the web-based database management page.



In the web interface, click New in the upper left and create a database named python.



Configure it with the settings shown in the screenshot.


Select the SQL file and import it into the database.

Installing Python

Click Download Python 3.5.1 to start the download.

During installation, remember to check Add python.exe to Path.

Installing Python Packages

After Python is installed, the following packages are needed.

Beautiful Soup 4, the HTML parsing package. Open a Command Prompt as administrator and run:

pip install beautifulsoup4

PyMySQL, the MySQL connection package.

Open a Command Prompt as administrator and run:

pip install PyMySQL
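
To confirm that both packages installed correctly, a quick import check works (this just verifies the modules load):

Python
# Both imports succeed only if the packages are installed
import bs4        # installed as beautifulsoup4
import pymysql    # installed as PyMySQL
print("both packages import fine")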

How It Works

Analyzing the Timetable Links of the Academic Affairs System

From the tail of the link, *** .aspx?xsbh=14040801&xq=15-16-2, the question mark shows that the page requests its data via GET. The first parameter is xsbh=14040801 and the second is xq=15-16-2. xq is clearly the pinyin initialism for the semester (xueqi), so xsbh should be the class code. This means we first need to collect all class codes and all semesters, assemble them into timetable links, and then crawl each one: http://202.193.177.7:8081/(S(oaml0r55uwfphs55ssjh43mh))/web_jxrw/cx_kb_bjkb_bj.aspx?xsbh=<classCode>&xq=<semester>
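
As a minimal sketch of that assembly step (the host matches the intranet address used by the scripts below, and the parameter values are just the example from above):

Python
from urllib.parse import urlencode

BASE = 'http://172.16.129.117/web_jxrw/cx_kb_bjkb_bj.aspx'

def timetable_url(class_code, semester):
    # xsbh is the class code, xq is the semester;
    # urlencode builds the GET query string from the pairs in order
    query = urlencode([('xsbh', class_code), ('xq', semester)])
    return BASE + '?' + query

print(timetable_url('14040801', '15-16-2'))
# http://172.16.129.117/web_jxrw/cx_kb_bjkb_bj.aspx?xsbh=14040801&xq=15-16-2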

Scraping All Semesters

Open the link and view the page source: every semester sits in an option tag under the select tag with ID DDxq, as shown in the screenshot.




Because an id attribute value must be unique within a page in front-end development, the Beautiful Soup selector for this spot is soup.find(id="DDxq").find_all("option").

With that, the beautifulsoup package can extract every semester.
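
To see how that selector behaves, here is a self-contained sketch against a trimmed-down stand-in for the page source (the option values are made up):

Python
from bs4 import BeautifulSoup

# A minimal stand-in for the real page source
html = '''
<select id="DDxq">
  <option value="15-16-1">15-16-1</option>
  <option value="15-16-2">15-16-2</option>
</select>
'''

soup = BeautifulSoup(html, 'html.parser')
# id is unique within a page, so find() lands on the select tag directly
for option in soup.find(id="DDxq").find_all("option"):
    print(option.get_text())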

Python Code: Scraping All Semesters

spiderForSemester.py
Python
import urllib.request
from bs4 import BeautifulSoup
import pymysql.cursors

connection = pymysql.connect(
    host='localhost',
    user='root',
    password='',
    db='python',
    charset="utf8",
    cursorclass=pymysql.cursors.DictCursor)

def sqlLink(semester):
    # Insert one semester string into the `semester` table
    try:
        with connection.cursor() as cursor:
            sql = "INSERT INTO `semester` (`semester`) VALUES (%s)"
            cursor.execute(sql, (semester,))
        connection.commit()
    finally:
        # connection.close()
        pass
    return

response = urllib.request.urlopen('http://172.16.129.117/web_jxrw/cx_kb_bjxzall.aspx')
if response.code == 200:
    print("succeed", response.code)
    html = response.read()
    soup = BeautifulSoup(
        html,
        'html.parser',
        from_encoding='gb2312')
    # Every semester sits in an option tag under the select tag with id DDxq
    arr = soup.find(id="DDxq").find_all("option")
    for x in arr:
        ch = x.get_text()
        print(ch)
        sqlLink(str(ch))
else:
    print("failed")

Scraping All Classes and Class Codes

Similarly, in the source of the timetable query page, every class sits under the select tag with ID Cxbj_all1_bjlist1 (see the screenshot), so beautifulsoup can grab them directly.


The value attribute of each option tag is the class code, and its text content is the class name. Again with beautifulsoup, the selector is soup.find(id="Cxbj_all1_bjlist1").find_all("option").
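
The full script below also derives each class's enrollment year from the first two digits of its class code. A sketch of that rule in isolation (the example codes are made up):

Python
def enrollment_year(class_code):
    # The first two digits of the class code encode the enrollment year;
    # "98" is the only pre-2000 cohort in this system, every other prefix means 20xx
    head = class_code[:2]
    return "19" + head if head == "98" else "20" + head

print(enrollment_year("98040801"))  # 1998
print(enrollment_year("14040801"))  # 2014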

Python Code: Scraping All Classes

Python
import urllib.request
from bs4 import BeautifulSoup
import pymysql.cursors

connection = pymysql.connect(
    host='localhost',
    user='root',
    password='',
    db='python',
    charset="utf8",
    cursorclass=pymysql.cursors.DictCursor)

def sqlLink(className, classCode, grade):
    # Insert one class record into the `class` table
    try:
        with connection.cursor() as cursor:
            sql = "INSERT INTO `class` (`className`,`classCode`,`grade`) VALUES (%s,%s,%s)"
            cursor.execute(sql, (className, classCode, grade))
        connection.commit()
    finally:
        # connection.close()
        pass
    return

response = urllib.request.urlopen('http://172.16.129.117/web_jxrw/cx_kb_bjxzall.aspx')
if response.code == 200:
    print("succeed", response.code)
    html = response.read()
    soup = BeautifulSoup(
        html,
        'html.parser',
        from_encoding='gb2312')
    # option value = class code, option text = class name
    arr = soup.find(id="Cxbj_all1_bjlist1").find_all("option")
    for x in arr:
        ch = x.get_text()
        classCode = x['value']
        # The first two digits of the class code encode the enrollment year
        head = classCode[:2]
        if head == "98":
            grade = "19" + head
        else:
            grade = "20" + head
        print(classCode)
        sqlLink(str(ch), classCode, grade)
else:
    print("failed")

Scraping All Timetables

With the class list and the semester list in hand, we can iterate over both to assemble each timetable query link and extract the courses from every page.


By the same reasoning, under the table tag with id GVkb, each day's courses sit in td tags.

The Python selector is soup.find(id="GVkb").find_all("td").
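
Note that find_all("td") returns the table cells as one flat list, so the script below rebuilds rows by taking the cells eight at a time: the section label plus the seven weekday columns. A sketch of that grouping with made-up cell values:

Python
# A flat list of cell texts, in the order find_all("td") yields them
cells = ["1-2", "Math", "none", "English", "none", "none", "none", "none",
         "3-4", "none", "Physics", "none", "none", "none", "none", "none"]

# Slice off one row (section label + seven weekdays) at a time
for i in range(0, len(cells), 8):
    row = cells[i:i + 8]
    print("section", row[0], "->", row[1:])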

Python Code: Scraping All Timetables

This step uses two Python files; name them exactly as noted. The first is spider.py and the second is spider_classTable.py, and both must sit in the same directory when run.

The first file, spider.py

Python
import pymysql.cursors
import spider_classTable

connection = pymysql.connect(
    host='localhost',
    user='root',
    password='',
    db='python',
    charset="utf8",
    cursorclass=pymysql.cursors.DictCursor)

try:
    # Get all semesters
    with connection.cursor() as cursor:
        sql = "SELECT * FROM `semester`"
        cursor.execute(sql)
        resultSemester = cursor.fetchall()
    # Get all class codes
    with connection.cursor() as cursor:
        sql = "SELECT `classCode` FROM `class`"
        cursor.execute(sql)
        resultClassCode = cursor.fetchall()
finally:
    # connection.close()
    pass

i = 0
for x in resultClassCode:
    classCode = x["classCode"]
    for y in resultSemester:
        semester = y["semester"]
        # First two digits: the class's enrollment year and the semester's start year
        gradeClassCode = int(classCode[:2])
        gradeSemester = int(semester[:2])
        try:
            with connection.cursor() as cursor:
                # Has this (class, semester) pair already been scraped?
                sql = "SELECT `classCode`,`semester` FROM `timetable` WHERE classCode=%s AND semester=%s"
                cursor.execute(sql, (classCode, semester))
                resultRepeat = cursor.fetchone()
        finally:
            # connection.close()
            pass
        # Only semesters within the class's four enrollment years are valid
        if (gradeSemester >= gradeClassCode) and (gradeSemester <= gradeClassCode + 3):
            if resultRepeat is None:
                arr = [classCode, semester]
                print(arr)
                spider_classTable.insertTameTable(classCode, semester)
            else:
                print("The timetable already exists")
        i += 1
        print(i, "Processing...")
The second file, spider_classTable.py
Python
import urllib.request
from bs4 import BeautifulSoup
import pymysql.cursors

def openLink(url):
    # Probe a URL: 0 on success, 1 on failure (kept from the original; currently unused)
    try:
        urllib.request.urlopen(url)
        return 0
    except Exception:
        return 1

def insertTameTable(classCode, semester):
    connection = pymysql.connect(
        host='localhost',
        user='root',
        password='',
        db='python',
        charset="utf8",
        cursorclass=pymysql.cursors.DictCursor)
    request = urllib.request.Request('http://172.16.129.117/web_jxrw/cx_kb_bjkb_bj.aspx?xsbh=' + classCode + '&xq=' + semester)
    # Retry until the request goes through; the server drops connections often
    flag = 1
    while flag == 1:
        try:
            print("opening link...")
            response = urllib.request.urlopen(request)
            flag = 0
        except Exception:
            print("failed to open the link, retrying...")
            flag = 1
    if response.code == 200:
        print("succeeded to open the url, code=", response.code)
        html = response.read()
        soup = BeautifulSoup(
            html,
            'html.parser',
            from_encoding='gbk')
        arr = soup.find(id="GVkb").find_all("td")
        note = soup.find(id="TextBox1")
        timeTable = []
        for x in arr:
            ch = x.get_text()
            # \xa0 is the non-breaking space an empty cell renders as
            strings = str(ch).replace("\xa0", "none") + "\n"
            timeTable.append(strings)
            # Every 8 cells form one row: the section label plus seven weekdays
            if len(timeTable) == 8:
                try:
                    with connection.cursor() as cursor:
                        sql = "INSERT INTO `timetable` (`classCode`,`semester`,`section`,`one`,`two`,`three`,`four`,`five`,`six`,`seven`) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                        cursor.execute(sql, (classCode, semester, timeTable[0], timeTable[1], timeTable[2], timeTable[3], timeTable[4], timeTable[5], timeTable[6], timeTable[7]))
                    connection.commit()
                finally:
                    timeTable = []
                    print("insert succeed!")
        # One note per page, stored in its own table
        try:
            with connection.cursor() as cursor:
                sql = "INSERT INTO `timetablenote` (`classCode`,`semester`,`note`) VALUES (%s,%s,%s)"
                cursor.execute(sql, (classCode, semester, note.get_text()))
            connection.commit()
        finally:
            print("note insert succeed!")
    else:
        print("failed to open the url, code=", response.code)

Running the Timetable Scraper

Then wait. It took me roughly 6.5 hours to scrape every timetable, so I suggest starting the run before going to bed and letting it finish overnight.