A small Python script for fetching web page resources


testqa_cn 2022-12-20 10:58:55 Category: Python applications


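© Copyright belongs to the author: this is an original work by testqa_cn on the 51CTO blog. Contact the author for permission to repost; unauthorized reposting may be subject to legal action.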

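#!/usr/bin/env python3
# coding: utf-8
import os
import urllib.request


def filter_src(file_name):
    """Collect the unique resource paths from log lines that contain '404'."""
    resource_list = []
    with open(file_name) as f_obj:
        for f_line in f_obj:
            if '404' not in f_line:
                continue
            # In this log format, field 8 (index 7) of the space-separated
            # line holds the requested path.
            str_goal = f_line.strip().split(' ')[7]
            if '/static' in str_goal:
                str_goal = str_goal.replace('/static', '')
            # Drop the field's trailing character, as the original script does.
            str_goal = str_goal[:-1]
            # Compare the normalized path so duplicates are actually skipped.
            if str_goal not in resource_list:
                print(str_goal)
                resource_list.append(str_goal)
    print(resource_list)
    return resource_list


def down_src(source_list):
    """Download each resource from base_url into the local down_path directory."""
    base_url = "http://www.ttcrm.com"
    down_path = "src"
    for source in source_list:
        source_url = base_url + source
        source_path = down_path + source
        print(source_url)
        # Make sure the target directory exists before writing.
        target_dir = os.path.dirname(source_path)
        if target_dir:
            os.makedirs(target_dir, exist_ok=True)
        with urllib.request.urlopen(source_url) as source_stream:
            # Open in binary mode so the bytes are written unchanged.
            with open(source_path, 'wb') as f_obj:
                f_obj.write(source_stream.read())


if __name__ == '__main__':
    file_name = 'src.txt'
    source_list = filter_src(file_name)
    down_src(source_list)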
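
To see what the filtering step does to a single log line, here is a minimal sketch. The post does not show the log layout, so the sample lines and the helper name `extract_path` are hypothetical: they simply put the request path in the eighth space-separated field, followed by one trailing character, which is what the script's indexing assumes.

```python
def extract_path(log_line):
    """Return the normalized resource path from one 404 log line, or None.

    Hypothetical helper that mirrors the indexing used in filter_src.
    """
    if '404' not in log_line:
        return None
    fields = log_line.strip().split(' ')
    if len(fields) < 8:
        return None  # line too short to contain the path field
    path = fields[7].replace('/static', '')
    return path[:-1]  # drop the single trailing character


# Made-up log lines, for illustration only.
sample_lines = [
    'a b c d e f g /static/css/main.css" 404 0',
    'a b c d e f g /static/css/main.css" 404 0',
    'a b c d e f g /static/js/app.js" 200 10',
]

unique_paths = []
for line in sample_lines:
    path = extract_path(line)
    # Skip lines without a 404 and paths that were already collected.
    if path is not None and path not in unique_paths:
        unique_paths.append(path)

print(unique_paths)
```

With a real log file, the field index and the trailing-character trim would need to be checked against the actual line layout before relying on the result.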

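The key point is that the file must be saved in binary mode ('wb'), so the downloaded bytes are written to disk unchanged!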

with open(source_path, 'wb') as f_obj:
    f_obj.write(source_stream.read())
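
Why binary mode matters can be checked without touching the network. The sketch below is only an illustration: the payload bytes and the temporary file names are made up. It writes the same bytes through a 'wb' handle and then tries a text-mode handle. In Python 3 a text-mode handle rejects bytes outright, while in Python 2 on Windows it would silently translate newline bytes and corrupt images or other binary resources.

```python
import os
import tempfile

# Made-up payload standing in for a downloaded resource (PNG-like header bytes).
payload = b'\x89PNG\r\n\x1a\n\x00\x01\x02'

with tempfile.TemporaryDirectory() as tmp_dir:
    binary_path = os.path.join(tmp_dir, 'resource.bin')

    # Binary mode: the bytes go to disk unchanged.
    with open(binary_path, 'wb') as f_obj:
        f_obj.write(payload)
    with open(binary_path, 'rb') as f_obj:
        round_trip = f_obj.read()

    # Text mode: Python 3 refuses to write bytes at all.
    text_path = os.path.join(tmp_dir, 'resource.txt')
    try:
        with open(text_path, 'w') as f_obj:
            f_obj.write(payload)
        text_mode_error = None
    except TypeError as exc:
        text_mode_error = exc

print(round_trip == payload)
print(type(text_mode_error).__name__)
```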
