One day, a channel's CDN traffic suddenly jumped by several hundred megabytes during a certain time window, so I analyzed the CDN logs to find out which URL was consuming the most traffic. The output format is: URL: <access address> count: <number of hits for this URL> flow: <total traffic in M>. The approach: for each log record, split out the URL as the key and the traffic as the value, accumulate them in a dictionary, then sort and print.
Note: the channel log is about 8 GB and my machine only has 4 GB of RAM. I used the readlines method, so processing was slow and took several minutes to finish; there is plenty of room for optimization. Below is the code for statistics_flow.py. The results are printed straight to the screen; redirect them to a text file if you need to keep them. I hope the script's approach and style give you some hints or help.
#!/usr/bin/python
#coding:utf-8
#Author by Qfeian@20140310
"""
Usage: python statistics_flow.py log_path
"""
import sys
from operator import itemgetter

if len(sys.argv) < 2:
    print __doc__
    sys.exit(1)

log = sys.argv[1]
url_flow = {}   # url -> accumulated flow
url_num = {}    # url -> hit count

def sort_kv(d, desc=False):
    # sort dictionary items by value; desc=True gives descending order
    return sorted(d.iteritems(), key=itemgetter(1), reverse=desc)

f = open(log, 'r')
for line in f.readlines():
    fields = line.split()
    url = fields[6]
    flow = fields[9]
    if url in url_flow:
        url_num[url] += 1
        url_flow[url] += int(flow)
    else:
        url_num[url] = 1
        url_flow[url] = int(flow)
f.close()

# sort by flow (descending) and print
sort_flow = sort_kv(url_flow, True)
for url, flow in sort_flow:
    print "URL: %s count: %d the flow: %.3fM" % (url, url_num[url], float(flow) / 1000)

# the total number of distinct URLs
print "The total url is %d" % len(url_num)
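Since readlines pulls the whole 8 GB file into memory at once, one obvious optimization is to iterate the file object directly (one line at a time) and let collections.Counter do the accumulation. A minimal Python 3 sketch of that idea, assuming the same whitespace-separated log format with the URL in field 7 and the flow value in field 10; the sample lines and the statistics_flow function name are hypothetical, not from the original script:

```python
import io
from collections import Counter
from operator import itemgetter

def statistics_flow(lines):
    """Accumulate hit count and total flow per URL from an iterable of log lines."""
    url_num = Counter()
    url_flow = Counter()
    for line in lines:            # a file object yields one line at a time
        fields = line.split()
        if len(fields) < 10:      # skip malformed records
            continue
        url, flow = fields[6], fields[9]
        url_num[url] += 1
        url_flow[url] += int(flow)
    return url_num, url_flow

# Hypothetical log lines: fields 1-6 and 8-9 are placeholders,
# field 7 is the URL, field 10 is the flow value.
sample = io.StringIO(
    "a b c d e f /video/1.ts g h 500\n"
    "a b c d e f /video/2.ts g h 300\n"
    "a b c d e f /video/1.ts g h 700\n"
)
num, flow = statistics_flow(sample)
for url, total in sorted(flow.items(), key=itemgetter(1), reverse=True):
    print("URL: %s count: %d flow: %.3fM" % (url, num[url], total / 1000))
```

Because the loop only ever holds one line plus the two counters, peak memory depends on the number of distinct URLs rather than the size of the log file.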
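Another small optimization: when you only care about the handful of heaviest URLs rather than the full ranking, heapq.nlargest avoids sorting the entire dictionary (roughly O(n log k) instead of O(n log n)). A sketch, using made-up flow totals in place of the dictionary the script builds:

```python
import heapq
from operator import itemgetter

# Hypothetical per-URL flow totals, standing in for the script's url_flow dict.
url_flow = {"/a.ts": 900, "/b.ts": 1500, "/c.ts": 300, "/d.ts": 1200}

# Top 2 URLs by flow, without sorting the whole dict.
top2 = heapq.nlargest(2, url_flow.items(), key=itemgetter(1))
print(top2)   # [('/b.ts', 1500), ('/d.ts', 1200)]
```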