One day, during a certain time window, the CDN traffic for one channel spiked by several hundred MB. By analyzing the CDN logs we can find out which URL consumed the most traffic. Output format: URL: <access address>  count: <number of hits for the URL>  flow: <total traffic in M>. The approach: split each log record, use the URL as the key and the traffic as the value, accumulate them in a dictionary, then sort and print.

   Hint: the channel log is about 8G and my machine only has 4G of RAM. I used the readlines method, so processing is slow and takes several minutes; there is plenty of room for optimization. Below is the code for statistics_flow.py. The results are printed straight to the screen; if you need to keep them, redirect the output to a text file. I hope the script's approach and style give you some hints or help.
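As the hint above notes, readlines() materializes the entire 8G log in memory at once; iterating over the file object streams one line at a time instead. Below is a minimal, memory-friendly sketch of the same per-URL aggregation (written for Python 3), assuming the same whitespace-separated log layout with the URL in field 7 and the byte count in field 10; the sample records are hypothetical:

```python
from collections import Counter
import io

def aggregate(lines):
    """Accumulate hit count and total flow per URL, one line at a time."""
    url_num = Counter()
    url_flow = Counter()
    for line in lines:            # streams line by line: no readlines()
        fields = line.split()     # split once, reuse the fields
        if len(fields) < 10:      # skip malformed records
            continue
        url = fields[6]
        url_num[url] += 1
        url_flow[url] += int(fields[9])
    return url_num, url_flow

# Hypothetical sample records in the same column layout as the real log
sample = io.StringIO(
    "a b c d e f /video/1.ts g h 2048\n"
    "a b c d e f /video/1.ts g h 1024\n"
    "a b c d e f /index.html g h 512\n"
)
num, flow = aggregate(sample)
for url, total in flow.most_common():
    print("URL: %s count: %d flow: %.3fM" % (url, num[url], total / 1000.0))
# -> URL: /video/1.ts count: 2 flow: 3.072M
#    URL: /index.html count: 1 flow: 0.512M
```

Counter returns 0 for missing keys, so the if/else branches of the original script collapse into a single `+=`.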



#!/usr/bin/python
#coding:utf-8
#Author by Qfeian@20140310
"""
Usage
python statistics_flow.py log_path
"""
import sys
from operator import itemgetter
if len(sys.argv) < 2:
    print __doc__
    sys.exit(1)
log = sys.argv[1]
f = open(log, 'r')
url_flow = {}
url_num = {}

def sort_kv(d, rev=False):
    # Parameters renamed so they do not shadow the built-ins dict/str
    return sorted(d.iteritems(), key=itemgetter(1), reverse=rev)
# Iterate over the file object directly instead of readlines(),
# which would load the whole 8G log into memory at once
for line in f:
    fields = line.split()
    url = fields[6]
    flow = fields[9]
    if url in url_flow:
        url_num[url] += 1
        url_flow[url] += int(flow)
    else:
        url_num[url] = 1
        url_flow[url] = int(flow)
f.close()
# sort and print
sort_flow = sort_kv(url_flow, True)

for url, flow in sort_flow:
    print "URL: %s count: %d flow: %.3fM" % (url, url_num[url], float(flow) / 1000)
# The total number of URLs
print "The total url is %d" % len(url_num)
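The script uses dict.iteritems(), which was removed in Python 3; the same value sort works with dict.items(). A small standalone sketch of the sort step for Python 3 readers (the totals here are hypothetical):

```python
from operator import itemgetter

# Hypothetical per-URL byte totals, as built by the loop above
url_flow = {"/a.ts": 3000, "/b.ts": 9000, "/c.ts": 1500}

# Sort (url, flow) pairs by flow, largest first; Python 3 uses items()
sort_flow = sorted(url_flow.items(), key=itemgetter(1), reverse=True)
for url, flow in sort_flow:
    print("URL: %s flow: %.3fM" % (url, flow / 1000.0))
# -> URL: /b.ts flow: 9.000M
#    URL: /a.ts flow: 3.000M
#    URL: /c.ts flow: 0.500M
```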