Requirements

Count the channel sources of new users on a game platform.

The log format is as follows:

 



Jul 23 0:00:47  [info] {SPR}gjzq{SPR}20130723000047{SPR}85493108{SPR}S1{SPR}{SPR}360wan-2j-reg{SPR}58.240.209.78{SPR}



 

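Analysis

The key to the problem is to identify the new users first.

New user: a user who has logged in to the platform only in July 2013.

Following the map/reduce idea, new users can be found as follows: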



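  • If a user appeared in a given month, emit (qid,year,month)=1
  • Aggregate by qid the number of months in which the user appeared, i.e. build (qid,count(year,month)) pairs
  • A new user has count(year,month)=1 and (year,month)=(2013,07).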
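The three steps above can be sketched in plain Python, outside Hive. This is only an illustration under assumptions: the function name `find_new_users`, the `(qid, year, month)` tuple layout and the sample rows are made up here and are not part of the original pipeline.

```python
from collections import defaultdict

def find_new_users(logins, target=(2013, 7)):
    """Return the qids whose only active month up to `target` is `target` itself.

    `logins` is an iterable of (qid, year, month) tuples, one per login record.
    """
    # Step 1: collapse repeated logins into distinct (year, month) pairs per qid.
    months_by_qid = defaultdict(set)
    for qid, year, month in logins:
        if (year, month) <= target:          # ignore months after the target month
            months_by_qid[qid].add((year, month))
    # Steps 2 and 3: exactly one distinct month, and that month is the target.
    return {qid for qid, months in months_by_qid.items() if months == {target}}

# Hypothetical sample data.
logins = [
    (1, 2013, 7), (1, 2013, 7),   # only July 2013: new
    (2, 2013, 6), (2, 2013, 7),   # also seen in June: not new
    (3, 2013, 6),                 # never seen in July: not new
    (4, 2013, 7), (4, 2013, 8),   # August is after the target and is ignored: new
]
print(sorted(find_new_users(logins)))  # [1, 4]
```

Keeping a set of months per qid does the monthly deduplication and the month count in one pass; the Hive version below splits the same idea into two tables.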



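Finding the source channel of new users

Source channel: a new user may log in to the platform several times in 201307, so we need the channel of the earliest login.

This is done in two steps:

  • Find all login records of the new users (qid,logintime,src)
  • For each qid, take the src of the record with the smallest logintime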
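A minimal Python sketch of the second step, assuming the login records are already restricted to new users. The function name `earliest_src` and the sample rows are hypothetical; the sketch relies on `logintime` being a fixed-width `YYYYMMDDHHMMSS` string, as in the sample log line, so that string comparison matches time order.

```python
def earliest_src(records):
    """Map each qid to the src of its earliest login.

    `records` is an iterable of (qid, logintime, src) tuples.
    """
    best = {}  # qid -> (logintime, src) of the earliest record seen so far
    for qid, logintime, src in records:
        if qid not in best or logintime < best[qid][0]:
            best[qid] = (logintime, src)
    return {qid: src for qid, (_logintime, src) in best.items()}

# Hypothetical sample data.
records = [
    (1, "20130723000047", "360wan-2j-reg"),
    (1, "20130701120000", "iwan-ng-mnsg"),
    (2, "20130715080000", "360wan-2j-reg"),
]
print(earliest_src(records))
```

This variant keeps a running minimum per qid and therefore does not need sorted input, unlike the sort-then-take-first approach used in the Hive implementation.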

Implementation

  1. Data preparation

    1) Create the table

  



create table if not exists glogin_daily (year int,month int,day int,hour int,logintime string,qid int,gkey string,skey string,loginip string,registsrc string,loginfrom string) partitioned by


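     The table is created from the log content and the fields we care about, and is partitioned by day.

   2) Load the data

    Because the log files live in several places, we first gather them into a temporary directory, create a temporary external table over that directory so Hive can read the data, and then split out the individual fields by regular-expression matching. (An external table pointed at the directory reads every file in it, so no per-file load is needed.)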




echo "==== load data into tmp table $TMP_TABLE ==="
"create external table $TMP_TABLE (info string) location '${TMP_DIR}';"
"==== M/R ==="
1-4`
5-6`
7-8`
"${CURR_YEAR}-${CURR_MONTH}-${CURR_DAY}"
"add file ${SCRIPT_PATH}/${MAP_SCRIPT_FILE};set hive.exec.dynamic.partition=true;insert overwrite table glogin_daily partition (dt='${dt}') select transform (t.i) using '$MAP_SCRIPT_PARSER ./${MAP_SCRIPT_FILE}' as (y,m,d,h,t,q,g,s,ip,src,f) from (select info as i from ${TMP_TABLE}) t;"


  

where filter_login.php is:


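    <?php
    // Mapper for Hive TRANSFORM: reads raw log lines from stdin and
    // writes one tab-separated record per matching line.
    $fr = fopen("php://stdin", "r");
    // Month abbreviation => month number (values inferred from the int `month` column).
    $month_dict = array(
        'Jan' => 1, 'Feb' => 2, 'Mar' => 3, 'Apr' => 4,
        'May' => 5, 'Jun' => 6, 'Jul' => 7, 'Aug' => 8,
        'Sep' => 9, 'Oct' => 10, 'Nov' => 11, 'Dec' => 12,
    );
    while (!feof($fr))
    {
        $input = fgets($fr, 256);
        $input = rtrim($input);
        //Jul 23 0:00:00  [info] {SPR}xxj{SPR}20130723000000{SPR}245396389{SPR}S9{SPR}iwan-ng-mnsg{SPR}cl-reg-xxj0if{SPR}221.5.67.136{SPR}
        if (preg_match("/([^ ]+) +(\d+) (\d+):.*\{SPR\}([^\{]*)\{SPR\}(\d+)\{SPR\}(\d+)\{SPR\}([^\{]*)\{SPR\}([^\{]*)\{SPR\}(([^\{]*)\{SPR\}([^\{]*)\{SPR\})?/", $input, $matches))
        {
            $year = substr($matches[5], 0, 4);
            // The trailing group is optional; default to empty strings when it is absent.
            $loginfrom = isset($matches[10]) ? $matches[10] : '';
            $loginip   = isset($matches[11]) ? $matches[11] : '';
            // Output order: year, month, day, hour, logintime, qid, gkey, skey, loginip, registsrc, loginfrom
            echo $year."\t".$month_dict[$matches[1]]."\t".$matches[2]."\t".$matches[3]."\t".$matches[5]."\t".$matches[6]."\t".$matches[4]."\t".$matches[7]."\t".$loginip."\t".$matches[8]."\t".$loginfrom."\n";
        }
    }
    fclose($fr);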
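For readers who want to check the field extraction outside Hive, the same pattern can be tried in Python. This is a sketch, not the author's code: `parse_login`, `MONTHS` and `LINE_RE` are names introduced here, and the optional tail of the pattern is written as a non-capturing group, so the group numbers differ from the PHP version.

```python
import re

# Month abbreviation -> month number.
MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Same pattern as the PHP script, with the optional tail made non-capturing.
LINE_RE = re.compile(
    r"([^ ]+) +(\d+) (\d+):.*"
    r"\{SPR\}([^{]*)\{SPR\}(\d+)\{SPR\}(\d+)\{SPR\}([^{]*)\{SPR\}([^{]*)\{SPR\}"
    r"(?:([^{]*)\{SPR\}([^{]*)\{SPR\})?"
)

def parse_login(line):
    """Return the login fields of one log line as a dict, or None if it does not match."""
    m = LINE_RE.search(line.rstrip())
    if m is None:
        return None
    (mon, day, hour, gkey, logintime, qid, skey,
     registsrc, loginfrom, loginip) = m.groups()
    if mon not in MONTHS:
        return None
    return {
        "year": int(logintime[:4]),
        "month": MONTHS[mon],
        "day": int(day),
        "hour": int(hour),
        "logintime": logintime,
        "qid": int(qid),
        "gkey": gkey,
        "skey": skey,
        "loginip": loginip or "",      # optional tail may be missing
        "registsrc": registsrc,
        "loginfrom": loginfrom or "",  # optional tail may be missing
    }

# The sample line from the log format section.
sample = ("Jul 23 0:00:47  [info] {SPR}gjzq{SPR}20130723000047{SPR}85493108"
          "{SPR}S1{SPR}{SPR}360wan-2j-reg{SPR}58.240.209.78{SPR}")
print(parse_login(sample))
```

On the sample line, `registsrc` comes out empty and `loginfrom` is `360wan-2j-reg`, which is the channel field used in the later steps.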


     

    2. Find the new users

     

    1) Deduplicate each user's platform logins by month

     





    create table distinct_login_monthly_tmp_07 as select qid,year,month from glogin_daily group by qid,year,month;



     
    2) Count the number of months in which each user logged in

     



    create table login_stat_monthly_tmp_07 as select qid,count(1) as c from distinct_login_monthly_tmp_07 where year<2013 or (year=2013 and month<=7) group by


     
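    Platform-level new users:
    1) Find the users whose login month count is 1;

    2) Check whether these users appear in July; if they do, collect every src they logged in from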

     



    create table new_player_monthly_07 as select distinct a.qid,b.src,b.logintime from (select qid from login_stat_monthly_tmp_07 where c=1) a join (select qid,loginfrom as src,logintime from glogin_daily where month=7 and year=2013) b on


     
    Find the src of the earliest login:

     



    add
    create table new_player_src_07 as select transform (t.qid,t.src,t.logintime) using 'php ./get_player_src.php' as (qid,src,logintime) from (select * from new_player_monthly_07 order by


     

    where get_player_src.php is:



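    <?php
    // Reducer for Hive TRANSFORM: keeps only the first row of each qid.
    // Assumes the input arrives ordered by qid and then by logintime,
    // so the first row of a qid is its earliest login.
    $fr = fopen("php://stdin", "r");
    $curr_qid = null;
    $curr_src = null;
    $curr_logintime = null;
    while (!feof($fr))
    {
        $input = fgets($fr, 1024);
        $input = rtrim($input);
        if ($input === '') continue;   // skip blank lines, including the empty read at EOF
        $arr   = explode("\t", $input);
        $qid   = trim($arr[0]);
        // A new qid starts here: emit its first (earliest) row.
        if (empty($curr_qid) || $curr_qid != $qid)
        {
            $curr_qid = $qid;
            echo $input."\n";
        }
    }
    fclose($fr);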
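    The streaming "first row per key" idea used by the script can also be sketched in Python with `itertools.groupby`. This is an illustration only: `first_row_per_qid` and the sample rows are hypothetical, and the sketch assumes, like the PHP script, that the rows arrive sorted by qid and then by logintime.

```python
from itertools import groupby

def first_row_per_qid(lines):
    """Yield the first line of each run of rows sharing the same qid (column 0)."""
    rows = (line.rstrip("\n") for line in lines)
    rows = (row for row in rows if row)  # skip blank lines
    for _qid, group in groupby(rows, key=lambda r: r.split("\t")[0].strip()):
        yield next(group)                # sorted input: first row is the earliest login

# Hypothetical rows in (qid, src, logintime) order, already sorted by qid and logintime.
lines = [
    "1\tiwan-ng-mnsg\t20130701120000\n",
    "1\t360wan-2j-reg\t20130723000047\n",
    "2\t360wan-2j-reg\t20130715080000\n",
    "\n",
]
for row in first_row_per_qid(lines):
    print(row)
```

    Because only the current key is held in memory, this works on arbitrarily large input, but it gives wrong results if rows for the same qid are not adjacent.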



     
    Number of platform-level new users:



    select count(*) from



     
    Platform-level new users per channel:





    create table new_player_src_stat_07 as select src,count(*) from new_player_src_07 group by
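    The per-channel summary is a plain group-and-count over one row per new user. A minimal Python sketch, assuming a `{qid: src}` mapping of each new user's earliest channel is already available (the name `count_by_src` and the sample data are made up for illustration):

```python
from collections import Counter

def count_by_src(earliest):
    """Count new users per channel from a {qid: src} mapping."""
    return Counter(earliest.values())

# Hypothetical mapping of new users to the channel of their earliest login.
earliest = {1: "iwan-ng-mnsg", 2: "360wan-2j-reg", 3: "360wan-2j-reg"}
for src, n in count_by_src(earliest).most_common():
    print(src, n)
```

    Counting over one row per qid matters here: counting over all login rows would weight a channel by how often its users log in rather than by how many new users it brought.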