Data Cleansing in a Hadoop Offline Project

1. Background

1.1 Development workflow of an enterprise big data project

  1. Project research: technology? business?
    Project research is business-oriented and tailored to each industry; it is mainly carried out by the product manager, people who know the business thoroughly, and the project manager.
  2. Requirements analysis: be clear about what to build and what it should end up looking like (do not tie the work to any particular technology).
    Requirements are first of all explicit: raised by the users, visible, and clearly stating what is needed. There are also implicit requirements, which the developers have to work out themselves based on common industry practice.
  3. Solution design
    1. Conceptual design: which modules the system consists of and which features each module contains.
    2. Detailed design: how every feature is implemented, which technologies the system uses, which tables and modules each feature touches; table fields, interfaces, even class and method names are all defined here.
    3. System design: scalability, fault tolerance, customizability, monitoring and alerting, and so on.
  4. Feature development (turning the design documents into code)
    1. Development: writing the code.
    2. Testing (local environment): unit tests and developer self-testing; all unit tests are run automatically, and only after they pass can the code move on.
    3. Testing (by testers): functional testing, integration testing (across departments), performance testing (usually stress testing, which may require tuning resources in production), and user testing (mainly user-experience testing).
  5. Deployment and launch
    1. Trial run: the new system runs alongside the old one, usually for at least a month, to make sure the two produce exactly the same results; at the same time, their stability and differences are compared to ensure the new system is stable.
    2. Official launch with a gray release, typically over a year or a year and a half (containers such as Docker, which work out of the box, may be used, for example when several machines with identical environments are needed).
  6. Afterwards
    Project phases two, three and four, operations and maintenance, further feature development, and bug fixes.

1.2 Enterprise big data application platforms

  1. Data analysis
    1. Self-built: a platform developed in-house on top of open-source frameworks, which makes it easier to build user profiles later on.
    2. Commercial: commercial software is easy to maintain, but the data lives with a third party, which is an obvious drawback.
  2. Search / crawling
    Including, but not limited to, tools such as ELK, HBase, Solr and Lucene.
  3. Machine learning / deep learning
  4. Artificial intelligence
  5. Offline processing
    Mining deeper value hidden in large amounts of data that look unrelated on the surface.
  6. Real-time processing

2. Building the big data project with Maven

1. Create a Maven project in IDEA

IDEA usage is not covered here (any developer has to know it); the project structure after creation is shown in the figure below:

(Screenshot: project structure of the newly created Maven project)

2. Maven pom configuration

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.7</maven.compiler.source>
    <maven.compiler.target>1.7</maven.compiler.target>
    <!-- Pin hadoop.version so that every Hadoop dependency added uses the same version -->
    <hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
    <hive.version>1.1.0-cdh5.7.0</hive.version>
  </properties>

  <!-- Add the Cloudera repository: CDH artifacts are not in Maven Central, so it has to be declared explicitly -->
  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
    </repository>
  </repositories>

  <dependencies>
    <!-- Hadoop client dependency -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoop.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>${hive.version}</version>
    </dependency>

    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

The versions used for local development do not have to match the production versions, although matching them is better. A Maven build can produce a fat jar (your code plus all dependency jars) or a thin jar (only your own code, no dependency jars); a thin jar is usually built, and that is what is used here.

3. Generating test data

Since this is about data cleansing, data is indispensable; the test data is generated with Java code:

package com.ruozedata.hadoop.utils;

import java.io.FileWriter;
import java.io.IOException;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.Random;

public class ProduceLog {
    // Generate a random CDN vendor
    public static String create_cdn(){
        String stringarray[] = {"baidu","tencent","wangyi","ali","jingdong"};
        Random random = new Random();
        String result = stringarray[random.nextInt(5)];
        return result;
    }
	
    // Generate a random URL (36 random letters and digits appended to a fixed prefix)
    public static String create_url() {
        String result = "";
        Random random = new Random();
        for (int i = 0; i <= 35; i++) {
            String str = random.nextInt(2) % 2 == 0 ? "num" : "char";
            if ("char".equalsIgnoreCase(str)) { // 产生字母
                int nextInt = random.nextInt(2) % 2 == 0 ? 65 : 97;
                result += (char) (nextInt + random.nextInt(26));
            } else if ("num".equalsIgnoreCase(str)) { // 产生数字
                result += String.valueOf(random.nextInt(10));
            }
        }
        result = "http://v1.go2yd.com/user_upload/"+result;
        return result;
    }


    /*
     * Convert a decimal integer into a dotted IP address string
     */
    public static String num2ip(int ip) {
        int[] b = new int[4];
        String x = "";
        b[0] = (int) ((ip >> 24) & 0xff);
        b[1] = (int) ((ip >> 16) & 0xff);
        b[2] = (int) ((ip >> 8) & 0xff);
        b[3] = (int) (ip & 0xff);
        x = Integer.toString(b[0]) + "." + Integer.toString(b[1]) + "." + Integer.toString(b[2]) + "." + Integer.toString(b[3]);

        return x;
    }
	
    // Generate a random IP address
    public static String create_ip() {
        // IP ranges, expressed as decimal integers
        int[][] range = {{607649792, 608174079}, // 36.56.0.0-36.63.255.255
                {1038614528, 1039007743}, // 61.232.0.0-61.237.255.255
                {1783627776, 1784676351}, // 106.80.0.0-106.95.255.255
                {2035023872, 2035154943}, // 121.76.0.0-121.77.255.255
                {2078801920, 2079064063}, // 123.232.0.0-123.235.255.255
                {-1950089216, -1948778497}, // 139.196.0.0-139.215.255.255
                {-1425539072, -1425014785}, // 171.8.0.0-171.15.255.255
                {-1236271104, -1235419137}, // 182.80.0.0-182.92.255.255
                {-770113536, -768606209}, // 210.25.0.0-210.47.255.255
                {-569376768, -564133889}, // 222.16.0.0-222.95.255.255
        };

        Random rdint = new Random();
        int index = rdint.nextInt(10);
        String ip = num2ip(range[index][0] + new Random().nextInt(range[index][1] - range[index][0]));
        return ip;
    }

    // Generate a random access time
    public static String create_time() {
        String result_time = "";
        try {
            DateFormat targetFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
            DateFormat sourceFormat = new SimpleDateFormat("yyyyMMddHHmmss");
            Random rndYear = new Random();
            int year = rndYear.nextInt(2) + 2017;            // year in [2017, 2018]
            Random rndMonth = new Random();
            int int_month = rndMonth.nextInt(12) + 1;
            String month = String.format("%02d", int_month); // month in [1, 12]
            Random rndDay = new Random();
            int int_Day = rndDay.nextInt(30) + 1;            // day in [1, 30] (may yield dates like Feb 30; SimpleDateFormat parses them leniently and rolls over)
            String Day = String.format("%02d", int_Day);
            Random rndHour = new Random();
            int int_hour = rndHour.nextInt(23);              // hour in [0, 22]
            String hour = String.format("%02d", int_hour);
            Random rndMinute = new Random();
            int int_minute = rndMinute.nextInt(60);          // minute in [0, 59]
            String minute = String.format("%02d", int_minute);
            Random rndSecond = new Random();
            int int_second = rndSecond.nextInt(60);          // second in [0, 59]
            String second = String.format("%02d", int_second);
            result_time = String.valueOf(year)+month+Day+hour+minute+second;
            result_time = targetFormat.format(sourceFormat.parse(result_time));
            result_time = "["+result_time+" +0800]";
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return result_time;
    }

    public static void main(String[] args) {

        for(int i = 0 ; i <= 100; i++){
            try {
                String cdn = create_cdn();
                String region = "CN";
                String level = "E";
                String time = create_time();
                String ip = create_ip();
                String domain = "v2.go2yd.com";
                String url = create_url();
                String traffic = String.valueOf((int)((Math.random()*9+1) * 100000));

                StringBuilder builder = new StringBuilder("");
                builder.append(cdn).append("\t")
                        .append(region).append("\t")
                        .append(level).append("\t")
                        .append(time).append("\t")
                        .append(ip).append("\t")
                        .append(domain).append("\t")
                        .append(url).append("\t")
                        .append(traffic);

                String logdata = builder.toString();

                FileWriter fileWriter = new FileWriter("D:\\LegainProject\\g6-train-hadoop\\g6-train-hadoop\\src\\test\\json\\log_data.log",true);
                fileWriter.write(logdata+"\n");
                fileWriter.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

    }

}
# A generated record looks like this; the fields are separated by tabs
tencent	CN	E	[19/Mar/2018:01:06:25 +0800]	123.233.86.240	v2.go2yd.com	http://v1.go2yd.com/user_upload/7s3POuq7263Vm61UM43x6u3z6yy614b0C89n	632866

4. The data cleansing utility class

package com.ruozedata.hadoop.utils;

import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class LogUtils {
    // Source format: do not drop Locale.ENGLISH, otherwise the English month abbreviation (e.g. "Jul") cannot be parsed
    DateFormat sourceFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    // Target format
    DateFormat targetFormat = new SimpleDateFormat("yyyyMMddHHmmss");


    /**
     * Parse one log line: split it on \t and normalize the fields,
     * in particular rewriting the timestamp into yyyyMMddHHmmss.
     */
    public String parse(String log) {
        String result = "";
        try {
            String[] splits = log.split("\t");
            String cdn = splits[0];
            String region = splits[1];
            String level = splits[2];
            String timeStr = splits[3];
            String time = timeStr.substring(1,timeStr.length()-7);
            time = targetFormat.format(sourceFormat.parse(time));
            String ip = splits[4];
            String domain = splits[5];
            String url = splits[6];
            String traffic = splits[7];


            StringBuilder builder = new StringBuilder("");
            builder.append(cdn).append("\t")
                    .append(region).append("\t")
                    .append(level).append("\t")
                    .append(time).append("\t")
                    .append(ip).append("\t")
                    .append(domain).append("\t")
                    .append(url).append("\t")
                    .append(traffic);

            result = builder.toString();
        } catch (ParseException e) {
            e.printStackTrace();
        }
        return result;
    }
}
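One caveat worth flagging: SimpleDateFormat is not thread-safe. In this project each map task creates its own LogUtils, so it is not a problem here, but if the formatters were ever shared across threads they could be wrapped in ThreadLocal. A minimal sketch of that idea (my own addition, not part of the original code; the class name SafeDateFormats is made up, and it targets Java 7 to match the pom settings):

package com.ruozedata.hadoop.utils;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical helper: one SimpleDateFormat per thread, since the class is not thread-safe.
public class SafeDateFormats {

    private static final ThreadLocal<SimpleDateFormat> SOURCE = new ThreadLocal<SimpleDateFormat>() {
        @Override
        protected SimpleDateFormat initialValue() {
            return new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        }
    };

    private static final ThreadLocal<SimpleDateFormat> TARGET = new ThreadLocal<SimpleDateFormat>() {
        @Override
        protected SimpleDateFormat initialValue() {
            return new SimpleDateFormat("yyyyMMddHHmmss");
        }
    };

    // Rewrites "17/Jul/2018:17:07:50" into "20180717170750".
    public static String reformat(String raw) throws ParseException {
        return TARGET.get().format(SOURCE.get().parse(raw));
    }
}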

Testing the utility class:

package com.ruozedata.hadoop.utils;

import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TestLogUtils {

    private LogUtils utils ;

    @Test
    public void testLogParse() {
        String log = "baidu\tCN\tE\t[17/Jul/2018:17:07:50 +0800]\t223.104.18.110\tv2.go2yd.com\thttp://v1.go2yd.com/user_upload/1531633977627104fdecdc68fe7a2c4b96b2226fd3f4c.mp4_bd.mp4\t17168\t";
        System.out.println(log.split("\t").length);
        String result = utils.parse(log);
        System.out.println(result);
    }

    @Before
    public void setUp(){
        utils = new LogUtils();
    }

    @After
    public void tearDown(){
        utils = null;
    }
}

Test output:

baidu	CN	E	20180717170750	223.104.18.110	v2.go2yd.com	http://v1.go2yd.com/user_upload/1531633977627104fdecdc68fe7a2c4b96b2226fd3f4c.mp4_bd.mp4	17168
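The test above only prints the field count and the parsed line; a variant with a real assertion (my own sketch; the class name TestLogUtilsAssert is made up) would actually fail if the parsing logic ever regresses:

package com.ruozedata.hadoop.utils;

import org.junit.Assert;
import org.junit.Test;

// Hypothetical assertion-based version of the test above.
public class TestLogUtilsAssert {

    @Test
    public void testLogParseProducesCleanRecord() {
        String log = "baidu\tCN\tE\t[17/Jul/2018:17:07:50 +0800]\t223.104.18.110\tv2.go2yd.com\thttp://v1.go2yd.com/user_upload/1531633977627104fdecdc68fe7a2c4b96b2226fd3f4c.mp4_bd.mp4\t17168";
        String expected = "baidu\tCN\tE\t20180717170750\t223.104.18.110\tv2.go2yd.com\thttp://v1.go2yd.com/user_upload/1531633977627104fdecdc68fe7a2c4b96b2226fd3f4c.mp4_bd.mp4\t17168";
        // The timestamp must be rewritten to yyyyMMddHHmmss and all other fields kept as-is.
        Assert.assertEquals(expected, new LogUtils().parse(log));
    }
}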

5. Developing the cleansing job

1. The cleansing step only involves the map phase, so there is only a Mapper:

package com.ruozedata.hadoop.mapreduce.mapper;

import com.ruozedata.hadoop.utils.LogUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class LogETLMapper extends Mapper<LongWritable,Text,NullWritable,Text>{

    /**
     * Data cleansing in the map phase of the MapReduce framework:
     * every incoming record is cleaned according to the parsing rules and then emitted.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        int length = value.toString().split("\t").length;
        if(length == 8) {
            LogUtils utils = new LogUtils();
            String result = utils.parse(value.toString());
            if(StringUtils.isNotBlank(result)) {
                context.write(NullWritable.get(), new Text(result));
            }
        }
    }
}
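For ETL jobs it is often useful to know how many records were dropped. Below is a sketch of the same mapper extended with MapReduce counters (my own addition, not part of the original project; the class name LogETLCounterMapper and the counter group/names "ETL", "CLEAN", "MALFORMED" are made up):

package com.ruozedata.hadoop.mapreduce.mapper;

import com.ruozedata.hadoop.utils.LogUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// Hypothetical variant of LogETLMapper that counts clean and malformed records.
public class LogETLCounterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    private final LogUtils utils = new LogUtils();   // one parser per map task is enough

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.split("\t").length == 8) {
            String result = utils.parse(line);
            if (StringUtils.isNotBlank(result)) {
                context.getCounter("ETL", "CLEAN").increment(1);   // record kept
                context.write(NullWritable.get(), new Text(result));
                return;
            }
        }
        context.getCounter("ETL", "MALFORMED").increment(1);       // record dropped
    }
}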

2. Developing the program entry point (driver):

package com.ruozedata.hadoop.mapreduce.driver;

import com.ruozedata.hadoop.mapreduce.mapper.LogETLMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogETLDriver {

    public static void main(String[] args) throws Exception{
        if (args.length != 2) {
            System.err.println("please input 2 params: input output");
            System.exit(1);
        }
        // input directory holding the raw logs
        String input = args[0];
        // output directory for the cleaned data
        String output = args[1];
        Configuration configuration = new Configuration();
        // If the output path already exists, delete it so the job can be rerun
        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path(output);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        Job job = Job.getInstance(configuration);
        job.setJarByClass(LogETLDriver.class);
        job.setMapperClass(LogETLMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));
        //FileOutputFormat.setCompressOutput(job, true);                  // enable output compression
        //FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // compression codec (gzip)

        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
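If the counter-based mapper sketched above is used, the driver can also print a short summary after the job finishes. A minimal helper (hypothetical; it assumes the counter names from the earlier sketch) that would be called between job.waitForCompletion(true) and System.exit in LogETLDriver:

package com.ruozedata.hadoop.mapreduce.driver;

import org.apache.hadoop.mapreduce.Job;

// Hypothetical helper: reads back the counters written by the mapper sketch above.
public class EtlJobReport {

    public static void print(Job job) throws Exception {
        long clean = job.getCounters().findCounter("ETL", "CLEAN").getValue();
        long malformed = job.getCounters().findCounter("ETL", "MALFORMED").getValue();
        System.out.println("clean records: " + clean + ", dropped records: " + malformed);
    }
}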

3. Part of the results after running:

ali	CN	E	20180427223256	139.204.229.77	v2.go2yd.com	http://v1.go2yd.com/user_upload/3ycnVNfKN1M72sou8M2YBl9R47Sw39Ghy9E9	357136
wangyi	CN	E	20170916143028	36.63.193.207	v2.go2yd.com	http://v1.go2yd.com/user_upload/8ds0EPwSQI51VF9353zP4118pup7ps84122u	308845
wangyi	CN	E	20170823044319	121.76.22.38	v2.go2yd.com	http://v1.go2yd.com/user_upload/227ZN2nqv833bHgzKX1V87OCgMCQaUSpb2Q0	480903
jingdong	CN	E	20180930073052	222.75.143.120	v2.go2yd.com	http://v1.go2yd.com/user_upload/81q6657fyB62L29Cp1k6z24997Ej2iw6uar3	873249
baidu	CN	E	20180720164512	61.233.245.28	v2.go2yd.com	http://v1.go2yd.com/user_upload/037q1B717a0m9725h12kVz06Tan7heJLdWd7	807397

Running the Hadoop code on a Windows machine will run into environment problems; the fix:

Download the matching Windows runtime binaries: https://github.com/steveloughran/winutils

Put the hadoop.dll and winutils.exe files into the local %HADOOP_HOME%\bin directory, and also copy hadoop.dll into the C:\Windows\System32 folder.
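If setting the HADOOP_HOME environment variable is inconvenient, an alternative that usually works is to set the hadoop.home.dir system property in code before the job starts. A sketch (my own addition; the D:\hadoop path and the WindowsLocalRunner class name are assumptions):

package com.ruozedata.hadoop.mapreduce.driver;

// Hypothetical local-debug entry point for Windows: points Hadoop at the directory
// that contains bin\winutils.exe, then delegates to the real driver.
public class WindowsLocalRunner {

    public static void main(String[] args) throws Exception {
        System.setProperty("hadoop.home.dir", "D:\\hadoop"); // assumed local path holding bin\winutils.exe
        LogETLDriver.main(args);
    }
}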

6. Testing in the server environment

1. Package the project as a jar; a thin jar is built here

(Screenshot: packaging the thin jar in IDEA)

2. Upload the packaged jar to the server cluster

(Screenshot: the jar uploaded to the cluster)

3. Run the jar on the server

# First upload the data file to be processed to HDFS
[hdfs@node1 scripts]$ hadoop fs -mkdir /input
[hdfs@node1 scripts]$ hadoop fs -put /opt/scripts/log_data.log /input
[hdfs@node1 scripts]$ hadoop fs -ls /input
Found 1 items
-rw-r--r--   3 hdfs supergroup      14559 2019-04-15 15:46 /input/log_data.log
# Run the jar
[hdfs@node1 root]$ hadoop jar /opt/scripts/g6-hadoop-1.0.jar com.ruozedata.hadoop.mapreduce.driver.LogETLDriver /input /home/hadoop/data/output

Run result:

(Screenshots: MapReduce job run and its output files on HDFS)

7. Basic statistical analysis with Hive

1. Create an external table. Its LOCATION must not point to the MapReduce job's output path, because the driver deletes and overwrites that path on every run.

create external table hadoop_access (
cdn string,
region string,
level string,
time string,
ip string,
domain string,
url string,
traffic bigint
) partitioned by (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/home/hadoop/data/clear' ;

2. Move the data into the external table's directory and register the partition in the Hive table

hadoop fs -mkdir -p /home/hadoop/data/clear/day=20180717
hadoop fs -mv /home/hadoop/data/output/part-r-00000 /home/hadoop/data/clear/day=20180717
hive> alter table hadoop_access add if not exists partition(day='20180717');
OK
Time taken: 0.78 seconds

3. Use Hive to sum the traffic for each domain

hive> select domain,sum(traffic) from hadoop_access group by domain;
Query ID = hdfs_20190415160909_59a4e822-5aee-411e-b079-b07e1f678eb0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1547801454795_0012, Tracking URL = http://node1:8088/proxy/application_1547801454795_0012/
Kill Command = /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/bin/hadoop job  -kill job_1547801454795_0012
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-04-15 16:09:54,840 Stage-1 map = 0%,  reduce = 0%
2019-04-15 16:09:59,014 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.7 sec
2019-04-15 16:10:05,439 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.82 sec
MapReduce Total cumulative CPU time: 3 seconds 820 msec
Ended Job = job_1547801454795_0012
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.82 sec   HDFS Read: 21511 HDFS Write: 22 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 820 msec
OK
v2.go2yd.com	53935102
Time taken: 19.111 seconds, Fetched: 1 row(s)

8. A script to automate refreshing the data into Hive

#!/bin/bash
process_data=20180717

echo "step1:mapreduce etl"
# As usual, write the output into a per-day partition directory: the output argument carries day=20180717
hadoop jar /opt/scripts/g6-hadoop-1.0.jar com.ruozedata.hadoop.mapreduce.driver.LogETLDriver /input /home/hadoop/data/output/day=$process_data

# Move the cleaned data into the Hive external table directory
echo "step2: move hdfs data into the hive directory"
hadoop fs -rm -r /home/hadoop/data/clear/day=$process_data
hadoop fs -mkdir -p /home/hadoop/data/clear/day=$process_data
hadoop fs -mv /home/hadoop/data/output/day=$process_data/part-r-00000 /home/hadoop/data/clear/day=$process_data

echo "step3:Brush the metadata"
hive -e "alter table hadoop_access add if not exists partition(day=$process_data)"