Data Cleaning in a Hadoop Offline Project
1. Background
1.1 Enterprise Big Data Project Development Workflow
- Project research: technology? business?
Project research is business-driven and differs by industry; it is mainly carried out by product managers, people who know the business extremely well, and project managers.
- Requirements analysis: pin down what to build and what it should look like (do not limit the work to any particular technology).
Requirements come first from the users: these are explicit, visible, and clearly state what is needed. Beyond that are implicit requirements, which developers have to work out themselves based on industry practice.
- Solution design
1. Conceptual design: the modules of the system and the feature points within each module.
2. Detailed design: down to how every feature is implemented, which technologies the system will use, which tables and modules each feature point touches; table fields, method names, and interfaces should all be defined, even class names and method names.
3. System design: scalability, fault tolerance, customizability, monitoring and alerting, and so on all belong here.
- Feature development (turning the design documents into code)
1. Development: the coding itself.
2. Testing (local environment): unit tests and developer self-testing; all unit tests are run automatically, and the code can only go live after they pass.
3. Testing (by the test team): functional testing, integration testing (joint testing across departments), performance testing (usually stress testing, which may require tuning resources in the production environment), and user testing (mainly user-experience testing).
- Deployment and release
1. Trial run: the new system runs alongside the old one, usually for at least a month, to make sure the two produce exactly the same results; at the same time compare their stability and differences to confirm the new system is stable.
2. Official release with a gray (canary) rollout, typically lasting a year to a year and a half (containers such as Docker may be used for out-of-the-box deployment, for example when machines with identical environments are needed).
- Later phases
Project phase two, three, and four, operations and maintenance, further feature development, and bug fixes.
1.2 Enterprise Big Data Application Platforms
- Data analysis
- In-house: a platform developed by the company itself, usually as secondary development on top of open-source frameworks; this makes it easier to build user profiles later on.
- Commercial: commercial software is easy to maintain, but the data sits with a third party, which is an obvious drawback.
- Search / crawlers
Including, but not limited to, tools such as ELK, HBase, Solr, and Lucene.
- Machine learning / deep learning
- Artificial intelligence
- Offline processing
Mining deeper value from large amounts of data that look unrelated on the surface.
- Real-time processing
2. Building a Big Data Development Project with Maven
1. Create a Maven project in IDEA
Using IDEA is not covered here; every developer should know it. The project structure after creation is shown in the figure below:
2. Configure the Maven pom file
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.7</maven.compiler.source>
<maven.compiler.target>1.7</maven.compiler.target>
<!-- Declare hadoop.version so that every Hadoop dependency added uses the same version -->
<hadoop.version>2.6.0-cdh5.7.0</hadoop.version>
<hive.version>1.1.0-cdh5.7.0</hive.version>
</properties>
<!-- Add the Cloudera repository: CDH artifacts are not in Maven Central, so this repository must be declared explicitly -->
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>
</repositories>
<dependencies>
<!-- Hadoop dependency -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>${hadoop.version}</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>${hive.version}</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
</dependencies>
The versions used for local development do not have to match the production versions, although matching them is better. A Maven project can be packaged as a fat jar (your own code plus all dependency jars) or a thin jar (only your own code, without the dependency jars); a thin jar is what is usually built.
3. Generating Data
Since this is a data-cleaning exercise, data is indispensable; the test data is generated with Java code.
package com.ruozedata.hadoop.utils;
import java.io.FileWriter;
import java.io.IOException;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.Random;
public class ProduceLog {
// generate the CDN vendor
public static String create_cdn(){
String stringarray[] = {"baidu","tencent","wangyi","ali","jingdong"};
Random random = new Random();
String result = stringarray[random.nextInt(5)];
return result;
}
// generate the URL
public static String create_url() {
String result = "";
Random random = new Random();
for (int i = 0; i <= 35; i++) {
String str = random.nextInt(2) % 2 == 0 ? "num" : "char";
if ("char".equalsIgnoreCase(str)) { // 产生字母
int nextInt = random.nextInt(2) % 2 == 0 ? 65 : 97;
result += (char) (nextInt + random.nextInt(26));
} else if ("num".equalsIgnoreCase(str)) { // 产生数字
result += String.valueOf(random.nextInt(10));
}
}
result = "http://v1.go2yd.com/user_upload/"+result;
return result;
}
/*
 * Convert a decimal integer into an IP address string
 */
public static String num2ip(int ip) {
int[] b = new int[4];
String x = "";
b[0] = (int) ((ip >> 24) & 0xff);
b[1] = (int) ((ip >> 16) & 0xff);
b[2] = (int) ((ip >> 8) & 0xff);
b[3] = (int) (ip & 0xff);
x = Integer.toString(b[0]) + "." + Integer.toString(b[1]) + "." + Integer.toString(b[2]) + "." + Integer.toString(b[3]);
return x;
}
// generate the IP address
public static String create_ip() {
// IP ranges (as decimal integers)
int[][] range = {{607649792, 608174079}, // 36.56.0.0-36.63.255.255
{1038614528, 1039007743}, // 61.232.0.0-61.237.255.255
{1783627776, 1784676351}, // 106.80.0.0-106.95.255.255
{2035023872, 2035154943}, // 121.76.0.0-121.77.255.255
{2078801920, 2079064063}, // 123.232.0.0-123.235.255.255
{-1950089216, -1948778497}, // 139.196.0.0-139.215.255.255
{-1425539072, -1425014785}, // 171.8.0.0-171.15.255.255
{-1236271104, -1235419137}, // 182.80.0.0-182.92.255.255
{-770113536, -768606209}, // 210.25.0.0-210.47.255.255
{-569376768, -564133889}, // 222.16.0.0-222.95.255.255
};
Random rdint = new Random();
int index = rdint.nextInt(10);
String ip = num2ip(range[index][0] + new Random().nextInt(range[index][1] - range[index][0]));
return ip;
}
// generate the access time
public static String create_time() {
String result_time = "";
try {
DateFormat targetFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
DateFormat sourceFormat = new SimpleDateFormat("yyyyMMddHHmmss");
Random rndYear = new Random();
int year = rndYear.nextInt(2) + 2017; // year: random integer in [2017, 2018]
Random rndMonth = new Random();
int int_month = rndMonth.nextInt(12) + 1;
String month = String.format("%02d",int_month); // month: random integer in [1, 12]
Random rndDay = new Random();
int int_Day = rndDay.nextInt(30) + 1; // day: random integer in [1, 30]
String Day = String.format("%02d",int_Day);
Random rndHour = new Random();
int int_hour = rndHour.nextInt(23); // hour: random integer in [0, 23)
String hour = String.format("%02d",int_hour);
Random rndMinute = new Random();
int int_minute = rndMinute.nextInt(60); // minute: random integer in [0, 59]
String minute = String.format("%02d",int_minute);
Random rndSecond = new Random();
int int_second = rndSecond.nextInt(60); // second: random integer in [0, 59]
String second = String.format("%02d",int_second);
result_time = String.valueOf(year)+month+Day+hour+minute+second;
result_time = targetFormat.format(sourceFormat.parse(result_time));
result_time = "["+result_time+" +0800]";
} catch (ParseException e) {
e.printStackTrace();
}
return result_time;
}
public static void main(String[] args) {
for(int i = 0 ; i <= 100; i++){
try {
String cdn = create_cdn();
String region = "CN";
String level = "E";
String time = create_time();
String ip = create_ip();
String domain = "v2.go2yd.com";
String url = create_url();
String traffic = String.valueOf((int)((Math.random()*9+1) * 100000));
StringBuilder builder = new StringBuilder("");
builder.append(cdn).append("\t")
.append(region).append("\t")
.append(level).append("\t")
.append(time).append("\t")
.append(ip).append("\t")
.append(domain).append("\t")
.append(url).append("\t")
.append(traffic);
String logdata = builder.toString();
FileWriter fileWriter = new FileWriter("D:\\LegainProject\\g6-train-hadoop\\g6-train-hadoop\\src\\test\\json\\log_data.log",true);
fileWriter.write(logdata+"\n");
fileWriter.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
#The generated data looks like the line below; fields are separated by tabs
tencent CN E [19/Mar/2018:01:06:25 +0800] 123.233.86.240 v2.go2yd.com http://v1.go2yd.com/user_upload/7s3POuq7263Vm61UM43x6u3z6yy614b0C89n 632866
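One refinement worth noting, not part of the original code: the main method above opens and closes a new FileWriter for every record. Below is a minimal sketch of the same write loop that keeps a single BufferedWriter open in a try-with-resources block; it would replace ProduceLog's main method, reuses the helper methods above, assumes an additional import of java.io.BufferedWriter, and uses a placeholder output file name.
// Hypothetical variant of ProduceLog.main: open the writer once and let
// try-with-resources close (and flush) it when the loop finishes.
public static void main(String[] args) throws IOException {
    try (BufferedWriter writer = new BufferedWriter(new FileWriter("log_data.log", true))) {
        for (int i = 0; i <= 100; i++) {
            String logdata = new StringBuilder()
                    .append(create_cdn()).append("\t")
                    .append("CN").append("\t")
                    .append("E").append("\t")
                    .append(create_time()).append("\t")
                    .append(create_ip()).append("\t")
                    .append("v2.go2yd.com").append("\t")
                    .append(create_url()).append("\t")
                    .append((int) ((Math.random() * 9 + 1) * 100000))
                    .toString();
            writer.write(logdata);
            writer.newLine();
        }
    }
}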
4. Data-Cleaning Utility Class
package com.ruozedata.hadoop.utils;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
public class LogUtils {
//source format: the Locale.ENGLISH argument to SimpleDateFormat must not be omitted
DateFormat sourceFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
//target format
DateFormat targetFormat = new SimpleDateFormat("yyyyMMddHHmmss");
/**
 * Parse one log line and reformat its fields;
 * the line is split on \t
 */
public String parse(String log) {
String result = "";
try {
String[] splits = log.split("\t");
String cdn = splits[0];
String region = splits[1];
String level = splits[2];
String timeStr = splits[3];
String time = timeStr.substring(1,timeStr.length()-7);
time = targetFormat.format(sourceFormat.parse(time));
String ip = splits[4];
String domain = splits[5];
String url = splits[6];
String traffic = splits[7];
StringBuilder builder = new StringBuilder("");
builder.append(cdn).append("\t")
.append(region).append("\t")
.append(level).append("\t")
.append(time).append("\t")
.append(ip).append("\t")
.append(domain).append("\t")
.append(url).append("\t")
.append(traffic);
result = builder.toString();
} catch (ParseException e) {
e.printStackTrace();
}
return result;
}
}
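The Locale.ENGLISH argument mentioned in the comment above deserves a quick illustration: the month abbreviation in the log timestamp (e.g. Jul) is an English token, so parsing it with the JVM's default locale fails on a machine whose default language is not English. A minimal standalone sketch (class and variable names are only illustrative):
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
public class LocaleDemo {
    public static void main(String[] args) throws ParseException {
        String raw = "17/Jul/2018:17:07:50";
        // Explicit English locale: "Jul" is parsed correctly on any system
        SimpleDateFormat english = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        System.out.println(english.parse(raw));
        // Default locale: on a Chinese (or other non-English) JVM this throws ParseException,
        // because the month abbreviation is expected in the local language
        SimpleDateFormat byDefault = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss");
        try {
            System.out.println(byDefault.parse(raw));
        } catch (ParseException e) {
            System.out.println("default-locale parse failed: " + e.getMessage());
        }
    }
}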
Test for the utility class
package com.ruozedata.hadoop.utils;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
public class TestLogUtils {
private LogUtils utils ;
@Test
public void testLogParse() {
String log = "baidu\tCN\tE\t[17/Jul/2018:17:07:50 +0800]\t223.104.18.110\tv2.go2yd.com\thttp://v1.go2yd.com/user_upload/1531633977627104fdecdc68fe7a2c4b96b2226fd3f4c.mp4_bd.mp4\t17168\t";
System.out.println(log.split("\t").length);
String result = utils.parse(log);
System.out.println(result);
}
@Before
public void setUp(){
utils = new LogUtils();
}
@After
public void tearDown(){
utils = null;
}
}
Test result
baidu CN E 20180717170750 223.104.18.110 v2.go2yd.com http://v1.go2yd.com/user_upload/1531633977627104fdecdc68fe7a2c4b96b2226fd3f4c.mp4_bd.mp4 17168
5. Data-Cleaning Development
1. The cleaning only involves the map phase, so there is only a Mapper:
package com.ruozedata.hadoop.mapreduce.mapper;
import com.ruozedata.hadoop.utils.LogUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class LogETLMapper extends Mapper<LongWritable,Text,NullWritable,Text>{
/**
 * Data cleaning is done in the map phase of the MapReduce framework:
 * each incoming record is cleaned according to our parsing rules and then emitted
 */
@Override
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
int length = value.toString().split("\t").length;
if(length == 8) {
LogUtils utils = new LogUtils();
String result = utils.parse(value.toString());
if(StringUtils.isNotBlank(result)) {
context.write(NullWritable.get(), new Text(result));
}
}
}
}
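A possible refinement, not part of the original mapper: use MapReduce counters to record how many lines pass or fail the cleaning rules, so data quality shows up directly in the job's counter summary. The class and counter names below are only illustrative; the sketch also reuses one LogUtils instance per map task instead of creating one per record.
package com.ruozedata.hadoop.mapreduce.mapper;
import com.ruozedata.hadoop.utils.LogUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
public class LogETLMapperWithCounters extends Mapper<LongWritable, Text, NullWritable, Text> {
    // Counters are aggregated by the framework and printed with the job summary
    enum LogQuality { VALID, MALFORMED }
    private final LogUtils utils = new LogUtils();
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        if (value.toString().split("\t").length == 8) {
            String result = utils.parse(value.toString());
            if (StringUtils.isNotBlank(result)) {
                context.getCounter(LogQuality.VALID).increment(1);
                context.write(NullWritable.get(), new Text(result));
                return;
            }
        }
        context.getCounter(LogQuality.MALFORMED).increment(1);
    }
}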
2. Developing the program entry point (driver)
package com.ruozedata.hadoop.mapreduce.driver;
import com.ruozedata.hadoop.mapreduce.mapper.LogETLMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class LogETLDriver {
public static void main(String[] args) throws Exception{
if(args.length != 2) {
System.err.println("please input 2 params: input output");
System.exit(0);
}
//input log directory
String input = args[0];
//output directory
String output = args[1];
Configuration configuration = new Configuration();
// if the output directory already exists, delete it
FileSystem fileSystem = FileSystem.get(configuration);
Path outputPath = new Path(output);
if(fileSystem.exists(outputPath)) {
fileSystem.delete(outputPath, true);
}
Job job = Job.getInstance(configuration);
job.setJarByClass(LogETLDriver.class);
job.setMapperClass(LogETLMapper.class);
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path(input));
FileOutputFormat.setOutputPath(job, new Path(output));
//FileOutputFormat.setCompressOutput(job, true); // enable output compression
//FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); // compression codec
job.waitForCompletion(true);
}
}
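A common variant of such a driver, not required by the original code, is to implement Hadoop's Tool interface and launch the job through ToolRunner; generic options such as -D settings (for example the compression switches commented out above) can then be passed on the command line instead of being hard-coded. A minimal sketch under that assumption (the class name is illustrative):
package com.ruozedata.hadoop.mapreduce.driver;
import com.ruozedata.hadoop.mapreduce.mapper.LogETLMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class LogETLToolDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("please input 2 params: input output");
            return 1;
        }
        Configuration configuration = getConf();   // already contains any -D options
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);   // remove an existing output directory
        }
        Job job = Job.getInstance(configuration);
        job.setJarByClass(LogETLToolDriver.class);
        job.setMapperClass(LogETLMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        return job.waitForCompletion(true) ? 0 : 1;  // non-zero exit code on failure
    }
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new LogETLToolDriver(), args));
    }
}
It would be launched like the original driver, for example: hadoop jar g6-hadoop-1.0.jar com.ruozedata.hadoop.mapreduce.driver.LogETLToolDriver -Dmapreduce.output.fileoutputformat.compress=true /input /output (the -D option here is only an example).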
3. Partial output after running
ali CN E 20180427223256 139.204.229.77 v2.go2yd.com http://v1.go2yd.com/user_upload/3ycnVNfKN1M72sou8M2YBl9R47Sw39Ghy9E9 357136
wangyi CN E 20170916143028 36.63.193.207 v2.go2yd.com http://v1.go2yd.com/user_upload/8ds0EPwSQI51VF9353zP4118pup7ps84122u 308845
wangyi CN E 20170823044319 121.76.22.38 v2.go2yd.com http://v1.go2yd.com/user_upload/227ZN2nqv833bHgzKX1V87OCgMCQaUSpb2Q0 480903
jingdong CN E 20180930073052 222.75.143.120 v2.go2yd.com http://v1.go2yd.com/user_upload/81q6657fyB62L29Cp1k6z24997Ej2iw6uar3 873249
baidu CN E 20180720164512 61.233.245.28 v2.go2yd.com http://v1.go2yd.com/user_upload/037q1B717a0m9725h12kVz06Tan7heJLdWd7 807397
Running the Hadoop code on a Windows machine hits environment problems; the fix:
Download the matching Windows runtime package: https://github.com/steveloughran/winutils
Put hadoop.dll and winutils.exe into the local %HADOOP_HOME%\bin directory, and also copy hadoop.dll into C:\Windows\System32.
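As an alternative to configuring %HADOOP_HOME% as an environment variable, the winutils location can also be set from code through the hadoop.home.dir system property before any Hadoop class is touched; a small sketch, where the class name and path are only placeholders for wherever the winutils package was unpacked:
// Hypothetical launcher for running the job locally on Windows: hadoop.home.dir must
// point at the directory whose bin\ subfolder contains winutils.exe, and it has to be
// set before the first Hadoop class initializes.
public class WindowsLocalRunner {
    public static void main(String[] args) throws Exception {
        System.setProperty("hadoop.home.dir", "D:\\hadoop-winutils"); // placeholder path
        com.ruozedata.hadoop.mapreduce.driver.LogETLDriver.main(args);
    }
}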
6. Testing in the Server Environment
1. Package the project into a jar; here a thin jar is built.
2. After packaging, upload the jar to the server cluster.
3. Run the jar on the server
#First upload the data file to be processed to HDFS
[hdfs@node1 scripts]$ hadoop fs -mkdir /input
[hdfs@node1 scripts]$ hadoop fs -put /opt/scripts/log_data.log /input
[hdfs@node1 scripts]$ hadoop fs -ls /input
Found 1 items
-rw-r--r-- 3 hdfs supergroup 14559 2019-04-15 15:46 /input/log_data.log
#Run the jar
[hdfs@node1 root]$ hadoop jar /opt/scripts/g6-hadoop-1.0.jar com.ruozedata.hadoop.mapreduce.driver.LogETLDriver /input /home/hadoop/data/output
Run result: the cleaned data is written to /home/hadoop/data/output.
7. Basic Statistical Analysis with Hive
1. Create an external table. The LOCATION must not point to the MapReduce job's output directory, because the job would overwrite it.
create external table hadoop_access (
cdn string,
region string,
level string,
time string,
ip string,
domain string,
url string,
traffic bigint
) partitioned by (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/home/hadoop/data/clear' ;
2. Move the data into the external table's directory and add the partition to the Hive table
hadoop fs -mkdir -p /home/hadoop/data/clear/day=20180717
hadoop fs -mv /home/hadoop/data/output/part-r-00000 /home/hadoop/data/clear/day=20180717
hive> alter table hadoop_access add if not exists partition(day='20180717');
OK
Time taken: 0.78 seconds
3. Use Hive to sum the traffic for each domain
hive> select domain,sum(traffic) from hadoop_access group by domain;
Query ID = hdfs_20190415160909_59a4e822-5aee-411e-b079-b07e1f678eb0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1547801454795_0012, Tracking URL = http://node1:8088/proxy/application_1547801454795_0012/
Kill Command = /opt/cloudera/parcels/CDH-5.13.1-1.cdh5.13.1.p0.2/lib/hadoop/bin/hadoop job -kill job_1547801454795_0012
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2019-04-15 16:09:54,840 Stage-1 map = 0%, reduce = 0%
2019-04-15 16:09:59,014 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.7 sec
2019-04-15 16:10:05,439 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.82 sec
MapReduce Total cumulative CPU time: 3 seconds 820 msec
Ended Job = job_1547801454795_0012
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.82 sec HDFS Read: 21511 HDFS Write: 22 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 820 msec
OK
v2.go2yd.com 53935102
Time taken: 19.111 seconds, Fetched: 1 row(s)
8. Automating the Data Refresh into Hive with a Script
#!/bin/bash
process_data=20180717
echo "step1:mapreduce etl"
#Following convention, write the output into a partition-style directory: append day=20180717 to the output path
hadoop jar /opt/scripts/g6-hadoop-1.0.jar com.ruozedata.hadoop.mapreduce.driver.LogETLDriver /input /home/hadoop/data/output/day=$process_data
#Move the data into the Hive table's directory
echo "step2:hdfsdata mv hive"
hadoop fs -rm -r /home/hadoop/data/clear/day=$process_data
hadoop fs -mkdir -p /home/hadoop/data/clear/day=$process_data
hadoop fs -mv /home/hadoop/data/output/day=$process_data/part-r-00000 /home/hadoop/data/clear/day=$process_data
echo "step3:Brush the metadata"
hive -e "alter table hadoop_access add if not exists partition(day=$process_data)"