I. Hive Practice
1. Partition Tables
We have a data set like this:
6,APPLE,北京,35
2,APPLE,上海,47
Data masking: sensitive information in the original, real data is blurred, or parts of the data are removed outright.
id → was a uuid → became an auto-increment style id
phone model → was the exact brand and model → only the brand is kept
city → was country-county-city-district codes (numeric) → only the city is kept
age → was the national ID number → converted into an age after masking
Masking then continues: some records are deleted and some fields are dropped ...
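A rough sketch of this kind of masking in HiveQL, assuming a hypothetical raw table tb_phone_raw with columns raw_uuid, phone_model, city_name, and id_card (these names are not from the original data set and are only for illustration):
-- hypothetical masking query: keep an auto-increment style id, the brand, the city, and a derived age
select
row_number() over (order by raw_uuid) as id,  -- uuid -> auto-increment style id
split(phone_model, ' ')[0] as phone,          -- keep only the brand
city_name as city,                            -- keep only the city level
year(from_unixtime(unix_timestamp())) - cast(substr(id_card, 7, 4) as int) as age  -- birth year sits at positions 7-10 of the ID number
from tb_phone_raw;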
Data volume: 550K × 2, i.e. more than a million rows.
We run the following command in Hive:
select count(*) from table where city='上海'
Hive matches from the first row all the way to the last, performing about a million comparisons.
Is there a better solution? Yes: partition tables (a deliberately "dumb" but effective way to solve this problem).
The original structure inside the Hive warehouse:
/user/hive/warehouse/
|-db_lianjia.db/
|- tb_lj/
|-2nd.csv
For data beyond a million rows, can a directory layout like this still meet the business requirements? How can it be optimized?
appleinbeijing.txt 550K+ rows
appleinshanghai.txt 550K+ rows
/user/hive/warehouse/db_phone.db/tb_phone_info/appleinbeijing.txt
/user/hive/warehouse/db_phone.db/tb_phone_info/appleinshanghai.txt
When we run where city='上海', it is obvious that not a single row in appleinbeijing.txt can match.
/user/hive/warehouse/
|-db_phone/
|-tb_phone_info/
|-dir_city=shanghai/
appleinshanghai.txt
|-dir_city=beijing/
appleinbeijing.txt
With where dir_city=shanghai, Hive only goes into the dir_city=shanghai/ directory to look for data.
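A minimal sketch of this single-level layout in HiveQL (the table name here is hypothetical and simplified; the table actually built below uses the partition columns citystr and phonestr):
create table tb_phone_by_city(
id string,
phone string,
age int
)
partitioned by (dir_city string)
row format delimited fields terminated by ',';
-- only the dir_city=shanghai directory is scanned for this query
select count(*) from tb_phone_by_city where dir_city='shanghai';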
Now suppose the data also contains:
sales records of Apple phones in Beijing
sales records of Apple phones in Shanghai
sales records of Huawei phones in Guangzhou
We can then partition on two levels:
|-phone=apple
|-city=shanghai
|-appleinshanghai1.txt
|-appleinshanghai2.txt
|-appleinshanghai3.txt
|-…
|-city=beijing
|-city=guangzhou
|-phone=huawei
|-phone=xiaomi
…
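Each directory level corresponds to one partition column, and each leaf directory corresponds to one partition. With the two-level table built in the hands-on part below, the partitions can be listed and extended like this (the guangzhou/huawei values are only illustrative):
show partitions db_phone.tb_phone_info;
-- register one more partition directory by hand, e.g. Huawei sales in Guangzhou
alter table db_phone.tb_phone_info add if not exists partition (citystr='guangzhou', phonestr='huawei');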
True knowledge comes from practice. Let's try it out:
hive (default)> show databases;
OK
database_name
db_hive
db_lianjia
default
empmanager
Time taken: 6.617 seconds, Fetched: 4 row(s)
hive (default)> create database db_phone;
OK
Time taken: 0.233 seconds
hive (default)> use db_phone;
OK
Time taken: 0.072 seconds
hive (db_phone)> create table db_phone.tb_phone_info(
> id string,
> phone string,
> city string,
> age int
> )
> partitioned by (citystr string,phonestr string)
> row format delimited fields terminated by ',';
OK
Time taken: 0.215 seconds
hive (db_phone)> load data local inpath '/opt/datas/appleinbeijing.txt' overwrite into table db_phone.tb_phone_info partition(citystr='beijing',phonestr='apple');
Loading data to table db_phone.tb_phone_info partition (citystr=beijing, phonestr=apple)
Partition db_phone.tb_phone_info{citystr=beijing, phonestr=apple} stats: [numFiles=1, numRows=0, totalSize=13767756, rawDataSize=0]
OK
Time taken: 2.096 seconds
hive (db_phone)> load data local inpath '/opt/datas/appleinshanghai.txt' overwrite into table db_phone.tb_phone_info partition(citystr='shanghai',phonestr='apple');
Loading data to table db_phone.tb_phone_info partition (citystr=shanghai, phonestr=apple)
Partition db_phone.tb_phone_info{citystr=shanghai, phonestr=apple} stats: [numFiles=1, numRows=0, totalSize=13832187, rawDataSize=0]
OK
Time taken: 1.258 seconds
hive (db_phone)> select * from db_phone.tb_phone_info limit 5;
OK
tb_phone_info.id  tb_phone_info.phone  tb_phone_info.city  tb_phone_info.age  tb_phone_info.citystr  tb_phone_info.phonestr
6 APPLE 北京 35 beijing apple
23 APPLE 北京 39 beijing apple
28 APPLE 北京 26 beijing apple
30 APPLE 北京 25 beijing apple
42 APPLE 北京 42 beijing apple
Time taken: 0.36 seconds, Fetched: 5 row(s)
hive (db_phone)> set hive.mapred.mode;
hive.mapred.mode=nonstrict
hive (db_phone)> set hive.mapred.mode=strict;
hive (db_phone)> select * from db_phone.tb_phone_info limit 5;
FAILED: SemanticException [Error 10041]: No partition predicate found for Alias "tb_phone_info" Table "tb_phone_info"
hive (db_phone)> select count(*) from db_phone.tb_phone_info where citystr='beijing' and phonestr='apple';
Query ID = hadoop_20200727084444_708191e1-9c37-45a4-b9d2-223772ec82fa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1595808781664_0001, Tracking URL = http://hadoop212:8088/proxy/application_1595808781664_0001/
Kill Command = /opt/modules/hadoop-2.6.0-cdh5.7.6/bin/hadoop job -kill job_1595808781664_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-07-27 08:44:21,093 Stage-1 map = 0%, reduce = 0%
2020-07-27 08:44:30,635 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 4.95 sec
2020-07-27 08:44:39,215 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 7.34 sec
MapReduce Total cumulative CPU time: 7 seconds 340 msec
Ended Job = job_1595808781664_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 7.34 sec HDFS Read: 13774913 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 340 msec
OK
_c0
553154
Time taken: 32.698 seconds, Fetched: 1 row(s)
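As the transcript shows, with hive.mapred.mode=strict a query against a partitioned table must filter on at least one partition column. The failed select above works again once a partition predicate is added, for example:
select * from db_phone.tb_phone_info where citystr='beijing' limit 5;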
2. Exercise (Hive and Sqoop)
(1) Exercise requirements
Core table-creation code:
create table tb_yhd_0000(
id string, url string, referer string, keyword string, type string, guid string, pageId string, moduleId string, linkId string, attachedInfo string, sessionId string, trackerU string, trackerType string, ip string, trackerSrc string, cookie string, orderCode string, trackTime string, endUserId string, firstLink string, sessionViewNo string, productId string, curMerchantId string, provinceId string, cityId string, fee string, edmActivity string, edmEmail string, edmJobId string, ieVersion string, platform string, internalKeyword string, resultSum string, currentPage string, linkPosition string, buttonPosition string
)
partitioned by (daystr string,hourstr string)
row format delimited fields terminated by '\t';
load data ...'../..2818' ..(daystr='20150828' ,hourstr='18')
load data ...'../..2819' ..(daystr='20150828' ,hourstr='19')
Core query code:
pv
select count(case when length(url)=0 then null else url end) as pv ...
uv
select count(distinct guid) as uv from ...
Query result:
daystr hourstr pv uv
20150828 18 8876 2135
20150828 19 8421 3126
Finally, use Sqoop to save the result into a MySQL database.
yhd.sql -> runs the Hive commands
yhd.sqoop -> runs the Sqoop command
yhd.sh -> calls bin/hive -f /opt/datas/yhd.sql and bin/sqoop --options-file /opt/datas/yhd.sqoop
(2) Completing the exercise
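-- yhd.sql (the Hive script referenced by yhd.sh)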
create database if not exists db_yhd;
use db_yhd;
drop table if exists tb_yhd;
create table tb_yhd(
id string, url string, referer string, keyword string, type string, guid string, pageId string, moduleId string, linkId string, attachedInfo string, sessionId string, trackerU string, trackerType string, ip string, trackerSrc string, cookie string, orderCode string, trackTime string, endUserId string, firstLink string, sessionViewNo string, productId string, curMerchantId string, provinceId string, cityId string, fee string, edmActivity string, edmEmail string, edmJobId string, ieVersion string, platform string, internalKeyword string, resultSum string, currentPage string, linkPosition string, buttonPosition string)
partitioned by (daystr string,hourstr string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/2015082818' into table tb_yhd partition (daystr='20150828',hourstr='18');
load data local inpath '/opt/datas/2015082819' into table tb_yhd partition (daystr='20150828',hourstr='19');
set hive.exec.mode.local.auto=true;
select count(case when length(url)=0 then null else url end) as pv from tb_yhd group by daystr,hourstr;
select count(distinct guid) as uv from tb_yhd group by daystr,hourstr;
drop table if exists tb_yhd_result;
create table tb_yhd_result(
daystr string,
hourstr string,
pv bigint,
uv bigint
)
row format delimited fields terminated by '\t';
insert overwrite table tb_yhd_result
select daystr,hourstr , count(case when length(url)=0 then null else url end) as pv , count(distinct guid) as uv from tb_yhd group by daystr,hourstr;
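# yhd.sqoop (the Sqoop options file referenced by yhd.sh)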
export
--connect
jdbc:mysql://hadoop212:3306/db_yhd
--username
root
--password
Mysql_1234
--table
tb_yhd_result
--export-dir
'/user/hive/warehouse/db_yhd.db/tb_yhd_result'
--input-fields-terminated-by
'\t'
-m
1
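Sqoop export requires the target table to already exist in MySQL, so before running yhd.sqoop the table has to be created on the MySQL side. A minimal sketch, with column types assumed to mirror the Hive result table:
-- run in MySQL, inside the db_yhd database (types are assumptions)
create table if not exists tb_yhd_result (
daystr varchar(8),
hourstr varchar(2),
pv bigint,
uv bigint
);
The script below (yhd.sh) then ties the Hive step and the Sqoop step together.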
#! /bin/bash
# HIVE_HOME & SQOOP_HOME
HIVE_HOME=/opt/modules/hive-1.1.0-cdh5.7.6
SQOOP_HOME=/opt/modules/sqoop-1.4.6-cdh5.7.6
# run!
$HIVE_HOME/bin/hive -f /opt/datas/yhd.sql
$SQOOP_HOME/bin/sqoop --options-file /opt/datas/yhd.sqoop
II. Flume Log Collection System
1. Introduction to Flume
By now we have seen that log file data is also very important!
Log files come in many kinds: system runtime logs (hive.log), user behavior logs (2018091718), and so on; all of these count as log data.
Every day an enterprise's operations generate a great many log files across frameworks, clusters, business systems, and other dimensions. For the important log files, we need a tool that collects the logs produced in different places during operation, processes them in a unified way, and stores them in a designated location.
Flume is just such a log collection framework, written in Java.
2. Getting Flume up and running
[hadoop@hadoop212 tools]$ tar -zxf flume-ng-1.6.0-cdh5.7.6.tar.gz -C /opt/modules/
[hadoop@hadoop212 modules]$ mv apache-flume-1.6.0-cdh5.7.6-bin/ flume-1.6.0-cdh5.7.6-bin/
[hadoop@hadoop212 conf]$ mv flume-env.sh.template flume-env.sh
[hadoop@hadoop212 conf]$ vim flume-env.sh
Copy the HDFS configuration files into the conf directory; Flume will read them automatically.
[hadoop@hadoop212 conf]$ cp /opt/modules/hadoop-2.6.0-cdh5.7.6/etc/hadoop/core-site.xml /opt/modules/hadoop-2.6.0-cdh5.7.6/etc/hadoop/hdfs-site.xml ./
To operate on HDFS, Flume needs the HDFS-related jar packages. The simple approach is to copy all the jars from the Hadoop installation; here the required jars have already been picked out, so only those need to be copied.
If you need to find a jar, it is available in the Maven repository.
[hadoop@hadoop212 conf]$ vim a-hive-men-log.properties
Contents (note: do not type any Chinese characters into the file):
# Declare the 3 components (give each of them a name)
# The agent is named a1; it has 3 components: a source named s1, a channel named c1, and a sink named k1
a1.sources = s1
a1.channels = c1
a1.sinks = k1
# Now define the 3 components
# exec means executing a command
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/modules/hive-1.1.0-cdh5.7.6/logs/hive.log
# Use memory as the data staging area; capacity defines how many events it can hold
# transactionCapacity defines how many events can be taken from it per transaction
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Print the output directly to the console
a1.sinks.k1.type = logger
# Wire the 3 components together
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
This is a properties file: key = value pairs. Watch the spaces around the equals signs.
Run the agent:
[hadoop@hadoop212 flume-1.6.0-cdh5.7.6-bin]$ bin/flume-ng agent --conf /opt/modules/flume-1.6.0-cdh5.7.6-bin/conf/ --conf-file ./conf/a-hive-men-log.properties --name a1 -Dflume.root.logger=INFO,console
3. A second example
- b-hive-file-log.properties
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = exec
a1.sources.s1.command = tail -f /opt/modules/hive-1.1.0-cdh5.7.6/logs/hive.log
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/datas/checkpoint
a1.sinks.k1.type = logger
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
Run it:
[hadoop@hadoop212 flume-1.6.0-cdh5.7.6-bin]$ bin/flume-ng agent --conf /opt/modules/flume-1.6.0-cdh5.7.6-bin/conf/ --conf-file case/b-hive-file-log.properties --name a1 -Dflume.root.logger=INFO,console
The point of this example is to learn to look up the required configuration options on the official website yourself.
- The role of the checkpoint directory
4. A third example
- c-spool-mem-hdfs.properties
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = spooldir
a1.sources.s1.spoolDir = /opt/datas/flume/spool
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = hdfs
# the directory must exist in advance; no MapReduce job is invoked
a1.sinks.k1.hdfs.path = /flume/spool
a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1
Run it:
(A friendly reminder: write the configuration carefully, because when it does not work, no exception message is thrown.)
[hadoop@hadoop212 flume-1.6.0-cdh5.7.6-bin]$ bin/flume-ng agent --conf /opt/modules/flume-1.6.0-cdh5.7.6-bin/conf/ --conf-file case/c-spool-mem-hdfs.properties --name a1 -Dflume.root.logger=INFO,console
Flume renames each data file it has finished uploading by appending the suffix .COMPLETED.
Correspondingly, while the Flume process is still running, the file on the cluster carries a .tmp suffix.
Once the process stops, the .tmp suffix is removed from the cluster file automatically.
5. Extended examples
6. Examples from the official website
Supplementary assignment
Each student: write down your learning reflections and an evaluation of the teacher on a piece of paper, add your class and name, then take a photo and send it to me privately. Thank you.
Try to finish it today, and no later than the 30th. Write the reflections and the evaluation together on one sheet; there is no length requirement, and personally I think about 50 characters is enough.
Please take it reasonably seriously, since it will ultimately be handed to the school.