Hive (Part 2)

  • Hive Tuning
  • Fetch Tasks
  • Local Mode
  • Table Optimization
  • Joining Large and Small Tables
  • Large Table Join Large Table
  • Null-Key Conversion
  • Small Table JOIN Large Table (MapJoin)
  • Group By
  • Count Distinct
  • Cartesian Product
  • Row and Column Filtering
  • Dynamic Partition Tuning
  • JVM Reuse
  • Hive in Practice: GuliVideo (谷粒影音)
  • Top 10 videos by view count
  • Top 10 video categories by popularity
  • Categories of the 20 most-viewed videos and how many Top 20 videos each category contains
  • Category ranking of videos related to the Top 50 most-viewed videos
  • Top 10 most popular videos in each category (using Music as an example)
  • Top 10 videos by traffic in each category
  • Top 10 uploaders by number of videos, and their 20 most-viewed uploads


Hive Tuning

Fetch Tasks

Fetch refers to the cases where Hive can answer a query without running a MapReduce job at all. For example, for select * from emp, Hive can simply read the files under emp's storage directory and return them as the query result.
In hive-default.xml.template, hive.fetch.task.conversion defaults to more (older Hive versions defaulted to minimal). With the property set to more, full-table selects, column projections, and LIMIT queries are all served without MapReduce.

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
  </property>

You can temporarily set it to none, after which every query statement runs as a MapReduce job:

0: jdbc:hive2://hadoop01:10000>  set hive.fetch.task.conversion=none;
No rows affected (0.015 seconds)
0: jdbc:hive2://hadoop01:10000>  set hive.fetch.task.conversion;
+----------------------------------+--+
|               set                |
+----------------------------------+--+
| hive.fetch.task.conversion=none  |
+----------------------------------+--+
1 row selected (0.009 seconds)

With none, every query launches MapReduce, so in practice we keep the setting at more.
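
As a quick check (a minimal sketch that reuses the emp table from above and assumes its usual empno/ename columns), switching back to more lets a simple projection return directly as a FETCH task, with no job launched:

set hive.fetch.task.conversion=more;
-- served by a single FETCH task; no MapReduce job appears in the job history
select empno, ename from emp limit 3;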

Local Mode

Most of the time Hadoop's full scalability is exactly what is needed to process large datasets. Sometimes, however, the input to a Hive query is very small; in those cases the overhead of launching a job can exceed the actual execution time of the job itself. For such queries, Hive can process everything on a single machine in local mode, which noticeably shortens execution time on small datasets.
Set hive.exec.mode.local.auto to true to let Hive apply this optimization automatically when appropriate (the default is false).

set hive.exec.mode.local.auto=true;  -- enable local MR
-- maximum input size for local MR; local mode is used when the input is smaller than this value (default 134217728, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- maximum number of input files for local MR; local mode is used when there are fewer input files than this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;

Hands-on:

set hive.exec.mode.local.auto=true; -- enable local mode
select * from emp cluster by deptno;
14 rows selected (1.687 seconds)
-- disable local mode
set hive.exec.mode.local.auto=false;
select * from emp cluster by deptno;
14 rows selected (24.147 seconds)

Table Optimization

Joining Large and Small Tables

Put the table whose keys are relatively dispersed and whose data volume is small on the left side of the join; this effectively reduces the chance of out-of-memory errors. Going a step further, you can use MapJoin to load the small dimension table into memory and complete the join on the map side, with no reduce phase at all.
Newer versions of Hive optimize both small-table-join-large-table and large-table-join-small-table, so there is no longer a noticeable difference between the two orderings.
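
For illustration, a minimal sketch of the classic MAPJOIN hint syntax (bigtable is created later in this article; smalltable and its name column are assumed placeholders; newer Hive versions rely on hive.auto.convert.join instead and may ignore the hint):

-- ask Hive to load the small side (s) into memory and join on the map side
select /*+ MAPJOIN(s) */ b.id, s.name
from bigtable b
join smalltable s on b.id = s.id;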

Large Table Join Large Table

First, start the job history server on hadoop03:

sbin/mr-jobhistory-daemon.sh start historyserver

Job history can then be viewed at http://hadoop03:19888/jobhistory
Create the original data table, the null-id table, and the table that will hold the join result:

0: jdbc:hive2://hadoop01:10000> create table ori(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
0: jdbc:hive2://hadoop01:10000> create table nullidtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';

Load data into the original table and the null-id table:

0: jdbc:hive2://hadoop01:10000> load data local inpath '/opt/module/datas/ori' into table ori;
INFO  : Loading data to table yyx_hive.ori from file:/opt/module/datas/ori
INFO  : Table yyx_hive.ori stats: [numFiles=1, totalSize=121734744]
No rows affected (5.551 seconds)
0: jdbc:hive2://hadoop01:10000> load data local inpath '/opt/module/datas/nullid' into table nullidtable;
INFO  : Loading data to table yyx_hive.nullidtable from file:/opt/module/datas/nullid
INFO  : Table yyx_hive.nullidtable stats: [numFiles=1, totalSize=118645854]
No rows affected (3.824 seconds)

Without filtering out null ids:

0: jdbc:hive2://hadoop01:10000> insert overwrite table jointable select n.* from nullidtable n
0: jdbc:hive2://hadoop01:10000> left join ori o on o.id=n.id;
No rows affected (43.064 seconds)

Filtering out null ids:

0: jdbc:hive2://hadoop01:10000> insert overwrite table jointable select n.* from (select * from nullidtable where id is not null) n
0: jdbc:hive2://hadoop01:10000> left join ori o on o.id=n.id;
No rows affected (32.962 seconds)

The dataset here is fairly small, so the difference may not be very pronounced.

Null-Key Conversion

Sometimes a null key corresponds to a large amount of data, yet those rows are not bad data and must be kept in the join result. In that case we can assign a random value to the null key fields of table a so that the rows are distributed randomly and evenly across the reducers.
Hands-on:
First, set the number of reducers to 5:

0: jdbc:hive2://hadoop01:10000> set mapreduce.job.reduces=5;

Join the two tables:

insert overwrite table jointable
select n.* from nullidtable n left join ori b on n.id = b.id;

Because of data skew, one reducer takes noticeably longer than the others (visible in the job history).

Now distribute the null keys randomly:

insert overwrite table jointable
select n.* from nullidtable n full join ori o on 
case when n.id is null then concat('hive', rand()) else n.id end = o.id;

You can see that the data skew is gone and the reducers' resource consumption is balanced.

Small Table JOIN Large Table (MapJoin)

If MapJoin is not specified, or the query does not meet the MapJoin conditions, the Hive parser converts the join into a Common Join, i.e. the join is completed in the reduce phase, which is prone to data skew. MapJoin instead loads the small table entirely into memory and performs the join on the map side, so no reducer has to process it.
Enable MapJoin:

set hive.auto.convert.join = true;

Threshold below which a table counts as small (default 25 MB).
The threshold is kept this low because loading a much larger table into every mapper's memory would make the I/O cost too high.

set hive.mapjoin.smalltable.filesize=25000000;
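
With auto conversion on, an ordinary join against a sufficiently small table is rewritten into a map join with no reduce stage. A minimal sketch (bigtable is created earlier; smalltable and its name column are assumed placeholders under the 25 MB threshold):

-- the small side is broadcast to every mapper, so the join finishes on the map side
select b.id, s.name
from bigtable b
join smalltable s on b.id = s.id;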

Group By

By default, rows with the same key produced in the map phase are all sent to one reducer, so if a single key has too many rows, data skew occurs. Not every aggregation has to be completed on the reduce side: many aggregations can be partially performed on the map side first, with the final result computed on the reduce side.
Enable map-side aggregation:

set hive.map.aggr = true

Number of rows aggregated on the map side before checking whether the aggregation is effective:

set hive.groupby.mapaggr.checkinterval = 100000

Perform load balancing when there is data skew (default false):

set hive.groupby.skewindata = true

When this is set to true, the generated query plan has two MR jobs. In the first job, the map output is distributed randomly across the reducers, each of which does a partial aggregation; this is what absorbs the skew. The second job then distributes the partially aggregated results by the group-by key and computes the final aggregation, balancing the load across the reducers.
This is slower on small inputs, but for large, skewed datasets it is worth it overall.
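
Putting the three settings together (a sketch; the ori table created in the join examples above is reused, and its id column is assumed to be heavily skewed):

set hive.map.aggr = true;
set hive.groupby.mapaggr.checkinterval = 100000;
set hive.groupby.skewindata = true;
-- counts are partially aggregated on the map side and rebalanced across two MR jobs
select id, count(*) cnt
from ori
group by id;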

Count Distinct

With small data volumes this hardly matters, but with large volumes it does: because COUNT(DISTINCT) is a full aggregation, Hive uses a single reducer no matter how many reduce tasks you configure. That one reducer then has to process far too much data, making the whole job hard to finish. COUNT(DISTINCT) is therefore generally replaced by GROUP BY followed by COUNT.
Hands-on:
Create a large table:

0: jdbc:hive2://hadoop01:10000>  create table bigtable(id bigint, time bigint, uid string, keyword
0: jdbc:hive2://hadoop01:10000> string, url_rank int, click_num int, click_url string) row format delimited
0: jdbc:hive2://hadoop01:10000> fields terminated by '\t';

Load data:

0: jdbc:hive2://hadoop01:10000> load data local inpath '/opt/module/hive/datas/bigtable' into table
0: jdbc:hive2://hadoop01:10000>  bigtable;

Set the number of reducers to 5:

set mapreduce.job.reduces = 5;

Count distinct ids:

0: jdbc:hive2://hadoop01:10000> select count(distinct id) from bigtable;
+---------+--+
|   c0    |
+---------+--+
| 100001  |
+---------+--+
1 row selected (39.255 seconds)

Deduplicate the ids with GROUP BY instead:

0: jdbc:hive2://hadoop01:10000> select count(id) from (select id from bigtable group by id) a;
+---------+--+
|   _c0   |
+---------+--+
| 100001  |
+---------+--+
1 row selected (62.758 seconds)

The GROUP BY version is slower here because the subquery adds an extra job, but on very large datasets it avoids funneling all the data through a single reducer, which makes it worthwhile.

Cartesian Product

Avoid Cartesian products whenever possible: Hive can only use a single reducer to compute a Cartesian product, and the resulting rows are usually meaningless.
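
For illustration, a minimal sketch reusing the emp and dept tables mentioned elsewhere in this article (their ename/dname/deptno columns are assumed): a join without a join key degenerates into a Cartesian product, while an equi-join does not.

-- avoid: no join condition, so every emp row is paired with every dept row in a single reducer
select e.ename, d.dname from emp e cross join dept d;

-- prefer: an explicit join key
select e.ename, d.dname from emp e join dept d on e.deptno = d.deptno;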

Row and Column Filtering

Column pruning: in the SELECT list, take only the columns you need, make use of partition filters, and avoid SELECT *.
Row pruning: with partition pruning, when joining against another table, if the filter on the secondary table is written in the WHERE clause, the full tables are joined first and the filter is applied afterwards. For example:

select o.id from bigtable b
join ori o on o.id = b.id
where o.id <= 10;
100 rows selected (44.563 seconds)

Instead, push the filter into a subquery first and then join:

select b.id from bigtable b
join (select id from ori where id <= 10 ) o on b.id = o.id;
100 rows selected (42.128 seconds)

Dynamic Partition Tuning

When inserting into a partitioned table, a relational database can automatically route each row into the right partition based on the value of the partition column; Hive provides a similar mechanism, called dynamic partitioning.
Enable dynamic partitioning:

hive.exec.dynamic.partition=true

Set non-strict mode:

set hive.exec.dynamic.partition.mode=nonstrict

Maximum total number of dynamic partitions that can be created across all MR nodes
(default 1000)

set hive.exec.max.dynamic.partitions=1000

On each individual MR node, the maximum number of dynamic partitions must be set according to the actual data. For example, if the data covers a whole year and the day field has 365 distinct values, the per-node limit has to be greater than 365.
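
The per-node limit is controlled by hive.exec.max.dynamic.partitions.pernode (default 100); for the example above it would be raised accordingly, e.g.:

set hive.exec.max.dynamic.partitions.pernode=366
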
Maximum number of HDFS files that can be created in the whole MR job
(default 100000)

hive.exec.max.created.files=100000

Whether to throw an exception when an empty partition is generated
(default false)

hive.error.on.empty.partition=false

Hands-on:
Insert the data from the dept table into the corresponding partitions of the target table dept_partition according to the loc field.
Create the target table:

create table dept_partition(id int, name string) partitioned
by (location int) row format delimited fields terminated by '\t';

Insert the data:

0: jdbc:hive2://hadoop01:10000> insert into table dept_partition partition(location) select deptno, dname, loc from dept;

Check the partitions of the target table:

0: jdbc:hive2://hadoop01:10000> show partitions dept_partition;
+----------------+--+
|   partition    |
+----------------+--+
| location=1700  |
| location=1800  |
| location=1900  |
+----------------+--+
3 rows selected (1.49 seconds)

JVM Reuse

JVM reuse is a Hadoop tuning parameter, but it has a large impact on Hive performance, particularly in scenarios where small files are hard to avoid or where a job has a very large number of tasks, most of which run only briefly.
Hadoop's default configuration spawns a new JVM for each map and reduce task. JVM startup can be a significant overhead, especially when a job contains hundreds or thousands of tasks. JVM reuse allows a JVM instance to be reused N times within the same job. N is configured in Hadoop's mapred-site.xml, usually somewhere between 10 and 20; the exact value should be determined by testing against the actual workload.

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit. 
  </description>
</property>

The drawback of this feature is that JVM reuse keeps the task slots it has used occupied for reuse and does not release them until the job finishes. If an unbalanced job has a few reduce tasks that run far longer than the rest, the reserved slots sit idle yet cannot be used by other jobs until every task has completed.

Hive in Practice: GuliVideo (谷粒影音)

Requirements:

–Top 10 videos by view count

–Top 10 video categories by popularity

–Categories of the Top 20 most-viewed videos

–Category ranking of videos related to the Top 50 most-viewed videos

–Top 10 most popular videos in each category

–Top 10 videos by traffic in each category

–Top 10 uploaders by number of videos and the videos they uploaded

–Top 10 videos by view count in each category

Video table (schema): videoId, uploader, age, category, length, views, rate, ratings, comments, relatedId

User table (schema): uploader, videos, friends

We first clean and restructure the data: all of a video's categories are joined with "&" and the surrounding spaces removed, and the multiple related video ids are likewise joined with "&".

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ETLForVideoTable {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());
        job.setMapperClass(ETLForVideoTableMap.class);
        job.setNumReduceTasks(0);
        job.setJarByClass(ETLForVideoTable.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        //设置输入输出参数
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class ETLForVideoTableMap extends Mapper<LongWritable, Text, Text, NullWritable> {
    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // get the raw line
        String line = value.toString();
        // clean the line with a helper method
        String data = eTLVideo(line);

        // if the cleaned result is null, the record is invalid
        if (data == null) {
            context.getCounter("ETL", "False").increment(1);
        } else {
            context.getCounter("ETL", "True").increment(1);
            k.set(data);
            context.write(k, NullWritable.get());
        }
    }

    /**
     * Cleans one line of raw data.
     *
     * @param line the raw input line
     * @return the cleaned line, or null if the record is invalid
     */
    private String eTLVideo(String line) { // handle the malformed case first
        String[] split = line.split("\t");
        StringBuffer stringBuffer = new StringBuffer();
        if (split.length < 9) {
            // fewer than 9 fields means the record is malformed, so drop it
            return null;
        }
        // remove the spaces inside field 3 (the category field)
        split[3] = split[3].replace(" ", "");
        for (int i = 0; i < split.length; i++) { // iterate over all fields
            // last field: no trailing separator
            if (i == split.length - 1) {
                stringBuffer.append(split[i]);
            } else if (i < 9) { // first 9 fields are tab-separated
                stringBuffer.append(split[i]).append("\t");
            } else { // remaining related-id fields are joined with "&"
                stringBuffer.append(split[i]).append("&");
            }
        }
        return stringBuffer.toString();
    }
}

Upload the data to HDFS and run the ETL job on the video data.

Create the tables and load the data

create table gulivideo_ori(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited 
fields terminated by "\t"
collection items terminated by "&"
stored as textfile;
create table gulivideo_user_ori(
    uploader string,
    videos int,
    friends int)
row format delimited 
fields terminated by "\t" 
stored as textfile;
create table gulivideo_orc(
    videoId string, 
    uploader string, 
    age int, 
    category array<string>, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int,
    relatedId array<string>)
row format delimited fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;
create table gulivideo_user_orc(
    uploader string,
    videos int,
    friends int)
row format delimited 
fields terminated by "\t" 
stored as orc;

Load the data:

load data inpath "/gulivideo/output/video/2008/0222" into table gulivideo_ori;
load data inpath "/gulivideo/user/2008/0903" into table gulivideo_user_ori;
insert into table gulivideo_orc select * from gulivideo_ori;
insert into table gulivideo_user_orc select * from gulivideo_user_ori;

Business analysis

Top 10 videos by view count

Simply order by views in descending order and limit the output to 10 rows:

0: jdbc:hive2://hadoop01:10000> select
0: jdbc:hive2://hadoop01:10000> videoid,
0: jdbc:hive2://hadoop01:10000> age,
0: jdbc:hive2://hadoop01:10000> uploader,
0: jdbc:hive2://hadoop01:10000> length,
0: jdbc:hive2://hadoop01:10000> category,
0: jdbc:hive2://hadoop01:10000> views
0: jdbc:hive2://hadoop01:10000> from gulivideo_ori
0: jdbc:hive2://hadoop01:10000> order by views desc
0: jdbc:hive2://hadoop01:10000> limit 10;
+--------------+------+------------------+---------+---------------------+-----------+--+
|   videoid    | age  |     uploader     | length  |      category       |   views   |
+--------------+------+------------------+---------+---------------------+-----------+--+
| dMH0bHeiRNg  | 415  | judsonlaipply    | 360     | ["Comedy"]          | 42513417  |
| 0XxI-hvPRRA  | 286  | smosh            | 194     | ["Comedy"]          | 20282464  |
| 1dmVU08zVpA  | 670  | NBC              | 165     | ["Entertainment"]   | 16087899  |
| RB-wUgnyGv0  | 506  | ChrisInScotland  | 159     | ["Entertainment"]   | 15712924  |
| QjA5faZF1A8  | 308  | guitar90         | 320     | ["Music"]           | 15256922  |
| -_CSo1gOd48  | 190  | tasha            | 205     | ["People","Blogs"]  | 13199833  |
| 49IDp76kjPw  | 381  | TexMachina       | 59      | ["Comedy"]          | 11970018  |
| tYnn51C3X_w  | 516  | CowSayingMoo     | 231     | ["Music"]           | 11823701  |
| pv5zWaTEVkI  | 531  | OkGo             | 184     | ["Music"]           | 11672017  |
| D2kJZOfq7zk  | 199  | mrWoot           | 185     | ["People","Blogs"]  | 11184051  |
+--------------+------+------------------+---------+---------------------+-----------+--+
10 rows selected (36.279 seconds)

Top 10 video categories by popularity

That is, count how many videos each category contains and show the 10 categories with the most videos.
First, explode the category column (it is an array); call this subquery t1:

select
	EXPLODE(category) as category_name
from gulivideo_ori; t1

Then group by category, count, and order by the count (the popularity):

select category_name,count(*) num -- number of videos in each group
from t1
group by category_name -- group by category
order by num desc
limit 10;
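
The two steps combined into a single runnable statement (a sketch that simply nests the t1 step above as a subquery):

select
    category_name,
    count(*) num
from (
    select explode(category) as category_name
    from gulivideo_ori
) t1
group by category_name
order by num desc
limit 10;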

Categories of the 20 most-viewed videos and how many Top 20 videos each category contains

First, find the 20 most-viewed videos; call this t1:

select views,category
from gulivideo_ori
order by views desc
limit 20;t1

Explode their categories; call this t2 (the empty parentheses stand for the t1 subquery above):

select explode(category) as category_name
from ()t1;t2

Count how many of the Top 20 videos each category contains:

select 
	count(*) num,
	category_name
from t2
group by category_name
order by num desc
limit 20;
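
Assembled into one statement (a sketch that nests the t1 and t2 steps above):

select
    category_name,
    count(*) num
from (
    select explode(category) as category_name
    from (
        select views, category
        from gulivideo_ori
        order by views desc
        limit 20
    ) t1
) t2
group by category_name
order by num desc;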

Category ranking of videos related to the Top 50 most-viewed videos

First, get the Top 50 videos by views together with their related video ids; call this t1:

select videoId,views,relatedid
from gulivideo_ori
order by views desc
limit 50; t1

Explode the related ids; call this t2:

select explode(relatedid) exid
from t1;t2

Find the category each exid belongs to by joining back to the original table on videoId = exid; call this t3:

select g.category gc,views
from t2
join gulivideo_ori g
on g.videoId = t2.exid;t3

The result is an array, so explode it; call this t4:

select explode(gc) category_name
from t3;t4

Finally, group by the exploded category and order by the count:

select
category_name,
count(category_name) count_category_name
from t4
group by category_name
order by count_category_name desc;
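
The whole chain combined (a sketch that nests the t1-t4 steps above; the unused views column is dropped from t3):

select
    category_name,
    count(category_name) count_category_name
from (
    select explode(gc) category_name
    from (
        select g.category gc
        from (
            select explode(relatedid) exid
            from (
                select videoId, views, relatedid
                from gulivideo_ori
                order by views desc
                limit 50
            ) t1
        ) t2
        join gulivideo_ori g on g.videoId = t2.exid
    ) t3
) t4
group by category_name
order by count_category_name desc;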

Top 10 most popular videos in each category (using Music as an example)

select
views,
videoId,
category_name
from gulivideo_ori
lateral view explode(category) tbl as category_name
where
category_name = 'Music'
order by views desc
limit 10;

Since exploding category for every query is tedious, create a new table with the category column already exploded and load it:

create table gulivideo_category(
    videoId string, 
    uploader string, 
    age int, 
    categoryId string, 
    length int, 
    views int, 
    rate float, 
    ratings int, 
    comments int, 
    relatedId array<string>)
row format delimited 
fields terminated by "\t" 
collection items terminated by "&" 
stored as orc;
insert into table gulivideo_category  
    select 
        videoId,
        uploader,
        age,
        categoryId,
        length,
        views,
        rate,
        ratings,
        comments,
        relatedId 
    from 
        gulivideo_orc lateral view explode(category) catetory as categoryId;

Now the query becomes:

select
    videoId,
    views,
    categoryId
from
    gulivideo_category
where
    categoryId='Music'
order by views desc
limit 10;

Top 10 videos by traffic in each category

This is a grouped, top-N-within-group problem, so use a window function (rank over a partition ordered within each group).
Rank the videos within each category; call this t1:

select 
    videoid,
    ratings,
    categoryid,
    rank() over(partition by categoryId order by ratings desc) rk
from gulivideo_category;t1

Then filter by rank:

select 
    videoid,
    ratings,
    categoryid
from t1
where rk<=3; -- 3 instead of 10 just to keep the output short
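
Combined into one statement (a sketch following the t1 step above, keeping the full Top 10 this time):

select
    videoid,
    ratings,
    categoryid
from (
    select
        videoid,
        ratings,
        categoryid,
        rank() over(partition by categoryId order by ratings desc) rk
    from gulivideo_category
) t1
where rk <= 10;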

Top 10 uploaders by number of videos, and their 20 most-viewed uploads

Find the 10 users who uploaded the most videos; call this t1:

select 
    uploader,
    videos
from 
    gulivideo_user_ori
order by videos desc 
limit 10;t1

Join with the video table to get all the videos they uploaded; call this t2:

select 
    videoid,
    views
from gulivideo_ori
join t1 on t1.uploader=gulivideo_ori.uploader;t2

Finally, take their 20 most-viewed videos:

select views,
    videoid
from t2
order by views desc
limit 20;
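
The full statement assembled from the three steps (a sketch):

select
    views,
    videoid
from (
    select
        videoid,
        views
    from gulivideo_ori
    join (
        select
            uploader,
            videos
        from gulivideo_user_ori
        order by videos desc
        limit 10
    ) t1 on t1.uploader = gulivideo_ori.uploader
) t2
order by views desc
limit 20;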