Hive UDFs and viewing a table's structure and creation time

Reposted

mob64ca13f38b94 2024-02-20 11:13:02


1. Table operations

Create a table (make sure nothing such as spaces or blank lines precedes the statement, to avoid assorted errors)

create table if not exists employees( 
 name string, 
 salary float, 
 subordinates array<string>, 
 deductions map<string,float>, 
 address struct<street:string,city:string,state:string,zip:int> 
 ) 
 row format delimited fields terminated by '\t' 
 collection items terminated by ',' 
 map keys terminated by ':' 
 lines terminated by '\n' 
 stored as textfile 
 location '/data/';

View the CREATE TABLE statement (in testing, Hive 0.9 did not support the statement below; Hive 0.14 does)

show create table employees;

View the table structure in formatted form

desc formatted employees;

If you only need the column names and types, you can also use desc employees;

Drop a table: drop table employees;

List tables: show tables;

List databases: show databases;

Switch to the default database: use default;

Table contents (one record per line; the fields are separated by tabs in the data file):

wang	123	a1,a2,a3	k1:1,k2:2,k3:3	s1,s2,s3,4
li	235	a4,a5,a6	k4:4,k5:5,k6:6	s4,s5,s6,9
zhao	878	b1,b2,b3	q1:1,q2:2,q3:3	f1,f2,f3,4

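Load data (loading from the local file system requires local; without it the path is treated as an HDFS path, so the data must be uploaded to HDFS first)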

load data local inpath '/usr/local/opt/data/mydata' overwrite into table employees;

Query data (reading elements of the array, map, and struct columns)

select name,subordinates[0],deductions["k1"],address.city from employees;
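
To make the delimiters declared on the employees table more concrete, here is a minimal sketch in plain Java (not Hive code) that splits one sample line the same way: tab between fields, comma between collection items, colon between map keys and values. The class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; Hive does this inside its serde.
class DelimitedRowSketch {
    // "fields terminated by '\t'": one column per tab-separated field.
    static String[] fields(String line) {
        return line.split("\t");
    }

    // "collection items terminated by ','": array elements and struct members.
    static List<String> items(String field) {
        return Arrays.asList(field.split(","));
    }

    // "map keys terminated by ':'": entries split on ',', key and value on ':'.
    static Map<String, Float> map(String field) {
        Map<String, Float> result = new LinkedHashMap<>();
        for (String entry : field.split(",")) {
            String[] kv = entry.split(":");
            result.put(kv[0], Float.parseFloat(kv[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "wang\t123\ta1,a2,a3\tk1:1,k2:2,k3:3\ts1,s2,s3,4";
        String[] f = fields(line);
        // Mirrors: select name,subordinates[0],deductions["k1"],address.city
        System.out.println(f[0]);                // name
        System.out.println(items(f[2]).get(0));  // subordinates[0]
        System.out.println(map(f[3]).get("k1")); // deductions["k1"]
        System.out.println(items(f[4]).get(1));  // address.city, the second struct member
    }
}
```

The struct is positional: city is the second member of address, so it is the second comma-separated item of the last field.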

Create a table from another table

create table test1 like employees;

create table test2 as select name,address from employees;

Reading files of the different Hive storage formats

stored as textfile

① View the file directly on HDFS

② hadoop fs -text

stored as sequencefile

① hadoop fs -text

stored as rcfile

① hive --service rcfilecat path

stored as inputformat 'class'

① outputformat 'class'

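2. Loading custom jars in Hive

Method 1: copy the jar into Hive's lib directory, then restart the client;

Method 2: in the Hive client command line, run: add jar PATH;

3. Partitioned tables (delete every comment from the statement before creating the table, otherwise the creation fails)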

create table if not exists employees_c( 
 name string, 
 salary float, 
 subordinates array<string>, 
 deductions map<string,float>, 
 address struct<street:string,city:string,state:string,zip:int> 
 ) 
 partitioned by (country string,date string)//declares the partition columns 
 row format delimited fields terminated by '\t' 
 collection items terminated by ',' 
 map keys terminated by ':' 
 lines terminated by '\n' 
 stored as textfile 
 location '/data/';

Add a partition to an existing table

alter table employees_c add if not exists partition(country="cn",date="20150502");

Drop a partition from a table that already has it

alter table employees_c drop if exists partition(country="cn",date="20150502");

Show the partitions

show partitions employees_c;

4. Bucketed tables

create table bucketed_user( 
 id int, 
 name string 
 )

//cluster on id, sort by name, and store the rows in 4 buckets

clustered by (id) sorted by (name) into 4 buckets

stored as textfile;

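For bucketing to take effect, run the following command first, or change the configuration file:

set hive.enforce.bucketing=true;

Load the data (load data on its own cannot produce the bucketed layout)

insert overwrite table bucketed_user select sname,saddr from test1;

Simple query

select * from bucketed_user where id = 1;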

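5. Commands listed by hive -help

hive -e "select * from user_table;"

This can be used in shell scripts to fetch query results without entering the Hive CLI; hive -f sqlfilePath achieves the same effect.

hive -v -f a.txt > ./res.txt

With -v, the SQL statements are also printed into res.txt along with the results.

The following commands can be used inside the Hive CLI:

list jar;

list file;//lists the jars and files added to the Hadoop cluster's distributed cache

source "/usr/local/opt/sql/hive_sql"

//runs the HQL contained in a file from within the CLI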

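6. Hive variables

set val='';//sets a variable

${hiveconf:val} //reads the variable

Environment variables: ${env:HOME}, where the env namespace exposes the environment variables

Usage in the Hive CLI

set val="test";

select * from employees where name=${hiveconf:val};

select '${env:HOME}' from employees;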

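7. Loading data

Loading data into managed (internal) tables

① Load while creating the table

create table newTable as select col1,col2 from oldTable

② Specify the data location when creating the table

create table tableName(...) location ''

③ Load local data

load data local inpath 'localPath' [overwrite] into table tableName

Note: overwrite replaces the table contents, that is, the existing data is deleted before the new data is loaded

④ Load HDFS data

load data inpath 'hdfsPath' [overwrite] into table tableName

Note: loading HDFS data moves the files; it does not copy them

⑤ Copy the data to the target location with hadoop commands (this works from both the Hive shell and the Linux shell)

From the command line, run hadoop fs -copyFromLocal localPath hiveTablePath

A query in Hive then shows that the data has been loaded.

Hadoop commands can also be run inside the Hive CLI, for example dfs -ls /; so the command above can also be written as

hive > dfs -copyFromLocal localPath hiveTablePath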

⑥ Load data from a query

insert [into|overwrite] table tableName 
 select col1,col2 
 from table  
 where ...

Alternative form:

from table 
 insert [into|overwrite] table tableName 
 select col1,col2 
 where ...

Note: unlike in some relational databases, columns are matched by position, not by name

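Loading data into external tables

① Specify the data location when creating the table

create external table tableName(...) location '..'

② Loading from a query and loading through shell commands work the same as for managed tables

Loading data into partitioned tables:

Managed partitioned tables are loaded like managed tables

External partitioned tables are loaded like external tables

Note: the directory hierarchy of the data must match the table's partitions

If the partition has not been added to the table, the data cannot be queried even when it already sits under the target path

Difference: the partition must be specified along with the target table when loading

Loading with a partition clause

eg:

load data local inpath 'linuxPath' overwrite into table tableName partition (pn='')

insert into table tableName partition (pn='') select col1,col2 from tableName2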

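Things to watch when loading data into Hive

① Delimiters: by default a delimiter is a single character

② Data type matching

With load, a field that cannot be converted to the column type is returned as NULL by queries

With insert ... select, a field that cannot be converted to the column type is inserted as NULL

③ With insert ... select, the order of the selected values must match the column order of the table; the names may differ

Hive does not validate data while loading; it checks when the data is queried

④ An external partitioned table shows its data only after the partitions are added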

8. Running hadoop commands and Linux shell commands from the Hive shell

hive> dfs -copyFromLocal /usr/local/opt/data1 /data;

dfs -ls /;

Other commands work similarly

Running Linux shell commands (prefix them with !)

eg: !ls /usr/local/;

9. Exporting data from Hive

① hadoop commands

get

eg: hadoop fs -get hdfsPath linuxPath (hadoop fs -get /data/* /usr/local/opt/my/data/)

text

eg: hadoop fs -text hdfsPath > file

② Using insert ... directory

insert overwrite [local] directory '/linuxPath' [row format delimited fields terminated by '\t']

select name,salary,address from employees;

Without local, the row format ... clause is not supported either

③ Shell commands with pipes: hive -f/e|sed/grep/awk > file

④ Third-party tools (sqoop)

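10. Hive dynamic partitions

Parameters:

① set hive.exec.dynamic.partition=true;//enables dynamic partitioning

② set hive.exec.dynamic.partition.mode=nonstrict|strict;//nonstrict imposes no restriction; in strict mode at least one partition must be static, and the static partitions must come first

③ set hive.exec.max.dynamic.partitions.pernode=10000;//maximum number of dynamic partitions created per node

④ set hive.exec.max.dynamic.partitions=100000;//maximum number of dynamic partitions created in total

⑤ set hive.exec.max.created.files=1500000;//maximum number of files one job may create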

⑥ set dfs.datanode.max.xcievers=8192;//limits how many files can be open at once (an HDFS DataNode property)

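It is advisable to keep the number of partitions a table creates per day below 1000, to avoid problems in MySQL (commonly used as the metastore database)

hql: insert overwrite table dy_partition_table partition(splitName) //splitName is the partition column

select name,addr as splitName from oldTable;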

11. Table property operations

Rename a table

alter table tableName rename to newTableName;

Rename a column

alter table tableName change column c1 c2 int comment '..' after severity;

//without after or first the column keeps its position; after places it behind the named column, and first moves it to the front

eg:

alter table employee change column type type string after address;

alter table employee change column type type string first;

Add columns

alter table tableName add columns(c1 string comment '..',c2 bigint comment 'xx');

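Modify tblproperties

alter table tableName set tblproperties(property_name=property_value,property_name=property_value,...);

Serde properties are changed differently for non-partitioned and partitioned tables

Non-partitioned table (changing the field delimiter)

alter table tableName set serdeproperties('field.delim'='\t');

Note: partitions that already exist do not pick up the newly changed property

Partitioned table (changing the field delimiter)

alter table test1 partition(dt='xx') set serdeproperties('field.delim'='\t');

Change the location

alter table tableName [partition()] set location 'path'

Convert a managed table to an external table

alter table tableName set tblproperties('EXTERNAL'='TRUE');

Convert an external table to a managed table

alter table tableName set tblproperties('EXTERNAL'='FALSE');

The LanguageManual DDL page on the Hive wiki documents the alter table operations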

Dynamic partitions:

set hive.exec.dynamic.partition=true;//enables dynamic partitioning

With set hive.exec.dynamic.partition.mode=nonstrict;

dynamic partition inserts need no static partition

eg: insert overwrite table test_part partition(dt,value)

select 'abc' as name,createDate as dt, addr as value from testext;

With set hive.exec.dynamic.partition.mode=strict;

at least the first partition must be static when inserting into dynamic partitions

eg: insert overwrite table test_part partition(dt='20150505',value)

select 'abc' as name, addr as value from testext;

12. Advanced Hive queries

Aggregation

1) count

count(*) count(1) count(col)

count(*) counts every row, including rows in which all values are NULL

count(1) behaves the same way and also counts such rows; count(col) skips rows where col is NULL

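2) sum

sum(values convertible to numbers) returns bigint for integer input and double for floating-point input

sum(col)+cast(1 as bigint)//adds one to the total

3) avg

avg(values convertible to numbers) returns double

4) distinct

count(distinct col)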

Order by

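select col1,col2 from test where condition order by col1,col2 [asc|desc]

Note: order by can sort on several columns; the default order is ascending, and strings are compared lexicographically

order by is a global sort

order by needs a reduce phase and always runs with a single reducer, whatever the configuration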

group by

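Groups rows by the values of the given columns, putting equal values together

select col1[,col2],count(1),sel_expr(aggregate) from table where condition 
group by col1[,col2] [having]

Note: every non-aggregated column in the select list must appear in group by

Everything in the select list other than those plain columns must be an aggregate

group by may also take an expression, such as substr(col)

Characteristics: it uses a reduce phase and is bounded by the number of reducers, which is set with: set mapred.reduce.tasks=5;

The number of output files equals the number of reducers, and each file's size depends on how much data its reducer processed

Problems: heavy network load

Data skew; tuning parameter: set hive.groupby.skewindata=true;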

join

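Tables m and n are joined according to the on condition; one record from m and one record from n form a new record

join: equi-join; the key value must exist in both m and n

left outer join: every row of the left table is output whether or not it has a match in the right table; right-table values are output only for rows that match the left table

right outer join: the mirror image of left outer join

left semi join: similar to exists

mapjoin: completes the join on the map side without a reduce phase; it joins in memory and is an optimization

Explanation: the map side loads the small table into memory, then reads the large table and joins it against the in-memory small table

It relies on the distributed cache

Pros and cons: it consumes no reduce resources of the cluster (which are relatively scarce)

It removes the reduce phase and speeds up execution

It lowers the network load

It occupies memory, so the table loaded into memory must not be too large, because every compute node loads it once

It produces a larger number of small files

With the following parameter set, Hive chooses between a common join and a map join based on the query

set hive.auto.convert.join=true;

hive.mapjoin.smalltable.filesize defaults to 25 MB

The second way is to request it by hand:

select /*+mapjoin(n)*/ m.col,m.col2,n.col3 from m join n on m.col = n.col;

Neither the /* nor the + can be omitted

When to use mapjoin

1) One of the joined tables is very small

2) Non-equi joins

If the join suffers from data skew, use the tuning parameter: set hive.optimize.skewjoin=true;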
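
The map-side join described above boils down to a hash lookup. The following is a rough single-process sketch in plain Java, not Hive's implementation: the small table is loaded into a HashMap, and the large table is streamed past it. The class name and the {key, value} row layout are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: rows are {key, value} pairs held in memory.
class MapJoinSketch {
    static List<String> join(List<String[]> large, List<String[]> small) {
        // Build phase: hash the small table by join key.
        Map<String, List<String>> lookup = new HashMap<>();
        for (String[] row : small) {
            lookup.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        // Probe phase: stream the large table; no shuffle or reduce step is involved.
        List<String> out = new ArrayList<>();
        for (String[] row : large) {
            List<String> matches = lookup.get(row[0]);
            if (matches == null) {
                continue; // inner join: rows without a match are dropped
            }
            for (String match : matches) {
                out.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> m = Arrays.asList(
                new String[]{"1", "a"}, new String[]{"2", "b"}, new String[]{"3", "c"});
        List<String[]> n = Arrays.asList(
                new String[]{"1", "x"}, new String[]{"3", "y"});
        join(m, n).forEach(System.out::println);
    }
}
```

Because the lookup table has to fit in memory on every node that runs a mapper, the same sketch also shows why the small table must stay small.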

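Bucketing

Partitioning is usually sufficient

Each table or partition can be further divided into buckets, which split the data at a finer granularity

Hive buckets on a specific column

Hive hashes the column value and takes the remainder of dividing by the number of buckets to decide which bucket stores the record

Benefits:

More efficient query processing

More efficient sampling

Using buckets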

select * from bucketed_user tablesample(bucket 1 out of 2 on id)
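
The "hash, then take the remainder" rule can be sketched in a few lines of Java. This is an approximation under two assumptions that are not stated in the text: that the hash of an int column is the int value itself, and that the sign bit is masked off before the modulo so the result is never negative.

```java
// Illustration only; the class and method names are hypothetical.
class BucketSketch {
    // Bucket index = non-negative hash of the clustering column, modulo the bucket count.
    static int bucketFor(int hash, int numBuckets) {
        return (hash & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 4; // bucketed_user is declared with 4 buckets
        for (int id = 1; id <= 8; id++) {
            // Assumption: an int id hashes to its own value.
            System.out.println("id " + id + " -> bucket " + bucketFor(Integer.hashCode(id), numBuckets));
        }
    }
}
```

Rows with the same id always land in the same bucket, which is what sampling and bucket joins rely on.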

bucket join

The following settings are required before bucket join can be used

set hive.optimize.bucketmapjoin=true; 
set hive.optimize.bucketmapjoin.sortedmerge=true; 
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

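Two tables bucketed on the same columns (which include the join columns) can be joined efficiently with a map-side join. When both tables of a join are bucketed on the shared join column, only the buckets that hold the same column values need to be joined with each other, which greatly reduces the amount of data joined.

For a map-side join, both tables are bucketed in the same way. The mapper that processes a bucket of the left table knows that the matching rows of the right table are in the corresponding bucket, so it only needs to fetch that bucket (a small fraction of the data stored in the right table) to perform the join. The optimization does not require both tables to have the same number of buckets; bucket counts that are multiples of one another also work.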

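distribute: spreading data

distribute by col//spreads the rows across reducers by the col column

sort: ordering

sort by col //sorts the rows by the col column

select col1,col2 from m_table distribute by col1 sort by col1 asc,col2 desc;

distribute and sort usually appear together, so that the output of every reducer is ordered

Use cases:

Map output files of uneven size

Reduce output files of uneven size

Too many small files

Oversized files

Comparison:

distribute by versus group by

Both divide the data by key

Both use a reduce phase

The only difference: distribute by merely spreads the data, while group by gathers rows with the same key together and must be followed by an aggregation

order by versus sort by

order by is a global sort

sort by only guarantees that the output of each reducer is ordered; with a single reducer it behaves like order by

cluster by

Gathers rows with equal values together and sorts them

cluster by col is equivalent to distribute by col sort by col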
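
The difference between a per-reducer sort and a global sort is easier to see with a small simulation. The sketch below is plain Java, not Hive internals: it routes keys to simulated reducers by hash (distribute by) and sorts inside each reducer (sort by). The routing function is an assumption chosen for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: each inner list stands for the output of one reducer.
class DistributeSortSketch {
    static List<List<String>> distributeAndSort(List<String> keys, int reducers) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < reducers; i++) {
            parts.add(new ArrayList<>());
        }
        for (String key : keys) {
            // distribute by: equal keys always go to the same reducer
            parts.get((key.hashCode() & Integer.MAX_VALUE) % reducers).add(key);
        }
        for (List<String> part : parts) {
            Collections.sort(part); // sort by: ordering holds only within one reducer
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("b", "a", "d", "c", "a");
        System.out.println(distributeAndSort(keys, 2));
    }
}
```

Each reducer's list is ordered, but concatenating the lists does not give a globally ordered result, which is the gap that order by closes by using a single reducer.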

union all

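Merges the data of several tables into one; Hive versions before 1.2.0 support only union all, not plain union

select col from ((select a as col from t1) union all (select b as col from t2))tmp

Requirements:

Same column names

Same column types

Same number of columns

The subqueries must not have aliases

To query the merged result, the merged table needs an alias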

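13. Functions

1) List the functions available in the current session

show functions;

2) Show a function's description

desc function concat;

3) Show a function's extended description

desc function extended concat;

demo:

select cast(1.5 as int) from employee;//cast converts types

For other built-in functions, see the Hive function manual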

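Both functions below are computed within each partition; percent_rank always starts from 0 for the first row of a partition

cume_dist() over(partition by id order by money)
 //((largest row number among equal values)/(row count))
 percent_rank() over(partition by id order by money)
 //((smallest row number among equal values-1)/(row count-1))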
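
The two formulas in the comments can be checked by hand with a small Java sketch. It works on one partition whose values are already sorted ascending; the class name and the sample values are invented for the example.

```java
import java.util.Arrays;

// Illustration only: applies the two formulas to one sorted partition.
class RankSketch {
    // cume_dist = (largest row number among equal values) / (row count)
    static double[] cumeDist(double[] sorted) {
        int n = sorted.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            int last = i;
            while (last + 1 < n && sorted[last + 1] == sorted[i]) {
                last++;
            }
            out[i] = (last + 1) / (double) n;
        }
        return out;
    }

    // percent_rank = (smallest row number among equal values - 1) / (row count - 1)
    static double[] percentRank(double[] sorted) {
        int n = sorted.length;
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            int first = i; // zero-based index, so it already equals "row number - 1"
            while (first > 0 && sorted[first - 1] == sorted[i]) {
                first--;
            }
            out[i] = (n == 1) ? 0.0 : first / (double) (n - 1);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] money = {100, 200, 200, 400};
        System.out.println(Arrays.toString(cumeDist(money)));
        System.out.println(Arrays.toString(percentRank(money)));
    }
}
```

For the sample values, the tied rows (200 and 200) share one cume_dist and one percent_rank, the first row has percent_rank 0, and the last row has cume_dist 1.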

Miscellaneous functions

They can call Java classes and methods

java_method(class,method[,arg1[,arg2..]])

reflect(class,method[,arg1[,arg2..]])//java_method and reflect are the same

eg:select java_method("java.lang.Math","sqrt",cast(id as double)) from employee;

UDTF

Table functions

lateralView: lateral view udtf(expression) tableAlias as columnAlias (',' columnAlias)*
fromClause: from baseTable (lateralView)*

Example: the explode function splits one row into several rows

eg:

select id ,adid from winfunc lateral view explode(split(type,'B')) tt as adid

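Regular expressions

The two examples below contrast greedy and non-greedy matching; the index follows the parenthesized groups, and 0 stands for the whole match

eg:select regexp_extract('979|7.10.80|8684','.*\\|(.*)',1) from employee limit 1;

Result: 8684

select regexp_extract('979|7.10.80|8684','(.*?)\\|(.*)',1) from employee limit 1;

Result: 979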
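
Since regexp_extract uses Java regular expression syntax, the two results can be reproduced with java.util.regex directly. The helper below is a simplified stand-in, not the Hive function itself; returning an empty string when nothing matches is a choice made for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a minimal analogue of regexp_extract(subject, pattern, index).
class RegexpExtractSketch {
    static String extract(String subject, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(subject);
        // Sketch's own choice: empty string when the pattern does not match.
        return m.find() ? m.group(group) : "";
    }

    public static void main(String[] args) {
        String s = "979|7.10.80|8684";
        // Greedy: the leading .* runs up to the last '|', so group 1 is the tail.
        System.out.println(extract(s, ".*\\|(.*)", 1));
        // Non-greedy: (.*?) stops at the first '|', so group 1 is the head.
        System.out.println(extract(s, "(.*?)\\|(.*)", 1));
        // Group 0 is the whole match.
        System.out.println(extract(s, "(.*?)\\|(.*)", 0));
    }
}
```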

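14. User-defined functions

UDF: user-defined function

Operates on a single record

Steps to create a function

1) Write a Java class

2) Extend the UDF class

3) Implement the evaluate method

4) Package it as a jar

5) Run add jar in Hive: add jar /usr/local/opt/jar.jar;

6) Create a temporary function in Hive: create temporary function bigthan as 'com.johnson.hive.udf.UdfTest';

7) Use it in HQL: select name1, bigthan(name1,500) from employee;

The test code is as follows: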

package com.johnson.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class UdfTest extends UDF {
    /**
     * Custom evaluate method. The method name is fixed; the parameters and
     * return type change with the project's requirements.
     * Returns true if t1 > t2, otherwise false.
     */
    public boolean evaluate(Text t1, Text t2) {
        if (t1 == null || t2 == null) {
            return false;
        }
        try {
            double d1 = Double.parseDouble(t1.toString());
            double d2 = Double.parseDouble(t2.toString());
            return d1 > d2;
        } catch (NumberFormatException e) {
            // A value that is not numeric cannot be compared.
            return false;
        }
    }
}

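UDAF: user-defined aggregation function

Operates on a set of records

General development steps:

1) Write a resolver class; the resolver handles type checking and operator overloading

2) Write an evaluator class; the evaluator implements the actual UDAF logic

Typically the top-level UDAF class implements org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2, and a nested evaluator class inside it implements the UDAF logic

I. Implementing the resolver

A resolver usually implements org.apache.hadoop.hive.ql.udf.generic.GenericUDAFResolver2, but extending AbstractGenericUDAFResolver is recommended because it insulates the code from future changes to the Hive interfaces. The difference between GenericUDAFResolver and GenericUDAFResolver2 is that the latter lets the evaluator implementation access more information, such as the distinct qualifier and the wildcard form function(*)

II. Implementing the evaluator

Every evaluator must extend the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.

Subclasses must implement its abstract methods to provide the UDAF logic.

Mode: this class is important. It represents the stages a UDAF passes through in MapReduce; understanding Mode makes the UDAF execution flow clear.

The source code is: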

public static enum Mode {
    PARTIAL1,
    PARTIAL2,
    FINAL,
    COMPLETE
}

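PARTIAL1: the map phase of MapReduce. From raw data to a partial aggregation; calls iterate() and terminatePartial()

PARTIAL2: the combiner phase on the map side, which merges map output on the map side. From partial aggregation to partial aggregation; calls merge() and terminatePartial()

FINAL: the reduce phase of MapReduce. From partial aggregation to the full aggregation; calls merge() and terminate()

COMPLETE: this mode means the MapReduce job has only a map phase and no reduce phase, so the map side produces the result directly. From raw data straight to the full aggregation; calls iterate() and terminate()

See the sum and count aggregate implementations in the Hive source for examples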
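
To see how the four method names fit together, here is a rough simulation of an average in plain Java. It does not use the GenericUDAFEvaluator API; the class, the Partial state object, and the two-split driver are all made up to mirror the PARTIAL1 then FINAL flow described above.

```java
// Illustration only: mimics the call sequence, not Hive's actual evaluator classes.
class AvgLifecycleSketch {
    // Partial aggregation state passed from the map side to the reduce side.
    static final class Partial {
        double sum;
        long count;
    }

    // iterate(): fold one raw input value into the state (PARTIAL1, COMPLETE).
    static void iterate(Partial state, double value) {
        state.sum += value;
        state.count++;
    }

    // terminatePartial(): hand out the partial state (PARTIAL1, PARTIAL2).
    static Partial terminatePartial(Partial state) {
        Partial copy = new Partial();
        copy.sum = state.sum;
        copy.count = state.count;
        return copy;
    }

    // merge(): fold another partial result into the state (PARTIAL2, FINAL).
    static void merge(Partial state, Partial other) {
        state.sum += other.sum;
        state.count += other.count;
    }

    // terminate(): produce the final value (FINAL, COMPLETE); null for empty input.
    static Double terminate(Partial state) {
        return state.count == 0 ? null : state.sum / state.count;
    }

    // Two simulated map tasks (PARTIAL1) followed by one reduce task (FINAL).
    static Double average(double[] split1, double[] split2) {
        Partial map1 = new Partial();
        for (double v : split1) {
            iterate(map1, v);
        }
        Partial map2 = new Partial();
        for (double v : split2) {
            iterate(map2, v);
        }
        Partial reduce = new Partial();
        merge(reduce, terminatePartial(map1));
        merge(reduce, terminatePartial(map2));
        return terminate(reduce);
    }

    public static void main(String[] args) {
        System.out.println(average(new double[]{1, 2, 3}, new double[]{4, 5}));
    }
}
```

The partial state carries both sum and count because averages of averages would be wrong; that is the reason merge() works on partial states rather than on finished results.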

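Permanent functions:

add jar in the Hive shell is effective only in the current shell; once the shell is closed and reopened, the temporary UDF is gone.

The following approaches make a function available permanently

1) To define a function in Hive that stays available permanently, modify the source code: add the function class, then add the registration code to ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java

registerUDF("parse_url",UDFParseUrl.class,false);

This approach is usually used when the cluster is first set up; it requires changing the Hive source code and recompiling and repackaging it

2) Create a hiverc file (the more commonly used approach)

$HOME/.hiverc //create a .hiverc file in the current user's $HOME directory (vim .hiverc or touch .hiverc)

Put the initialization statements in the file

Demo of initialization statements in the file:

-- add self functions //a comment
add jar /usr/local/opt/extenal_jar/jar.jar; //adds the jar file
create temporary function bigthan as 'com.johnson.hive.udf.UdfTest'; //registers the alias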

15. Hive SQL optimization

join optimization

set hive.optimize.skewjoin=true; set this to true if the join is skewed

set hive.skewjoin.key=100000; the number of records per join key; keys with more records than this value are optimized

mapjoin
set hive.auto.convert.join=true;
hive.mapjoin.smalltable.filesize defaults to 25 MB
select /*+mapjoin(A)*/ f.a,f.b from A t join B f on (f.a = t.a)

In short, mapjoin is suited to

1) joins in which one table is very small

2) non-equi joins

bucket join

Conditions:

Both tables are bucketed in the same way

The bucket counts of the two tables are multiples of each other

create table orders(cid int,price float) clustered by (cid) into 32 buckets;
create table customer(id int, first string) clustered by (id) into 32 buckets;
select price from orders t join customer s on t.cid = s.id

join optimization example

Before: select m.cid,u.id from orders m join customer u on m.cid = u.id where m.dt='2013-01-01';

After: select m.cid,u.id from (select cid from orders where dt='2013-01-01')m join customer u on m.cid = u.id;

Reason: Hive runs the join first and the where afterwards, which differs from the execution order in relational databases, so the query has to be written this way; the effect is especially visible on partitioned tables

group by optimization:

set hive.groupby.skewindata=true; set this to true if the group by is skewed

set hive.groupby.mapaggr.checkinterval=100000; once the number of records for a group key exceeds this value, the optimization is applied

count distinct

Before: select count(distinct id ) from tableName;

After: select count(1) from (select distinct id from tableName) tmp;

select count(1) from (select id from tableName group by id) tmp;

Hive SQL optimization

Before:
select a,sum(b),count(distinct c),count(distinct d) from test group by a;
After:
select a,sum(b) as b,count(c) as c,count(d) as d
from (
select a,0 as b,c, null as d from test group by a,c
union all select a,0 as b,null as c,d from test group by a,d
union all select a,b,null as c,null as d from test) tmp1
group by a;

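16. Hive optimization

Goal:

Improve execution efficiency with limited resources

Common problems:

Data skew

Setting the number of maps

Setting the number of reduces

Others

Hive execution order: HQL -> Job -> MapReduce

Execution plan:

View the execution plan: explain [extended] hql

demo:
select col,count(1) from test2 group by col;
explain select col,count(1) from test2 group by col;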

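17. Hive table optimization

Partitioning

Static partitions

Dynamic partitions

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

Bucketing

set hive.enforce.bucketing=true;

set hive.enforce.sorting=true;

Data

Keep identical data together as much as possible (this lowers the network load)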

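18. Hive MapReduce optimization

Job optimization

Parallel execution

Hive turns each query into several stages; stages that do not depend on each other can run in parallel, which shortens the run time

set hive.exec.parallel=true;

set hive.exec.parallel.thread.number=8;

Local execution

set hive.exec.mode.local.auto=true;

A job actually runs in local mode only when it meets all of the following conditions:

1) The job's input size is smaller than

hive.exec.mode.local.auto.inputbytes.max (default 128 MB)

2) The job's number of maps is smaller than

hive.exec.mode.local.auto.tasks.max (default 4)

3) The job's number of reduces is 0 or 1

Merging small input files

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat

The number of files merged is governed by the size limit mapred.max.split.size

Merging small output files

set hive.merge.smallfiles.avgsize=256000000; when the average output file size is below this value, an extra job is started to merge the files

set hive.merge.size.per.task=64000000; the size of the merged files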

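JVM reuse

set mapred.job.reuse.jvm.num.tasks=20;

With JVM reuse, a job holds on to its slots until it finishes, which pays off for jobs with many tasks and many small files by cutting the run time. The value must not be set too high, though: some jobs have reduce tasks, and until those finish, the slots held by the map tasks cannot be released, so other jobs may have to wait.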

Data compression

Intermediate compression:

Intermediate compression applies to the data passed between the multiple jobs of a Hive query; for it, choose a codec that is cheap in CPU time.

set hive.exec.compress.intermediate=true;

set hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

set hive.intermediate.compression.type=BLOCK;

The final output of a Hive query can be compressed as well

set hive.exec.compress.output=true;

set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

set mapred.output.compression.type=BLOCK;

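Map optimization

set mapred.map.tasks=10; sometimes has no effect, because the number of maps is computed as follows:

1) Default number of maps

default_num = total_size/block_size;

2) Desired number

goal_num = mapred.map.tasks;

3) Size of the data each map processes

split_size = max(mapred.min.split.size,block_size);

split_num = total_size/split_size;

4) Computed number of maps

compute_map_num = min(split_num,max(default_num,goal_num))

From the analysis above, setting the number of maps comes down to the following:

1) To increase the number of maps, set mapred.map.tasks to a larger value.

2) To decrease the number of maps, set mapred.min.split.size to a larger value.

Case 1: the input is huge but does not consist of small files

Increase mapred.min.split.size

Case 2: the input consists of a huge number of small files, each smaller than blockSize. Increasing mapred.min.split.size does not help here; use CombineFileInputFormat to combine multiple input paths into one inputSplit per mapper, which reduces the number of mappers.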
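
The four formula steps in this section can be transcribed directly into Java to try out numbers. This sketch follows the formulas exactly as written here; the real split computation in Hadoop takes more inputs, so treat the results as an approximation. The class name and the sample sizes are made up.

```java
// Illustration only: a literal transcription of the four steps given in this section.
class MapCountSketch {
    static long computeMapNum(long totalSize, long blockSize, long minSplitSize, long goalNum) {
        long defaultNum = totalSize / blockSize;              // step 1
        long splitSize = Math.max(minSplitSize, blockSize);   // step 3
        long splitNum = totalSize / splitSize;
        return Math.min(splitNum, Math.max(defaultNum, goalNum)); // step 4; goalNum is step 2
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 10240 * mb; // 10 GB of input
        long block = 128 * mb;
        // mapred.map.tasks=10 has no effect: the default of 80 maps still wins.
        System.out.println(computeMapNum(total, block, 1, 10));
        // Raising mapred.min.split.size to 256 MB halves the number of maps.
        System.out.println(computeMapNum(total, block, 256 * mb, 10));
    }
}
```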

Map-side aggregation

set hive.map.aggr=true;//acts like a combiner

Speculative execution

mapred.map.tasks.speculative.execution

19. Hive shuffle optimization

Map side
io.sort.mb
io.sort.spill.percent
min.num.spills.for.combine
io.sort.factor
io.sort.record.percent
Reduce side
mapred.reduce.parallel.copies
mapred.reduce.copy.backoff
io.sort.factor
mapred.job.shuffle.input.buffer.percent
mapred.job.reduce.input.buffer.percent

20. Hive reduce optimization

Queries that need a reduce phase

Aggregate functions

sum/count/distinct/...

Advanced queries

group by, join, distribute by, cluster by ..

order by is a special case and needs only one reducer

Speculative execution

1)mapred.reduce.tasks.speculative.execution

2)hive.mapred.reduce.tasks.speculative.execution

Either of the two settings works

reduce optimization
set mapred.reduce.tasks=10;//sets the number directly
hive.exec.reducers.max default: 999
hive.exec.reducers.bytes.per.reducer the amount of data each reducer processes, default: 1G
Formula
numRTasks = min[maxReducers,input.size/perReducer] //how the number of reducers is computed
maxReducers = hive.exec.reducers.max
perReducer = hive.exec.reducers.bytes.per.reducer
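
As a quick way to plug numbers into the reducer formula, here is a small Java sketch. Rounding the division up and never returning fewer than one reducer are assumptions added for the example; the formula in the text only states the min of the two terms.

```java
// Illustration only: numRTasks = min(maxReducers, input.size / perReducer).
class ReducerCountSketch {
    static int estimate(long inputSize, long bytesPerReducer, int maxReducers) {
        // Assumption: round up so that a partial chunk still gets a reducer.
        long byRatio = (inputSize + bytesPerReducer - 1) / bytesPerReducer;
        // Assumption: at least one reducer, even for empty input.
        return (int) Math.max(1, Math.min(maxReducers, byRatio));
    }

    public static void main(String[] args) {
        long gb = 1_000_000_000L; // 1G per reducer, as in the default above
        System.out.println(estimate(10_500_000_000L, gb, 999)); // 10.5 GB of input
        System.out.println(estimate(5_000L * gb, gb, 999));     // capped by hive.exec.reducers.max
    }
}
```

When mapred.reduce.tasks is set explicitly, that value is used directly and this estimate does not apply.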

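21. Data warehouses aggregated from different sources

Regarding the content:

1) Process each data source separately

2) Convert the different data formats into one format

3) Unify the fields of data from different sources

4) Use collection types for fields that cannot be unified

5) Use partitions for data from different sources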

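This article is reposted content; we respect the original author's copyright in it. In case of errors or infringement, the original author is welcome to contact us to have the content corrected or the article removed.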
