Hive进行分区分桶

原创

大数据同盟会 2022-01-12 15:57:19 博主文章分类：Hive ©著作权

©著作权归作者所有：来自51CTO博客作者大数据同盟会的原创作品，请联系作者获取转载授权，否则将追究法律责任

一、Hive建表语法

create  [external] table [ if not exists] table_name 
   [(col_name data_type [comment col_comment], ...)] 
   [comment table_comment] 
   [partitioned by (col_name data_type [comment  col_comment], ...)] 

   [clustered  by  (col_name, col_name, ...) 
   [sorted by  (col_name [asc|desc], ...)] into  num_buckets buckets ] 

   [row format row_format]  
   [stored as  file_format] 
   [location  hdfs_path]

说明：

1、 create table 创建一个指定名字的表。如果相同名字的表已经存在，则抛出异常；用户可以用 if not exists 选项来忽略这个异常。

2、 external关键字可以让用户创建一个外部表，在建表的同时指定一个指向实际数据的路径（location），Hive 创建内部表时，会将数据移动到数据仓库指向的路径；若创建外部表，仅记录数据所在的路径，不对数据的位置做任何改变。在删除表的时候，内部表的元数据和数据会被一起删除，而外部表只删除元数据，不删除数据。

3、like 允许用户复制现有的表结构，但是不复制数据。

4、row format

delimited [fields terminated by char]
[collection items terminated by char ]
[map keys terminated by char] 
[lines terminated by char] | serde serde_name 
[with serdeproperties (property_name=property_value, property_name=property_value, ...)]

指定行分割符，row format delimited fields terminated by “分隔符”，用户在建表的时候可以自定义 SerDe 或者使用自带的 SerDe。如果没有指定 row format或者 row format delimited，将会使用自带的 SerDe。

5、stored as

sequence | text  | rc  |orc | parquet

如果不写，默认是 stored as textfile。如果数据需要压缩，使用 stored as sequencefile 。兼容性更好的是parquet 。

6、location

若是不指定位置，hive表会建立在user/hive/warehouse/表名，hive的表会加载user/hive/warehouse/下的所有文件。

7、clustered by

对于每一个表（table）或者分区， Hive可以进一步组织成桶，也就是说桶是更为细粒度的数据范围划分。Hive也是针对某一列进行桶的组织。Hive采用对列值哈希，然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中。

把表（或者分区）组织成桶（Bucket）有两个理由：

（1）获得更高的查询处理效率。桶为表加上了额外的结构，Hive 在处理有些查询时能利用这个结构。具体而言，连接两个在（包含连接列的）相同列上划分了桶的表，可以使用 Map 端连接（Map-side join）高效的实现。比如JOIN操作。对于JOIN操作两个表有一个相同的列，如果对这两个表都进行了桶操作。那么将保存相同列值的桶进行JOIN操作就可以，可以大大较少JOIN的数据量。

（2）使取样（sampling）更高效。在处理大规模数据集时，在开发和修改查询的阶段，如果能在数据集的一小部分数据上试运行查询，会带来很多方便。

create table my_hive (
   id   int,
   name string
) 
clustered  by (id)
into 4 buchets
row format delimited fields terminated by '\t' ;

二、Hive数据类型

1、基本数据类型

Hive进行分区分桶_数据

2、集合数据类型

Hive进行分区分桶_1024程序员节_02

创建map类型：

create table custom(
   id int,
   name string,
   info map<string,string>
)
row format delimited fields terminated by ','
collection items terminated by '‐'
map keys terminated by ':';

创建struct类型：

create table stu(
   id int,
   stu_info struct<name:String,age:int,sex:string>
)
row format delimited fields terminated by ','
collection items terminated by ':';

创建array类型：

create table movie(
   id int,
   name string,
   actors array<string>
)
row format delimited fields terminated by ','
collection items terminated by ':';

分区针对的是数据的存储路径，分桶针对的是数据文件；

三、hive分区（partitioned by）

hive中做查询时，一般会扫描整个表内容，会消耗很多时间，因此引入分区概念，只查询关心的一部分数据。

静态分区：将数据直接导入到指定分区中： load
动态分区：根据某个字段判断，将数据插入到分区中

1、一级静态分区：

create table person (
  a int, 
  b int,
  c int
) 
partitioned by (country string)  
row format delimited fields terminated by '\t' ;

2、二级静态分区：

create table person (
  a int, 
  b int,
  c int
) 
partitioned by (country string, province string)  
row format delimited fields terminated by '\t' ;

3、动态分区：

1、创建一个普通表，加载所有数据

2、开启动态分区功能

3、创建分区表

4、通过 insert select 方式导入数据

4、修改分区

alter table 表名 add partition(dt=’09’)   --增加分区

  alter table 表名 drop partition(dt=’09’)   --删除分区

Hive是做批量离线数据分析的，不是做数据库用的，不适合做一条一条数据的插入，没有更新数据操作，数据是直接拿过来读而不会去改。

四、hive分桶（clustered by ）

分桶就是把大表化成小表，Hive进行分桶可以提高join的效率。

1、创建一个普通表，加载数据

2、开启分桶表

set hive.enforce.bucketing = true;
set mapreduce.job.reduces=4;

3、创建分桶表

create table person (
    a int,
    b int, 
    c int
) 
clustered by (a)  
into 4 buckets 
row format delimited fields terminated by '\t' ;

4、insert 插入普通表数据到分桶表

保存select查询结果的几种方式：

1、将查询结果保存到一张新的hive表中；

create table t_tmp  as  select * from t_p;

2、将查询结果保存到一张已存在的hive表中；

insert into  table t_tmp  select * from t_p;

3、将查询结果保存到指定文件目录（本地或者hdfs）。

insert overwrite local directory '/home/hadoop/test'   select * from t_p;

五、hive基本语法

1、查看表结构

desc  表名
desc  formatted  表名    (更详细)

2、load导入数据

load  data[local] inpath   'filepath' [overwrite] into table  tablename [partition (partcol1=val1, partcol2=val2 ...)]

例如：

load data inpath '/user/hive/warehouse/03_basal_layer_zfybxers00.db/zfybxers00_z_rma_cost_detail_m/importdate=20210923' into table 03_basal_layer_zfybxers00.zfybxers00_z_rma_cost_detail_m  partition (importdate="20210923")

3、数据导入导出

import table  表名  from  "filepath"    --表要求不存在，用于数据备份迁移

export table 表名  to  "filepath"

4、查询结果插入Hive表

insert  overwrite  table  tablename1 [partition (partcol1=val1,partcol2=val2 ...)] select_statement1 from  from_statement

5、shell方式

hive -e "sql语句"  >>  本地文件

hive  -f  文件 >/>> 文本文件

6、命令导出

hdfs  dfs  -get  表中文件

hive> dfs -get -ls

7、select查询

select [all   | distinct] select_expr, select_expr, ... 
from  table_reference
[where where_condition] 
[group by col_list [having condition]] 
[cluster by  col_list  | [ distribute by   col_list] [sort by | order by  col_list] 
] 
[limit number]

1、order by：会对输入做全局排序，会导致当输入规模较大时，需要较长的计算时间。

2、sort by：区内排序，其在数据进入reducer前完成排序。因此，如果用sort by进行排序，并且设置mapred.reduce.tasks>1，则sort by只保证每个reducer的输出有序，不保证全局有序。

3、distribute by ：指定分区字段。

4、Cluster by：当分区字段和排序字段，是同一个字段做升序时，cluster by = distribute by + sort by

8、where和having区别

where 放在from之后，执行的是原始数据；

having 放在group by之后，执行的是分组后数据

六、hive中各种join

申请表

00001,p01
00002,p02
00003,p03
00004,p04

人员表

p01,00001,张丽,副高
p02,00002,李三,处长
p04,00002,王六,专责
p03,00003,赵五,员工
p05,00003,钱七,员工
p06,00003,周九,员工
p07,00005,孙二,员工

1、join on（内连接）

select 
   a.apply_id, a.person_id,
   b.person_name,b.person_level
from 
   base_apply a  join  renyuan b on (a.apply_id=b.apply_id)

查询结果：

00001   p01     张丽    副高
00002   p02     李三    处长
00002   p02     王六    专责
00003   p03     赵五    员工
00003   p03     钱七    员工
00003   p03     周九    员工

2、inner join on(内连接)

select 
   a.apply_id, a.person_id,
   b.person_name,b.person_level
from 
   base_apply a  inner join  renyuan b on (a.apply_id=b.apply_id)

查询结果：

00001   p01     张丽    副高
00002   p02     李三    处长
00002   p02     王六    专责
00003   p03     赵五    员工
00003   p03     钱七    员工
00003   p03     周九    员工

3、left join on(左连接)

select 
   a.apply_id, a.person_id,
   b.person_name,b.person_level
from 
   base_apply a  left join  renyuan b on (a.apply_id=b.apply_id)

查询结果：

00001   p01     张丽    副高
00002   p02     李三    处长
00002   p02     王六    专责
00003   p03     赵五    员工
00003   p03     钱七    员工
00003   p03     周九    员工
00004   p04     NULL    NULL

4、full outer join on（两边数据都会出来）

select 
   a.apply_id, a.person_id,
   b.person_name,b.person_level
from 
   base_apply a  full outer join   renyuan b on (a.apply_id=b.apply_id)

查询结果：

00001   p01     张丽    副高
00002   p02     王六    专责
00002   p02     李三    处长
00003   p03     周九    员工
00003   p03     钱七    员工
00003   p03     赵五    员工
00004   p04     NULL    NULL
NULL    NULL    孙二    员工

5、left semi join on (in/exists 子查询)

select 
   a.apply_id, a.person_id
from 
   base_apply a  left semi join  renyuan b on (a.apply_id=b.apply_id)

查询结果：

00001   p01
00002   p02
00003   p03

6、union

union:        自动去除重复数据
union all :   所有数据

七、 Hive自定义函数（UDF）

1、导人hive-exec包

2、继承UDF，并重载evaluate

package cn.itcast.bigdata.udf;
import java.util.HashMap;
import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {

  // 必须是public
  public String evaluate(String field) {
    String result = field.toLowerCase();
    return result;
  }
}

3、打成jar包上传到服务器

4、将jar包添加到hive的classpath

hive>add JAR /home/hadoop/udf.jar;

5、创建临时函数与开发好的java class关联

Hive>create temporary function tolowercase as 'cn.itcast.bigdata.udf.ToProvince';

6、在hql中使用自定义的函数

7、hive实现号码归属地

1、编写ToLowerCase类

package cn.itcast.bigdata.udf;
import java.util.HashMap;
import org.apache.hadoop.hive.ql.exec.UDF;

public class ToLowerCase extends UDF {
  public static HashMap<String, String> provinceMap = new HashMap<String, String>();
    static {
      provinceMap.put("136", "beijing");
      provinceMap.put("137", "shanghai");
      provinceMap.put("138", "shenzhen");
    }

  // 必须是public
  public String evaluate(int phonenbr) {

      String pnb = String.valueOf(phonenbr);
      return provinceMap.get(pnb.substring(0, 3)) == null ? "huoxing":provinceMap.get(pnb.substring(0,3));

    }
}

2、打成jar包上传到服务器

3、将jar包添加到hive的classpath

hive>add JAR /home/hadoop/udf.jar;

4、创建临时函数与开发好的java class关联

Hive>create temporary function tolowercase as 'cn.itcast.bigdata.udf.ToProvince';

5、写hive sql 查询

Hive进行分区分桶_1024程序员节_03

8、hive解析json数据

方式一：自定义函数解析

Hive进行分区分桶_json_04

1、编写MovieRateBean类

package cn.itcast.bigdata.udf;
//{"movie":"1721","rate":"3","timeStamp":"965440048","uid":"5114"}
public class MovieRateBean {

  private String movie;
  private String rate;
  private String timeStamp;
  private String uid;
  public String getMovie() {
    return movie;
  }
  public void setMovie(String movie) {
    this.movie = movie;
  }
  public String getRate() {
    return rate;
  }
  public void setRate(String rate) {
    this.rate = rate;
  }
  public String getTimeStamp() {
    return timeStamp;
  }
  public void setTimeStamp(String timeStamp) {
    this.timeStamp = timeStamp;
  }
  public String getUid() {
    return uid;
  }
  public void setUid(String uid) {
    this.uid = uid;
  }
  @Override
  public String toString() {
    return movie + "\t" + rate + "\t" + timeStamp + "\t" + uid;
  }

}

2、编写JsonParser

package cn.itcast.bigdata.udf;
import org.apache.hadoop.hive.ql.exec.UDF;
import parquet.org.codehaus.jackson.map.ObjectMapper;

public class JsonParser extends UDF {

  public String evaluate(String jsonLine) {

    ObjectMapper objectMapper = new ObjectMapper();

    try {
      MovieRateBean bean = objectMapper.readValue(jsonLine, MovieRateBean.class);
      return bean.toString();
    } catch (Exception e) {

    }
    return "";
  }

}

3、打成jar包上传到服务器

4、将jar包添加到hive的classpath

hive>add JAR /home/hadoop/udf.jar;

5、创建临时函数与开发好的java class关联

Hive>create temporary function tolowercase as 'cn.itcast.bigdata.udf.ToProvince';

6、写hive sql 查询

线创建一个t_json表，再把结果放到t_reting表里，实现把结果放到四个字段里

Hive进行分区分桶_json_05

方式二：引入插件

1、把插件json-serde-1.3.8.jar放入hive的lib下

2、建表时导入插件

DROP TABLE IF EXISTS ODS17.APP_ACTION_LOG;
CREATE EXTERNAL TABLE ODS17.APP_ACTION_LOG(
account           string ,  -- 用户登录账号
appId             string ,  -- 用户所使用的app
appVersion        string ,  -- app的版本
carrier           string ,  -- 运营商
deviceId          string ,  -- 设备id
deviceType        string ,  -- 设备型号
eventId           string ,  -- 事件id
ip                string ,  -- ip地址
latitude          double ,  -- 纬度
longitude         double ,  -- 经度
netType           string ,  -- 网络类型（4g，5g，wifi）
osName            string ,  -- 操作系统名称
osVersion         string ,  -- 操作系统版本
releaseChannel    string ,  -- app的发布下载渠道
resolution        string ,  -- 屏幕分辨率
sessionId         string ,  -- 会话id
`timeStamp`       bigint ,  -- 事件时间戳
properties        map<string,string> -- 事件详细属性
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
;

3、导入数据

load data inpath '/logdata/app/2020-10-07' into table ODS17.APP_ACTION_LOG partition (dt='2020-10-07');

方式三：使用json_tuple() 函数

例如：select json_tuple(json,'movie','rate','timeStamp','uid') as(movie,rate,ts,uid) from t_json;

8、transform案例：时间转换为星期

实现Hive中没有的功能又不想写UDF的情况

1、先加载rating.json文件到hive的一个原始表 rat_json

create table rat_json(line string) row format delimited;
load data local inpath '/home/hadoop/rating.json' into table rat_json;

2、需要解析json数据成四个字段，插入一张新的表 t_rating

insert overwrite table t_rating
select get_json_object(line,'$.movie') as moive,get_json_object(line,'$.rate') as rate  from rat_json;

3、编辑一个python脚本文件

vi weekday_mapper.py
#!/bin/python
import sys
import datetime

for line in sys.stdin:
  line = line.strip()
  movieid, rating, unixtime,userid = line.split('\t')
  weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
  print '\t'.join([movieid, rating, str(weekday),userid])

4、文件加入hive的classpath

hive>add FILE /home/hadoop/weekday_mapper.py;

5、执行sql

hive>create TABLE u_data_new as
SELECT
  TRANSFORM (movieid, rate, timestring,uid)
  USING 'python weekday_mapper.py'
  AS (movieid, rate, weekday,uid)
FROM t_rating;

6、查询

hive>select distinct(weekday) from u_data_new limit 10;

Hive进行分区分桶_hive_06

9、hive面试题（累计报表）

Hive进行分区分桶_数据_07

select A.username,A.month,max(A.salary) as salary,sum(B.salary) as accumulate
from 
(select username,month,sum(salary) as salary from t_access_times group by username,month) A 
inner join 
(select username,month,sum(salary) as salary from t_access_times group by username,month) B
on   A.username=B.username
where 
B.month <= A.month   group by A.username,A.month   order by A.username,A.month;

更多java、大数据学习面试资料，请扫码关注我的公众号：

Hive进行分区分桶_hive_08