Hive update: a project on updating Hive partitioned tables

Reposted

mob6454cc6bf0b7 2023-09-01 21:10:24

Tags: hive update, python, big data, data, hive. Category: Hive, big data

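1. Loading data

1.1 Basic syntax:
load data [local] inpath 'path' [overwrite] into table table_name [partition(partitionfield='xx')];
1.2 What it does: it copies or moves the data from the path given by INPATH into the directory of the table or of the partition.

If the data is on the local file system (LOCAL INPATH), the data is copied.
If the data is already on HDFS (INPATH), the data is moved.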
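
The two cases can be sketched as follows. The table t_access, its partition column dt and both file paths are hypothetical names used only for illustration.

```sql
-- Hypothetical partitioned table used only for this illustration.
create table if not exists t_access(ip string, url string)
partitioned by (dt string)
row format delimited fields terminated by ',';

-- LOCAL INPATH: the file is copied from the local file system,
-- so /root/hivedata/access.log (an assumed path) stays where it is.
load data local inpath '/root/hivedata/access.log'
into table t_access partition(dt='2023-09-01');

-- INPATH without LOCAL: the file is moved within HDFS, so
-- /tmp/staging/access.log (an assumed path) no longer exists afterwards.
-- OVERWRITE replaces whatever the target partition already contains.
load data inpath '/tmp/staging/access.log'
overwrite into table t_access partition(dt='2023-09-02');
```

Because the HDFS variant moves the file, keep a copy in the staging location if another job still needs it.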

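2. Dynamic partitioning

2.1 Meaning: with dynamic partitioning, rows returned by a select are inserted into different partitions of another table according to the value of one of the selected columns.
2.2 Example: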

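First create an ordinary (non-partitioned) student details table, t_stu
  	create table t_stu(sno int,sname string,sex string,sage string);

  Create a student basic-information table, t_stu_baseinfo
  	create table t_stu_baseinfo(sno int,sname string,sex string) partitioned by(sage string);

  Select a few columns from the details table and insert them into t_stu_baseinfo, placing each row in a partition according to its sage value.
  Note: to use dynamic partitioning, the dynamic-partition parameters must be set first

  	hive> set hive.exec.dynamic.partition=true;
  	hive> set hive.exec.dynamic.partition.mode=nonstrict;
  	hive> insert into table t_stu_baseinfo partition(sage) select sno,sname,sex,sage from t_stu;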

  Check the result

  	hive>  show partitions t_stu_baseinfo;
  	OK
  	sage=17
  	sage=18
  	sage=19
  	sage=20
  	sage=21
  	sage=22
  	sage=23
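
One detail worth spelling out: Hive matches dynamic partition columns by position, not by name, so the partition column has to come last in the select list. A minimal sketch, assuming hypothetical tables stu_src and stu_by_age that are not part of the original example:

```sql
-- Hypothetical source table: age is an ordinary column here.
create table if not exists stu_src(sno int, sname string, sex string, sage string);

-- Hypothetical target table: age is the partition column.
create table if not exists stu_by_age(sno int, sname string, sex string)
partitioned by (sage string);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The last expression in the select list feeds the sage partition,
-- whatever it is called; the alias below is only for readability.
insert into table stu_by_age partition(sage)
select sno, sname, sex, sage as partition_value
from stu_src;

-- One row per partition with its row count, to cross-check "show partitions".
select sage, count(*) as cnt from stu_by_age group by sage;
```

If the select list put sage anywhere but last, a different column's values would silently become the partition names.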

3. About JOIN

3.1 Join types:
Common to SQL: INNER JOIN (JOIN), LEFT JOIN (LEFT OUTER JOIN), RIGHT JOIN (RIGHT OUTER JOIN), FULL OUTER JOIN. Hive-specific: LEFT SEMI JOIN.
3.2 Example:

Prepare the data: the first block below is a.txt, the second is b.txt
  1,a
  2,b
  3,c
  4,d
  7,y
  8,u

  2,bb
  3,cc
  7,yy
  9,pp

3.3 Create the tables:

create table a(id int,name string)
  row format delimited fields terminated by ',';

  create table b(id int,name string)
  row format delimited fields terminated by ',';

3.4 Load the data:

load data local inpath '/root/hivedata/a.txt' into table a;
  load data local inpath '/root/hivedata/b.txt' into table b;

  Experiments:
  ** inner join ==> shows only the rows that match on both sides
  select a.*,b.* from a inner join b on a.id=b.id;
  +-------+---------+-------+---------+--+
  | a.id  | a.name  | b.id  | b.name  |
  +-------+---------+-------+---------+--+
  | 2     | b       | 2     | bb      |
  | 3     | c       | 3     | cc      |
  | 7     | y       | 7     | yy      |
  +-------+---------+-------+---------+--+

  **left join  ==> shows every row of table a; columns from b are NULL where there is no match
  select * from a left join b on a.id=b.id;
  +-------+---------+-------+---------+--+
  | a.id  | a.name  | b.id  | b.name  |
  +-------+---------+-------+---------+--+
  | 1     | a       | NULL  | NULL    |
  | 2     | b       | 2     | bb      |
  | 3     | c       | 3     | cc      |
  | 4     | d       | NULL  | NULL    |
  | 7     | y       | 7     | yy      |
  | 8     | u       | NULL  | NULL    |
  +-------+---------+-------+---------+--+

  **right join ==> shows every row of table b; columns from a are NULL where there is no match
  select * from a right join b on a.id=b.id;
  +-------+---------+-------+---------+--+
  | a.id  | a.name  | b.id  | b.name  |
  +-------+---------+-------+---------+--+
  | 2     | b       | 2     | bb      |
  | 3     | c       | 3     | cc      |
  | 7     | y       | 7     | yy      |
  | NULL  | NULL    | 9     | pp      |
  +-------+---------+-------+---------+--+

  **full outer join  ==> shows every row from both tables
  select * from a full outer join b on a.id=b.id;
  +-------+---------+-------+---------+--+
  | a.id  | a.name  | b.id  | b.name  |
  +-------+---------+-------+---------+--+
  | 1     | a       | NULL  | NULL    |
  | 2     | b       | 2     | bb      |
  | 3     | c       | 3     | cc      |
  | 4     | d       | NULL  | NULL    |
  | 7     | y       | 7     | yy      |
  | 8     | u       | NULL  | NULL    |
  | NULL  | NULL    | 9     | pp      |
  +-------+---------+-------+---------+--+

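  **Hive's special left semi join  ==> returns the rows of the left table that have a match in the right table
  select * from a left semi join b on a.id = b.id;
  The result is the left-table part of the rows that matched in the left join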
  +-------+---------+--+
  | a.id  | a.name  |
  +-------+---------+--+
  | 2     | b       |
  | 3     | c       |
  | 7     | y       |
  +-------+---------+--+
  This is equivalent to
  select * from a where exists (select 1 from b where b.id = a.id);
  which can be very inefficient in Hive
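
Two properties of left semi join are easy to trip over: only columns of the left table may appear in the select list, and a left row is returned once no matter how many rows of b match it. A sketch using the tables a and b from this section. The extra row inserted into b is an assumption made purely to show the difference, and INSERT ... VALUES needs Hive 0.14 or later.

```sql
-- Assumed extra row: b now holds two rows with id = 2.
insert into table b values (2, 'bbb');

-- Inner join: the a-row with id = 2 is repeated once per matching b-row.
select a.id, a.name from a join b on a.id = b.id;

-- Left semi join: the a-row with id = 2 appears a single time.
select a.id, a.name from a left semi join b on a.id = b.id;

-- Not allowed: columns of the right table cannot be selected in a semi join.
-- select a.id, b.name from a left semi join b on a.id = b.id;
```

Columns of b may only be referenced in the ON clause, which is why the last query is left commented out.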
  --------------------------------------------------------------------------------

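4. About group-by queries

A group-by query differs from a bucketed query: it returns exactly one row per group. A bucketed query only distributes rows into buckets by hash(key) % number of reducers, so the total number of rows is unchanged. Each bucket can hold records for several people, and each person can have several records. Take each person's monthly internet traffic as an example: after bucketing, all of Xiao Ming's records are guaranteed to be in the same bucket, but that bucket may also contain records for Xiao Dong and Lao Zhang.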

+-----------------------+--------------------+---------------------+--+
| usermag_tab.username  | usermag_tab.month  | usermag_tab.salary  |
+-----------------------+--------------------+---------------------+--+
| A                     | 2015-01            | 5                   |
| A                     | 2015-01            | 15                  |
| B                     | 2015-01            | 5                   |
| A                     | 2015-01            | 8                   |
| B                     | 2015-01            | 25                  |
| A                     | 2015-01            | 5                   |
| A                     | 2015-02            | 4                   |
| A                     | 2015-02            | 6                   |
| B                     | 2015-02            | 10                  |
| B                     | 2015-02            | 5                   |
+-----------------------+--------------------+---------------------+--+
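
The bucketing behaviour described above could be sketched against this table with distribute by. This is an illustrative query, not one from the original article, and it only spreads rows across buckets when more than one reducer is configured.

```sql
-- Rows are routed to reducers by hash(username) % number of reducers.
-- Every input row is still returned: nothing is aggregated.
select username, month, salary
from usermag_tab
distribute by username
sort by month;
```

All rows for A end up in one reducer's output, possibly alongside the rows for B, and the result still contains all 10 rows of the table.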

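First set the number of reducers. Hadoop's default is 1; Hive's own default of -1 lets Hive estimate the number from the input size.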

set mapreduce.job.reduces=5;

Now total the values for each user in each month.

select username,month,sum(salary) as salary from usermag_tab group by username,month;

To accumulate the total for each user without splitting by month, write:

select username,max(month) as month,sum(salary) as salary from usermag_tab group by username;

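Note: the following form fails, because month is neither in the group by clause nor aggregated. Each group can contain several month values, so one of them must be chosen explicitly.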

select username,month,sum(salary) as salary from usermag_tab group by username;

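A group-by query cannot deduplicate whole rows, because the other columns have several values per group and cannot be matched one-to-one. The usual approach is to number the rows with row_number, rank or dense_rank and then take the top N by that number. Taking the top 1 gives deduplication.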
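
A minimal sketch of the top-1 approach on usermag_tab. The choice of ordering (largest salary first, then latest month) is an assumption for illustration; any ordering that expresses which row to keep works the same way.

```sql
-- Number each user's rows, largest salary first; ties still get distinct numbers.
-- Keeping rn = 1 leaves exactly one full row per user.
select username, month, salary
from (
  select username, month, salary,
         row_number() over (partition by username
                            order by salary desc, month desc) as rn
  from usermag_tab
) t
where rn = 1;
```

For the sample data this keeps (A, 2015-01, 15) and (B, 2015-01, 25). With rank or dense_rank, tied rows share the same number, so filtering on 1 can return more than one row per user; row_number is the one that guarantees a single row.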

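5. About multi-insert

The source table is scanned once and each insert clause takes its own subset. The whole construct is one statement, so only the final select ends with a semicolon.

from student
insert into table student_p partition(part='a')
select * where Sno<95011
insert into table student_p partition(part='b')
select * where Sno>95011;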
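
For comparison, the same result written as two independent statements reads student twice. A sketch, assuming student_p already exists with a part partition column and the same non-partition columns as student:

```sql
-- Statement 1: scans student once.
insert into table student_p partition(part='a')
select * from student where Sno < 95011;

-- Statement 2: scans student a second time.
insert into table student_p partition(part='b')
select * from student where Sno > 95011;
```

With these predicates a row whose Sno equals 95011 lands in neither partition; use <= or >= on one side if it should be kept.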

6. Exporting data to the local file system

insert overwrite local directory '/home/hadoop/student.txt'   select * from student;
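
Note that the target is treated as a directory, not a file: Hive writes one or more result files into it, and overwrite replaces what was there. By default the fields are separated by the Ctrl-A character (\001). A hedged sketch that writes comma-separated output instead, using the row format clause available since Hive 0.11; the directory /home/hadoop/student_export is an assumed name:

```sql
-- Writes the query result as comma-separated text files
-- into the (assumed) local directory student_export.
insert overwrite local directory '/home/hadoop/student_export'
row format delimited fields terminated by ','
select * from student;
```

Point the statement at a dedicated directory, since existing contents of the target are replaced.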

7. Local mode (useful when running demos locally)

set hive.exec.mode.local.auto=true;
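
With this flag on, Hive still only chooses local mode for small inputs. The sketch below shows the two thresholds that, to my understanding, control that decision in recent Hive versions; the values are what I believe to be the defaults and should be checked against the documentation for your version.

```sql
-- Let Hive decide per query whether to run in local mode.
set hive.exec.mode.local.auto=true;
-- Upper bound on total input size for local mode, in bytes (134217728 = 128 MB).
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- Upper bound on the number of input files for local mode.
set hive.exec.mode.local.auto.input.files.max=4;
```

A query whose input exceeds either limit is submitted to the cluster as usual.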
