Hive CSV table creation (header), Hive table creation (default)

Reposted

mob64ca14031c97 2024-02-08 22:24:31

Tags: hive csv table creation header, hive, data, Time. Category: Hive, big data

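1. Creating a table (MANAGED_TABLE):

create table student(id bigint,name string) row format delimited fields terminated by '\t' stored as sequencefile;

Notes: row format delimited means each record is stored as one delimited row, so one line is one record.

fields terminated by '\t' means the fields within a row are separated by a tab character.

stored as sequencefile sets the file format to sequencefile (a binary format organized as key-value pairs); textfile (plain text) is the other common choice and is the default.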

Creating the table writes the following information to the metastore database:

TBLS table: one record per table (including the table's creation time, id, owner, and type).

COLUMNS_V2 table: the columns of each table.

SDS table: the storage descriptor of each table (such as its location and file format).

DBS table: the databases that have been created (the default one is named default).

Note: the default database is stored under hdfs://ns1/user/...

Creating a new database:

The default database corresponds to the warehouse folder itself.

The new database gets its own folder, here named wk110.db.

A table corresponds to a folder inside the database folder, here t_order_wk, and the table's data is stored in that folder as files.
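
The folder layout described above can be sketched in a few lines of Python. This is only an illustration: the function name table_dir and the warehouse root passed to it are assumptions made for the example, not a Hive API and not the real path on this cluster.

```python
from pathlib import PurePosixPath


def table_dir(warehouse_root: str, database: str, table: str) -> str:
    """Return the folder a managed table would occupy under the warehouse."""
    root = PurePosixPath(warehouse_root)
    # The default database maps to the warehouse folder itself;
    # any other database gets a "<name>.db" subfolder.
    db_dir = root if database == "default" else root / f"{database}.db"
    return str(db_dir / table)


if __name__ == "__main__":
    # "/warehouse" is a placeholder root, not the cluster's real warehouse path.
    print(table_dir("/warehouse", "wk110", "t_order_wk"))
    print(table_dir("/warehouse", "default", "student"))
```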

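2. Loading (importing) data:

managed_table (internal table): the data must sit in the table's own directory.

2.1 Loading from the local file system
load data local inpath '/root/student.txt' into table student;

Note: this copies the file into the table's directory on HDFS. A file that we put into that directory ourselves can be queried as well. A file whose format does not match the table can still be queried: missing fields come back as NULL and extra fields are dropped.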
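
The NULL and extra-field behaviour can be modelled roughly as follows. This is a simplified Python sketch of how a delimited text line is fitted onto a table schema, not Hive's actual implementation; parse_row and the schema list are names invented for the example.

```python
def parse_row(line, schema, delimiter="\t"):
    """Fit one text line onto a list of (column_name, caster) pairs."""
    fields = line.rstrip("\n").split(delimiter)
    row = []
    for index, (_name, cast) in enumerate(schema):
        if index >= len(fields):
            row.append(None)  # missing trailing field -> NULL
            continue
        try:
            row.append(cast(fields[index]))
        except ValueError:
            row.append(None)  # value does not fit the column type -> NULL
    return row  # fields beyond the schema are simply ignored


# Schema of the student table from the text: (id, name).
STUDENT_SCHEMA = [("id", int), ("name", str)]

if __name__ == "__main__":
    for sample in ["1\tzhangsan", "2", "3\tlisi\textra", ""]:
        print(parse_row(sample, STUDENT_SCHEMA))
```

Under this model an empty line yields NULL for both columns, which is consistent with the NULL NULL row that appears in the query output later in this article.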

After the load, the file copied from the local file system is visible on HDFS, and the table can be queried and its rows counted.

Note: the count takes more than 34 seconds, but that does not mean Hive is slow; most of the time goes to starting the MapReduce job.

Can a file that we upload to that folder ourselves be queried?

Yes.

2.2 Loading from HDFS:

load data inpath '/uu.data' into table student;

Note: in this case the file is moved, not copied.

This causes a problem. The data is produced by a business system; if creating a table and loading the data moves the file, the business system will no longer find it the next time it reads it. This is why external_table (external table) exists.
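
The copy-versus-move difference between the two load forms can be imitated on a local file system. The Python sketch below uses temporary folders as a stand-in for HDFS; load_local and load_from_hdfs are hypothetical helper names, not Hive functions.

```python
import shutil
import tempfile
from pathlib import Path


def load_local(src: Path, table_dir: Path) -> Path:
    """Model of `load data local inpath`: the source file is copied."""
    return Path(shutil.copy(src, table_dir))


def load_from_hdfs(src: Path, table_dir: Path) -> Path:
    """Model of `load data inpath`: the source file is moved."""
    return Path(shutil.move(str(src), str(table_dir)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        table = base / "student"
        table.mkdir()
        local_file = base / "student.txt"
        local_file.write_text("1\tzhangsan\n")
        hdfs_file = base / "uu.data"
        hdfs_file.write_text("2\twangwu\n")

        load_local(local_file, table)
        load_from_hdfs(hdfs_file, table)
        print("local source still exists:", local_file.exists())
        print("moved source still exists:", hdfs_file.exists())
```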

external_table (external table): the data can live anywhere on HDFS (it stays where it already is).

What is the difference between external_table and managed_table?

Answer: a managed_table needs its files to be in its designated folder, whereas an external_table can leave the files in their original location, so nothing has to be moved.

Creating an external table

create external table student(id int, name string ) row format delimited fields terminated by '\t' location '/hive_ext';

Note: '/hive_ext' is a directory.

For drop table XXX; a managed_table has its metadata deleted together with its directory on HDFS, while an external_table has only its metadata deleted and keeps its directory and files.
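
The two drop behaviours can be mimicked with a toy model. In this Python sketch a plain dict plays the role of the metastore and local folders play the role of HDFS; drop_table and the dict layout are assumptions made for illustration only.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(metastore: dict, name: str) -> None:
    """Remove a table's metadata; delete its files only if it is managed."""
    info = metastore.pop(name)
    if info["type"] == "MANAGED_TABLE":
        shutil.rmtree(info["location"])
    # For EXTERNAL_TABLE the directory and its files are left untouched.


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        managed_dir = Path(tmp) / "student"
        external_dir = Path(tmp) / "hive_ext"
        managed_dir.mkdir()
        external_dir.mkdir()
        metastore = {
            "student": {"type": "MANAGED_TABLE", "location": managed_dir},
            "ext_student": {"type": "EXTERNAL_TABLE", "location": external_dir},
        }
        drop_table(metastore, "student")
        drop_table(metastore, "ext_student")
        print("managed dir kept:", managed_dir.exists())
        print("external dir kept:", external_dir.exists())
```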

3. Querying data

hive> select * from student;
OK
1 zhangsan
2 wangwu
3 lisi
NULL NULL
Time taken: 0.318 seconds, Fetched: 4 row(s)
hive> select * from student limit 2;
OK
1 zhangsan
2 wangwu
Time taken: 0.087 seconds, Fetched: 2 row(s)

Note: neither of these queries launches a MapReduce job.

hive> select sum(id) from student;
 Total jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks determined at compile time: 1
 In order to change the average load for a reducer (in bytes):
   set hive.exec.reducers.bytes.per.reducer=<number>
 In order to limit the maximum number of reducers:
   set hive.exec.reducers.max=<number>
 In order to set a constant number of reducers:
   set mapreduce.job.reduces=<number>
 Starting Job = job_1494075968732_0002, Tracking URL = http://heres04:8088/proxy/application_1494075968732_0002/
 Kill Command = /heres/hadoop-2.2.0/bin/hadoop job  -kill job_1494075968732_0002
 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
 2017-05-06 21:26:09,717 Stage-1 map = 0%,  reduce = 0%
 2017-05-06 21:26:19,428 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.71 sec
 2017-05-06 21:26:28,987 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.4 sec
 MapReduce Total cumulative CPU time: 3 seconds 400 msec
 Ended Job = job_1494075968732_0002
 MapReduce Jobs Launched: 
 Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.4 sec   HDFS Read: 232 HDFS Write: 2 SUCCESS
 Total MapReduce CPU Time Spent: 3 seconds 400 msec
 OK
 6
 Time taken: 31.431 seconds, Fetched: 1 row(s)

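Note: this query runs a MapReduce job. The startup overhead is fixed, so it matters little when the data volume is large.

1.4 External tables (EXTERNAL_TABLE)
To build an external table, the data comes first (it is already on HDFS); then we create a table that points at its directory. The hive client can also run hadoop commands.
Any file placed in that directory can be queried from the hive client. (This also applies to internal tables: internal or external, data placed in the table's directory can be queried. Partitioned tables are the exception.)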

hive> dfs -ls / 
     > ;
 Found 11 items
 -rw-r--r--   3 root supergroup         59 2017-04-18 16:55 /a.txt
 drwxr-xr-x   - root supergroup          0 2017-04-30 15:15 /hbase
 -rw-r--r--   3 root supergroup      27605 2017-04-18 10:44 /install.log
 -rw-r--r--   3 root supergroup      27605 2017-04-18 15:47 /log
 drwxr-xr-x   - root supergroup          0 2017-04-23 17:31 /sqoop
 drwx------   - root supergroup          0 2017-05-04 16:41 /tmp
 drwxr-xr-x   - root supergroup          0 2017-05-04 19:14 /user
 drwxr-xr-x   - root supergroup          0 2017-04-18 18:46 /wca
 drwxr-xr-x   - root supergroup          0 2017-04-18 18:53 /wcc
 drwxr-xr-x   - root supergroup          0 2017-04-18 18:09 /wcout
 drwxr-xr-x   - root supergroup          0 2017-04-18 19:53 /wcs
hive> dfs -mkdir /data;
 hive> dfs -put /root/student.txt /data/a.txt;
 hive> dfs -put /root/student.txt /data/b.txt;
 hive> create external table ext_student(id int,name string) row format delimited fields terminated by '\t' location '/data' ;
 OK
 Time taken: 0.338 seconds
 hive> select * from ext_student;
 OK
 1       zhangsan
 2       wangwu
 3       lisi
 NULL    NULL
 1       zhangsan
 2       wangwu
 3       lisi
 NULL    NULL

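1.5 Another way to create a table (used for temporary tables that hold intermediate results)

create table xxx as select id new_id,name new_name from student;

This creates the new table already filled with data, and a newly created file appears in its directory.

1.6 Single-row insert is not supported (at least in the Hive version used here), but batch insert is: it adds files to the folder named after the table, and is used to append intermediate results to a temporary table.

insert overwrite table xx select * from student;

overwrite: clears the existing files, then adds the new ones

into: appends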
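
Seen at the level of files in the table's folder, the two modes differ as in the small Python model below. It is a sketch only: the function insert_files and the file names are made up for the example and say nothing about how Hive names its output files.

```python
def insert_files(table_files, new_files, overwrite):
    """Return the files in a table folder after a batch insert."""
    if overwrite:
        # overwrite: the existing files are cleared before the new ones are added
        return list(new_files)
    # into: the new files are appended next to the existing ones
    return list(table_files) + list(new_files)


if __name__ == "__main__":
    existing = ["part-a"]  # hypothetical file names
    print(insert_files(existing, ["part-b"], overwrite=True))
    print(insert_files(existing, ["part-b"], overwrite=False))
```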

4. Creating a partitioned table

When a table holds a lot of data, for example a very large volume of orders that is analyzed month by month, the table can be partitioned (here it would be partitioned by month).

hive> create table beauties (id bigint, name string,size double ) partitioned by (nation string) row format delimited fields terminated by '\t';

Note that load data has to name the target partition:

hive> load data local inpath '/root/b.c' into table beauties partition(nation='china');

How is a partition stored?

A new folder is created for it under the table's directory.

Copying data from file:/root/b.c
 Copying file: file:/root/b.c
 Loading data to table default.beauties partition (nation=china)
 Partition default.beauties{nation=china} stats: [numFiles=1, numRows=0, totalSize=45, rawDataSize=0]
 OK
 Time taken: 0.739 seconds
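
The way a partition maps to a folder can be sketched as follows. This Python snippet only illustrates the key=value naming seen in the log above; partition_dir is a hypothetical helper, the table path is a placeholder, and the escaping Hive applies to special characters in partition values is ignored.

```python
def partition_dir(table_dir: str, spec: dict) -> str:
    """Build the folder for a partition: one `key=value` level per partition column."""
    levels = [f"{key}={value}" for key, value in spec.items()]
    return "/".join([table_dir.rstrip("/")] + levels)


if __name__ == "__main__":
    # "/warehouse/beauties" is a placeholder for the table's real directory.
    print(partition_dir("/warehouse/beauties", {"nation": "china"}))
    # A table partitioned by two columns nests one folder level per column.
    print(partition_dir("/warehouse/page_view", {"dt": "2017-05-06", "country": "cn"}))
```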

Statistics can be computed over all partitions or over one particular partition.

Why did the query on one partition find nothing? Because the metastore has no record of that partition (the original screenshot marked the missing entry in red). Adding the partition information fixes this:

hive> alter table beauties add partition (nation='Japan') location "/beauty/nation=Japan";
 OK
 Time taken: 0.185 seconds
 hive> select * from beauties;
 OK
 1       bgyjy   56.6565 Japan
 2       jzmb    23.232  Japan
 3       ewrwe   43.9    Japan
 1       glm     34.0    china
 2       lina    30.9    china
 3       liu     45.0    china
 4       bing    56.56   china 
Time taken: 0.104 seconds, Fetched: 7 row(s)
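
Why the rows only show up after alter table ... add partition can be modelled with a toy class. In the Python sketch below, one dict stands in for the partition entries in the metastore and another for the folders on HDFS; the class PartitionedTable and its methods are inventions for this example.

```python
class PartitionedTable:
    """Toy model: a query only scans partitions the metastore knows about."""

    def __init__(self):
        self.registered = {}  # partition value -> folder, as recorded in the metastore
        self.storage = {}     # folder -> list of rows, standing in for HDFS

    def add_partition(self, value, location):
        """Model of `alter table ... add partition (...) location ...`."""
        self.registered[value] = location

    def select_all(self):
        rows = []
        for value, location in self.registered.items():
            for row in self.storage.get(location, []):
                rows.append(row + (value,))  # partition column is appended to each row
        return rows


if __name__ == "__main__":
    table = PartitionedTable()
    # Data is placed in a folder by hand, without telling the metastore.
    table.storage["/beauty/nation=Japan"] = [(1, "bgyjy", 56.6565)]
    print(table.select_all())  # nothing is found yet
    table.add_partition("Japan", "/beauty/nation=Japan")
    print(table.select_all())  # the row is now visible
```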

Querying with a partition filter (where):

hive> select * from beauties where nation='Japan';
OK
1 bgyjy 56.6565 Japan
2 jzmb 23.232 Japan
3 ewrwe 43.9 Japan

Time taken: 0.253 seconds, Fetched: 3 row(s)

III. Importing data from mysql and analyzing it with hive

1. Add hive to the environment variables
vim /etc/profile
Append: :/heres/apache-hadoop.../bin

2. Create two tables in hive

create table trade_detail (id bigint, account string, income double, expenses double, time string) row format delimited fields terminated by '\t';
 create table user_info (id bigint, account string, name  string, age int) row format delimited fields terminated by '\t';

3. Import the data from mysql directly into hive. (sqoop first writes the data to hdfs and hive then loads it, which is why hive has to be on the environment variables.) Go to the bin directory inside the sqoop directory and run:

sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table trade_detail --hive-import --hive-overwrite --hive-table trade_detail --fields-terminated-by '\t'
 sqoop import --connect jdbc:mysql://192.168.1.10:3306/itcast --username root --password 123 --table user_info --hive-import --hive-overwrite --hive-table user_info --fields-terminated-by '\t'

4. Create a result table that stores the result of a sql query

create table result row format delimited fields terminated by '\t' as select t2.account, t2.name, t1.income, t1.expenses, t1.surplus from user_info t2 join (select account, sum(income) as income, sum(expenses) as expenses, sum(income-expenses) as surplus from trade_detail group by account) t1 on (t1.account = t2.account);
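
What this statement computes can be traced in plain Python. The sketch below mirrors the subquery (totals per account) and the inner join on account; the function build_result and the sample rows are made up for illustration and are not data from the original tables. NULL handling is ignored.

```python
from collections import defaultdict


def build_result(user_info, trade_detail):
    """user_info: (account, name) pairs; trade_detail: (account, income, expenses) triples."""
    # Subquery t1: sum income and expenses per account.
    totals = defaultdict(lambda: [0.0, 0.0])
    for account, income, expenses in trade_detail:
        totals[account][0] += income
        totals[account][1] += expenses

    # Inner join with user_info on account; surplus is income minus expenses.
    result = []
    for account, name in user_info:
        if account in totals:
            income, expenses = totals[account]
            result.append((account, name, income, expenses, income - expenses))
    return result


if __name__ == "__main__":
    users = [("acc_a", "Ann"), ("acc_b", "Bob"), ("acc_c", "Cy")]        # hypothetical rows
    trades = [("acc_a", 100.0, 30.0), ("acc_a", 50.0, 20.0), ("acc_b", 10.0, 40.0)]
    for row in build_result(users, trades):
        print(row)
```

An account with no trades (acc_c in the sample) does not appear in the result, because the join is an inner join.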

IV. Statement summary

1. Create a user table
create table user (id int, name string) row format delimited fields terminated by '\t';
2. Load data from the local file system into HIVE
load data local inpath '/root/user.txt' into table user;
3. Create an external table
create external table stubak (id int, name string) row format delimited fields terminated by '\t' location '/stubak';

4. Create a partitioned table
Ordinary table versus partitioned table: use a partitioned table when large amounts of data keep being added.
create table book (id bigint, name string) partitioned by (pubdate string) row format delimited fields terminated by '\t';

Loading data into a partitioned table
load data local inpath './book.txt' overwrite into table book partition (pubdate='2010-08-22');

V. Appendix

set hive.cli.print.header=true;
CREATE TABLE page_view(viewTime INT, userid BIGINT,
      page_url STRING, referrer_url STRING,
      ip STRING COMMENT 'IP Address of the User')
  COMMENT 'This is the page view table'
  PARTITIONED BY(dt STRING, country STRING)
  ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
  STORED AS SEQUENCEFILE;  -- or STORED AS TEXTFILE
create table tab_ip_seq(id int,name string,ip string,country string)
    row format delimited
    fields terminated by ','
    stored as sequencefile;
insert overwrite table tab_ip_seq select * from tab_ext;

-- create & load
create table tab_ip(id int,name string,ip string,country string)
    row format delimited
    fields terminated by ','
    stored as textfile;
load data local inpath '/home/hadoop/ip.txt' into table tab_ext;

-- external
CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
     ip STRING,
     country STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
 STORED AS TEXTFILE
 LOCATION '/external/hive';

-- CTAS: used for temporary tables that hold intermediate results
CREATE TABLE tab_ip_ctas
   AS
SELECT id new_id, name new_name, ip new_ip,country new_country
FROM tab_ip_ext
SORT BY new_id;

-- insert from select: used to append intermediate results to a temporary table
create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like
    select * from tab_ip;

-- CLUSTER (somewhat more advanced; leave it until you have the energy to study it)
create table tab_ip_cluster(id int,name string,ip string,country string)
clustered by(id) into 3 buckets;
load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_cluster;
set hive.enforce.bucketing=true;
insert into table tab_ip_cluster select * from tab_ip;
select * from tab_ip_cluster tablesample(bucket 2 out of 3 on id);
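
How rows land in buckets, and what bucket 2 out of 3 selects, can be approximated in Python. This sketch assumes the older hashing scheme in which an int column hashes to its own value; newer Hive versions can use a different hash function, so the exact bucket of a given id may differ there. The names bucket_of and tablesample are illustrative only.

```python
def bucket_of(id_value: int, num_buckets: int) -> int:
    """0-based bucket index. Assumption: an int hashes to its own value."""
    return (id_value & 0x7FFFFFFF) % num_buckets


def tablesample(rows, bucket: int, out_of: int):
    """Model of `tablesample(bucket x out of y on id)`; x is 1-based, id is row[0]."""
    return [row for row in rows if bucket_of(row[0], out_of) == bucket - 1]


if __name__ == "__main__":
    rows = [(i, f"name{i}") for i in range(1, 7)]  # hypothetical rows
    print(tablesample(rows, 2, 3))
```
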
-- PARTITION
create table tab_ip_part(id int,name string,ip string,country string)
    partitioned by (part_flag string)
    row format delimited fields terminated by ',';

load data local inpath '/home/hadoop/ip.txt' overwrite into table tab_ip_part
     partition(part_flag='part1');

load data local inpath '/home/hadoop/ip_part2.txt' overwrite into table tab_ip_part
     partition(part_flag='part2');
select * from tab_ip_part;
select * from tab_ip_part  where part_flag='part2';
select count(*) from tab_ip_part  where part_flag='part2';
alter table tab_ip change id id_alter string;
ALTER TABLE tab_cts ADD PARTITION (partCol = 'dt') location '/external/hive/dt';
show partitions tab_ip_part;

-- write query results to the local file system or to hdfs
insert overwrite local directory '/home/hadoop/hivetemp/test.txt' select * from tab_ip_part where part_flag='part1';
insert overwrite directory '/hiveout.txt' select * from tab_ip_part where part_flag='part1';

-- array
create table tab_array(a array<int>,b array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';

Sample data

tobenbrone,laihama,woshishui     13866987898,13287654321
abc,iloveyou,itcast     13866987898,13287654321

select a[0] from tab_array;
select * from tab_array where array_contains(b,'word');
insert into table tab_array select array(0),array(name,ip) from tab_ext t;

-- map
 create table tab_map(name string,info map<string,string>)
 row format delimited
 fields terminated by '\t'
 collection items terminated by ';'
 map keys terminated by ':';
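
How the three delimiters in this definition split a line of text can be shown with a short Python sketch. It is a simplified model of the parsing, not Hive's implementation; parse_map_row is a name invented for the example, and the sample line assumes a tab between the two columns.

```python
def parse_map_row(line, field_sep="\t", item_sep=";", key_sep=":"):
    """Split one line into (name, info) using the table's three delimiters."""
    # fields terminated by '\t': separates the name column from the map column
    name, raw_info = line.rstrip("\n").split(field_sep, 1)
    info = {}
    # collection items terminated by ';': separates the map entries
    for item in raw_info.split(item_sep):
        # map keys terminated by ':': separates each key from its value
        key, value = item.split(key_sep, 1)
        info[key] = value
    return name, info


if __name__ == "__main__":
    print(parse_map_row("fengjie\tage:18;size:36A;addr:usa"))
```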

Sample data:

fengjie            age:18;size:36A;addr:usa
furong        age:28;size:39C;addr:beijing;weight:180KG

load data local inpath '/home/hadoop/hivetemp/tab_map.txt' overwrite into table tab_map;
insert into table tab_map select name,map('name',name,'ip',ip) from tab_ext;

-- struct
create table tab_struct(name string,info struct<age:int,tel:string,addr:string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';
load data local inpath '/home/hadoop/hivetemp/tab_st.txt' overwrite into table tab_struct;
insert into table tab_struct select name,named_struct('age',id,'tel',name,'addr',country) from tab_ext;

-- cli shell
hive -S -e 'select country,count(*) from tab_ext group by country' > /home/hadoop/hivetemp/e.txt

This way of running hive makes it possible to execute hql statements in batches from a scripting language (bash shell, python).
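
A minimal Python sketch of such batch execution might look like this. It assumes the hive command is on the PATH and relies only on the -S and -e options used above; build_hive_command and run_queries are hypothetical helper names, and the runner parameter exists so the logic can be exercised without a Hive installation.

```python
import subprocess


def build_hive_command(query: str) -> list:
    """Build the argument list for one silent (-S) inline (-e) hive call."""
    return ["hive", "-S", "-e", query]


def run_queries(queries, runner=subprocess.run):
    """Run each query in its own hive process and collect the standard output."""
    outputs = []
    for query in queries:
        completed = runner(
            build_hive_command(query),
            capture_output=True,  # keep stdout instead of printing it
            text=True,            # decode bytes to str
            check=True,           # raise if hive exits with an error
        )
        outputs.append(completed.stdout)
    return outputs


if __name__ == "__main__":
    # Only prints the command; calling run_queries needs a working hive client.
    print(build_hive_command("select country,count(*) from tab_ext group by country"))
```

Starting a new hive process per statement is simple but slow; for many statements it is usually cheaper to put them in one script file and run that in a single call.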

hive -S -e 'select country,count(*) from <db_name>.tab_ext group by country'
select * from tab_ext sort by id desc limit 5;
select a.ip,b.book from tab_ext a join tab_ip_book b on(a.name=b.name);

-- UDF
select if(id=1,'first','no-first'),name from tab_ext;
hive> add jar /home/hadoop/myudf.jar;
hive> CREATE TEMPORARY FUNCTION my_lower AS 'org.dht.Lower';

This associates the custom function name with the java class.