Working with HDFS files on the VM:
Delete a file on HDFS: hadoop fs -rm /filename
Delete a directory on HDFS: hadoop fs -rm -r /directory-name
View a file's contents: hadoop fs -cat /filename
Add execute permission: chmod +x filename
ps -ef | grep name : find a process's PID
Copy a file: cp source-file target-file
mv can both rename and move a file (works on the same machine)
Copy a file to another machine: scp file-path hdp-2:$PWD
-r : recursive, for copying a whole directory
Batch-replace text inside a file:
1. vi a.txt (the file to edit)
2. Press Esc
3. Press Shift+:
4. Type %s/a/b/g and press Enter
Note: s stands for substitute; % means all lines; a is replaced with b; g stands for global (every occurrence on each line)
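The same global substitution can be sketched in Python (made-up sample lines; `re.sub` replaces every occurrence on a line, which is what the `g` flag does in vi):

```python
import re

lines = ["a cat", "bag", "no match"]
# equivalent of :%s/a/b/g — substitute every "a" with "b" on all lines
replaced = [re.sub("a", "b", line) for line in lines]
print(replaced)  # ['b cbt', 'bbg', 'no mbtch']
```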
Working in Hive:
(Hive data has two parts: the actual data, stored on HDFS, and the metadata describing it, usually stored in MySQL; this is configured in hive-site.xml)
First, get into Hive:
1. Start the hiveserver
2. Start the Hive client (clone another terminal session)
Another way to start Hive (directly):
Tip: for small test queries, you can have Hive submit the MR job to the local runner, which executes faster. Set this in the Hive session:
hive> set hive.exec.mode.local.auto=true;
hive> set hive.exec.mode.local.auto=false;  (to turn it back off)
Some basic settings make Hive more convenient to use, for example:
Show the current database in the prompt: hive> set hive.cli.print.current.db=true;
Show column names in query results: hive> set hive.cli.print.header=true;
Create a table: create table db_name.table_name(id string, name string, age string);
e.g.: create table person.teacher(id string,name string,age string) row format delimited fields terminated by ',';
Insert data into a table: insert into table_name values('id','name','age');
e.g.: insert into teacher values ('1','zhangsan','25');
Drop a table: drop table table_name;
Show a table's structure: desc table_name;
Show more detailed table information: desc formatted table_name;
Create an external table (external table): you define its location yourself at creation time; it does not use Hive's default warehouse path /user/hive/warehouse
Dropping an external table in Hive (drop table table_name) only deletes the metadata; the data stays on HDFS
Create a managed table (managed table): no need to define a location at creation time
Dropping a managed table in Hive deletes both the metadata and the data on HDFS
Metadata: descriptive information about the data, e.g. the table's columns, storage location, timestamps, size
An example with one partition column:
1. Create a partitioned table
create table person.t_access(ip string,url string,access_time string)
partitioned by(dt string)   -- the partition column
row format delimited
fields terminated by ',';
2. Make some fake data: create a file access.log1 under /root/hive_data
192.168.33.2,http://www.sina.com/a,2019-10-17 20:08:24
192.168.33.3,http://www.sina.com/b,2019-10-17 21:08:24
192.168.33.4,http://www.sina.com/a,2019-10-17 22:08:24
192.168.33.5,http://www.sina.com/c,2019-10-17 23:08:24
192.168.33.2,http://www.sina.com/a,2019-10-17 24:08:24
192.168.33.7,http://www.sina.com/b,2019-10-17 20:09:24
192.168.33.2,http://www.sina.com/a,2019-10-17 20:12:24
A second batch of fake data, access.log2
192.168.33.2,http://www.sina.com/a,2019-10-18 20:08:24
192.168.33.3,http://www.sina.com/b,2019-10-18 21:08:24
192.168.33.4,http://www.sina.com/a,2019-10-18 22:08:24
192.168.33.5,http://www.sina.com/c,2019-10-18 23:08:24
192.168.33.2,http://www.sina.com/a,2019-10-18 24:08:24
192.168.33.7,http://www.sina.com/b,2019-10-18 20:09:24
192.168.33.2,http://www.sina.com/a,2019-10-18 20:12:24
3. Load the data into the partitions (inside Hive), one file per partition:
load data local inpath '/root/hive_data/access.log1' into table t_access partition(dt='20191017');  (dt='20191017' is quoted because the partition column is of type string)
load data local inpath '/root/hive_data/access.log2' into table t_access partition(dt='20191018');
4. Query against the partitioned data
a. Total PV for October 17:
select count(*) from t_access where dt='20191017';
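A partition can be pictured as one HDFS directory per dt value, so a where dt=... query only scans that directory. A minimal sketch in Python, with made-up rows:

```python
# Each partition is just a directory keyed by its dt value.
partitions = {
    "20191017": [("192.168.33.2", "http://www.sina.com/a"),
                 ("192.168.33.3", "http://www.sina.com/b")],
    "20191018": [("192.168.33.2", "http://www.sina.com/a")],
}
# select count(*) from t_access where dt='20191017'
# — only the matching partition is read, the rest is pruned
pv = len(partitions["20191017"])
print(pv)  # 2
```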
5. Create a table from an existing table:
create table t1 like t2;
(the new table t1 has the same structure as t2, but no data; verify with desc)
select count(*) from table_name;
select max(ip) from table_name;
6. Create a table and insert data in one step:
create table t1
as
select * from t2;
1. Importing and exporting data
1.1 Importing a data file into a Hive table
Method 1: manually put the file into the table directory with an hdfs command
hadoop fs -put local-file hdfs-path   (the data then shows up in Hive)
e.g.: hadoop fs -put access.log3 /user/hive/warehouse/person.db/t_access
Method 2: in Hive's interactive shell, use a hive command to load a local file into the table path
load data local inpath 'local-path' into table table_name;  (the data appears in both Hive and on HDFS)
e.g.: load data local inpath '/root/hive_data/p3.data' into table teacher;
Method 3: use a hive command to load a data file already on HDFS into the table directory
load data inpath 'hdfs-file-path' into table table_name partition(partition_col='partition-value');
e.g.: load data inpath '/user/hive/warehouse/person.db/t_access/access.log3' into table t_access partition(dt='20191018');
Note the difference between loading a local file and loading an HDFS file:
a local file is copied into the table; an HDFS file is moved
(local in the command means a local file is loaded; without it, the path refers to HDFS)
Primitive types:
Java: boolean, char, byte, short, int, long, float, double
2. Complex types
2.1 The array type
Suppose the following data needs to be mapped to a Hive table.
Zhan 2,wujing:liushishi:yangyang,2017-08-16
Zhan 3,fanbingbing:lichen,2017-08-20
Idea: the cast is most conveniently mapped to an array
Create the table in Hive:
create table t_movie(movie_name string,actors array<string>,first_show date)
row format delimited fields terminated by ','
collection items terminated by ':';
Load the data (inside Hive): load data local inpath '/root/hive_data/movie.dat' into table t_movie;
Queries:
select t_movie.movie_name,actors[0] from t_movie;  (movie name plus the first actor)
select t_movie.movie_name,actors from t_movie where array_contains(actors,'fanbingbing');  (movies featuring fanbingbing)
select t_movie.movie_name,size(actors) as actor_num from t_movie;  (as creates an alias; counts the actors per movie)
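An array column behaves much like a Python list; as a sketch, the three queries above map onto list operations (rows hand-built from the sample data):

```python
rows = [
    ("Zhan 2", ["wujing", "liushishi", "yangyang"]),
    ("Zhan 3", ["fanbingbing", "lichen"]),
]
# actors[0] — index into the array
first_actors = [(name, actors[0]) for name, actors in rows]
# array_contains(actors, 'fanbingbing') — membership test
with_fbb = [name for name, actors in rows if "fanbingbing" in actors]
# size(actors) — length of the array
actor_num = [(name, len(actors)) for name, actors in rows]
print(first_actors, with_fbb, actor_num)
```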
2.2 The map type
1) Suppose the following data (create it locally)
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26
A map type is a good fit for describing the family members in this data
Create the table in Hive:
create table t_person(id int,name string,family_members map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';
Load the data (inside Hive): load data local inpath '/root/hive_data/relation_data' into table t_person;
Query:
select * from t_person;
+--------------+----------------+----------------------------------------------------------------+---------------+--+
| t_person.id | t_person.name | t_person.family_members | t_person.age |
+--------------+----------------+----------------------------------------------------------------+---------------+--+
| 1 | zhangsan | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} | 28 |
| 2 | lisi | {"father":"mayun","mother":"huangyi","brother":"guanyu"} | 22 |
| 3 | wangwu | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} | 29 |
| 4 | mayun | {"father":"mayongzhen","mother":"angelababy"} | 26 |
+--------------+----------------+----------------------------------------------------------------+---------------+--+
## get the value of a given key from the map column
select t_person.family_members['father'] as father from t_person;
+--------------+--+
| father |
+--------------+--+
| xiaoming |
| mayun |
| wangjianlin |
| mayongzhen |
+--------------+--+
## get all keys of the map column (map_values works the same way for values)
select id, name,map_keys(t_person.family_members) as relation from t_person;
+-----+-----------+--------------------------------+--+
| id | name | relation |
+-----+-----------+--------------------------------+--+
| 1 | zhangsan | ["father","mother","brother"] |
| 2 | lisi | ["father","mother","brother"] |
| 3 | wangwu | ["father","mother","sister"] |
| 4 | mayun | ["father","mother"] |
+-----+-----------+--------------------------------+--+
## combined: find the users who have a brother
select id,name,brother from
(select id,name,t_person.family_members['brother'] as brother from t_person) tmp
where brother is not null;
The subquery in parentheses runs first and is aliased tmp, then where brother is not null filters it, and finally select id,name,brother projects the result
+-----+-----------+----------+--+
| id  | name      | brother  |
+-----+-----------+----------+--+
| 1   | zhangsan  | xiaoxu   |
| 2   | lisi      | guanyu   |
+-----+-----------+----------+--+
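Parsing one raw line into the map column can be sketched in Python: fields split on ',', map entries on '#', key/value pairs on ':', matching the three delimiters declared in the DDL above:

```python
line = "1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28"
pid, name, members_raw, age = line.split(",")
# collection items terminated by '#', map keys terminated by ':'
family = dict(item.split(":") for item in members_raw.split("#"))
print(family["father"])     # family_members['father'] -> xiaoming
print(list(family))         # map_keys -> ['father', 'mother', 'brother']
print("brother" in family)  # has a brother -> True
```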
2.3 The struct type
1) Suppose the following data
1,zhangsan,18:male:beijing
2,lisi,28:female:shanghai
The user info contains: age (int), sex (string), address (string)
To describe the whole user info with a single column, a struct can be used
2) Create the table
create table t_person_struct(id int,name string,info struct<age:int,sex:string,addr:string>)
row format delimited fields terminated by ','
collection items terminated by ':';
Load the data (inside Hive): load data local inpath '/root/hive_data/t_person/' into table t_person_struct;
+---------------------+-----------------------+----------------------------------------------+--+
| t_person_struct.id | t_person_struct.name | t_person_struct.info |
+---------------------+-----------------------+----------------------------------------------+--+
| 1 | zhangsan | {"age":18,"sex":"male","addr":"beijing"} |
| 2 | lisi | {"age":28,"sex":"female","addr":"shanghai"} |
+---------------------+-----------------------+----------------------------------------------+--+
Query: select id,name,info.age from t_person_struct;
+-----+-----------+------+--+
| id | name | age |
+-----+-----------+------+--+
| 1 | zhangsan | 18 |
| 2 | lisi | 28 |
+-----+-----------+------+--+
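A struct column is essentially a small fixed record; a sketch with a NamedTuple, parsing one raw line with the delimiters from the DDL above:

```python
from typing import NamedTuple

class Info(NamedTuple):
    age: int
    sex: str
    addr: str

line = "1,zhangsan,18:male:beijing"
pid, name, info_raw = line.split(",")       # fields terminated by ','
age, sex, addr = info_raw.split(":")        # collection items terminated by ':'
info = Info(int(age), sex, addr)
print(info.age)  # 18 — like info.age in the select above
```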
3. Altering table definitions
3.1 Rename a table
alter table t_1 rename to t_x;
3.2 Rename a partition of a partitioned table
alter table t_1 partition(department='xiangsheng',sex='male',howold=20) rename to partition(department='1',sex='1',howold=20);
Note: every partition matching the old spec is renamed
3.3 Drop a partition (and its data) from a partitioned table
alter table t_1 drop partition (dt='20190804');
4. Hive query syntax
4.1 Basic query examples
Suppose a.txt contains
a,1
b,2
c,3
d,4
and b.txt contains
a,xx
b,yy
d,zz
e,pp
Create the tables:
create table t_a (name string,num string) row format delimited fields terminated by ',';
create table t_b(name string,num string) row format delimited fields terminated by ',';
Load the data (inside Hive)
Join with no on condition (a cross join: every row of t_a pairs with every row of t_b)
select a.*,b.* from t_a a join t_b b;
+---------+--------+---------+--------+--+
| a.name | a.num | b.name | b.num |
+---------+--------+---------+--------+--+
| a | 1 | a | xx |
| b | 2 | a | xx |
| c | 3 | a | xx |
| d | 4 | a | xx |
| a | 1 | b | yy |
| b | 2 | b | yy |
| c | 3 | b | yy |
| d | 4 | b | yy |
| a | 1 | d | zz |
| b | 2 | d | zz |
| c | 3 | d | zz |
| d | 4 | d | zz |
| a | 1 | e | pp |
| b | 2 | e | pp |
| c | 3 | e | pp |
| d | 4 | e | pp |
+---------+--------+---------+--------+--+
Left outer join (also with no on condition, so it degenerates to the same cartesian product): select a.*,b.* from t_a a left outer join t_b b;
+---------+--------+---------+--------+--+
| a.name | a.num | b.name | b.num |
+---------+--------+---------+--------+--+
| a | 1 | a | xx |
| a | 1 | b | yy |
| a | 1 | d | zz |
| a | 1 | e | pp |
| b | 2 | a | xx |
| b | 2 | b | yy |
| b | 2 | d | zz |
| b | 2 | e | pp |
| c | 3 | a | xx |
| c | 3 | b | yy |
| c | 3 | d | zz |
| c | 3 | e | pp |
| d | 4 | a | xx |
| d | 4 | b | yy |
| d | 4 | d | zz |
| d | 4 | e | pp |
+---------+--------+---------+--------+--+
group by aggregation
(when group by is present, it runs before the select list is evaluated)
select dt,count(*),max(ip) from t_access group by dt;
max returns the largest value in each group
select dt,count(*),max(ip) as cnt from t_access group by dt having dt>'20170804';
having dt>'20170804' keeps only the groups whose dt is greater than '20170804' (a string comparison, since dt is a string)
Execution order: group by first, then having, then select. having re-filters the results produced by the group by aggregation.
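The group by, then having, then select order can be sketched with a dict aggregation (made-up rows; count(*) and max(ip) per dt, then the having filter):

```python
rows = [("192.168.33.2", "20170803"), ("192.168.33.9", "20170805"),
        ("192.168.33.4", "20170805")]
# group by dt: bucket the rows, then aggregate each bucket
groups = {}
for ip, dt in rows:
    groups.setdefault(dt, []).append(ip)
agg = {dt: (len(ips), max(ips)) for dt, ips in groups.items()}
# having dt > '20170804': filter the aggregated groups (string comparison)
result = {dt: v for dt, v in agg.items() if dt > "20170804"}
print(result)  # {'20170805': (2, '192.168.33.9')}
```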