Below are statements for some common operations; for the rest, see the official Hive documentation.


grant create to user aaa; -- CREATE does not need ON database or ON table

 

  grant drop on database test to user aaa; -- DROP must specify a database or table

 
 
 
 

  show grant user datajingdo_m on database dbtest_001; 

 
 
 
 

  show grant role <role_name>;

  show grant role <role_name> on database test;

  show grant role <role_name> on table tab;

 
 
 
 

  revoke all on database dbtest_001 from user datajingdo_m;

-- Revokes the ALL privilege, provided ALL is what was granted. ALL covers the DML privileges (select/insert/update/delete); it cannot be used to revoke DDL privileges such as CREATE, ALTER, or DROP.
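A minimal sketch of that distinction, assuming the legacy authorization mode used in the grants above (user and table names are made up): REVOKE ALL undoes the DML grants, while DDL privileges must be revoked one by one.

  grant select, insert, update, delete on table test.tab1 to user aaa;
  revoke all on table test.tab1 from user aaa;  -- removes the four DML grants above
  grant create to user bbb;
  revoke create from user bbb;                  -- DDL privileges are revoked explicitly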



-- Plain DROP DATABASE only works on an empty database (one containing no tables)

hive> DROP DATABASE IF EXISTS financials CASCADE; -- with CASCADE, the tables are dropped recursively first, then the database



show current roles; -- show the current roles



create table test4 like test3; -- apart from the lifecycle property, test4's attributes (column types, partitioning, etc.) are identical to test3's



create table test5 as select * from test2;

-- CTAS creates test5, but partition and lifecycle information is not copied to the target table.



Hive only supports bulk deletes and bulk inserts. Common statement forms (see the sketch below):

  insert into table aaa111 select count(*) from aaa111;  -- INSERT INTO appends rows

  insert overwrite ...        -- replaces the existing data in the table or partition

  insert ... partition(...)   -- writes into a specific partition

  truncate table tab_name;    -- removes all rows from a table

  show partitions tab;        -- lists a table's partitions

  bucket                      -- tables can also be clustered into buckets
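A hedged sketch (table and column names are made up) of the insert overwrite, partitioned insert, and bucketed-table forms listed above:

  -- overwrite one partition with the matching rows from a source table
  insert overwrite table tab partition (dt='2016-05-24')
  select id, name from src where dt='2016-05-24';

  -- a bucketed table: rows are hashed on id into 8 buckets
  create table tab_bucketed (id int, name string)
  clustered by (id) into 8 buckets;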

 
 
 
 
is null: use IS NULL to test for null values (not = null).

explain: show a query's execution plan.
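For example (tableAAA here is the hypothetical table reused in the next tip):

hive> explain select name from tableAAA;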



Optimization: rewrite select distinct name from tableAAA as select name from tableAAA group by name.



Given the sales quantity of each SKU in each month, find each month's best-selling SKU:

select a.skuId, a.month
from sale_detail a
join (select month, max(sale_qtty) as sale_qtty_max
      from sale_detail
      group by month) b
on (a.month = b.month and a.sale_qtty = b.sale_qtty_max);
 
 
 
 
percentile, percentile_approx -- exact percentiles (integral columns only) and approximate percentiles (accepts doubles)
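A hedged one-liner (hypothetical orders table with a price column):

select percentile(cast(price as bigint), 0.5),  -- exact median; requires an integral type
       percentile_approx(price, 0.5)            -- approximate median; works on doubles
from orders;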
 
 
 
 
DELETE

    Deletes can only be performed on tables that support ACID. See Hive Transactions for details.

    For the configuration, see:

    https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
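A minimal sketch of an ACID-capable table (hypothetical name; the transaction-manager settings described on the page above must also be enabled):

create table acid_demo (id int, name string)
clustered by (id) into 4 buckets
stored as orc
tblproperties ('transactional'='true');

delete from acid_demo where id = 1;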



Simple Hive queries run as a fetch task instead of launching a MapReduce job, so no applicationID shows up under YARN applications.



The reason is the hive.fetch.task.conversion setting:



        hive -e "set" | grep conversion



        hive.fetch.task.conversion=more    -- the default is more: when the selected data volume is below the threshold, Hive fetches directly instead of running MR



-- the threshold (hive.fetch.task.conversion.threshold) defaults to 1G; an MR job is launched only when the data volume exceeds 1G



Launching a MapReduce job incurs system overhead. Starting with Hive 0.10.0, simple statements that need no aggregation, such as SELECT <col> FROM <table> LIMIT n, skip the MapReduce job and fetch the data directly through a fetch task. This can be enabled in any of the following ways:


hive> set hive.fetch.task.conversion=more; 
 
bin/hive --hiveconf hive.fetch.task.conversion=more
 
 
         or add the following to hive-site.xml:
 
 
 
  
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>
    Some select queries can be converted to a single FETCH task,
    minimizing latency. Currently the query should be single-sourced,
    not have any subquery, and should not have any aggregations or
    distincts (which incur RS), lateral views, or joins.
    1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
    2. more    : SELECT, FILTER, LIMIT only (+TABLESAMPLE, virtual columns)
  </description>
</property>
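For example (a hypothetical session; sim_user_gender is the table used in later examples), with conversion set to more a simple select runs as a fetch task and no MapReduce job or applicationID appears:

hive> set hive.fetch.task.conversion=more;
hive> select user_log_acct from sim_user_gender limit 10;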
 
 
applicationID

Every Hive MR job submitted to YARN is given an applicationID starting with the application_ prefix, along with a jobID starting with job_.
 
 
 
 
 
 

    The Hive metastore and HiveServer2 were scaled from 2 nodes to 4.
  
 
 

    Configuration for connecting to the metastore:
  
 
 

    thrift://a06-r12-i32-122.jd.local:9083,thrift://a06-r12-i32-119.jd.local:9083,thrift://a06-r12-i11-24.jd.local:9083,thrift://a06-r12-i11-25.jd.local:9083 
  
 
 To connect directly to the HiveServer2 service, use a ZooKeeper connection string for load balancing:
 
   <property>
     <name>hive.metastore.uris</name>
     <value>thrift://a06-r12-i32-122.jd.local:9083,thrift://a06-r12-i32-119.jd.local:9083,thrift://a06-r12-i11-24.jd.local:9083,thrift://a06-r12-i11-25.jd.local:9083</value>
   </property> 
 
 
regexp_replace usage

select regexp_replace(male, 'NaN', cast(1 - female as string)), female from tmp.sim_user_gender2 where male='NaN'; -- arguments: column, pattern to replace, replacement value

  select regexp_replace(name,'智慧','zhihui') from tmp.zhuxian_rate_name_distinct limit 5;
 
 

 
group by usage
 

  select user_log_acct,male,female,max(dt) as maxDT from sim_user_gender group by user_log_acct limit 3; 
 
 

  FAILED: SemanticException [Error 10025]: Line 1:21 Expression not in GROUP BY key 'male' 
 
 

  Columns in the SELECT list must either appear in the GROUP BY or be wrapped in an aggregate, which is why the query above fails.
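A hedged fix for the failing query (max is arbitrary here; any aggregate over male and female satisfies the rule):

  select user_log_acct, max(male) as male, max(female) as female, max(dt) as maxDT
  from sim_user_gender
  group by user_log_acct
  limit 3;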

 
 
 
 
left join usage

create table tmp.sim_user_gender3 as
select a.user_log_acct, a.male, a.female
from sim_user_gender a
left join
(select user_log_acct, max(dt) as maxDT from sim_user_gender group by user_log_acct) b
on (a.user_log_acct = b.user_log_acct and a.dt = b.maxDT);


Basic usage of the ROW_NUMBER() OVER function (HiveQL syntax):

select *, row_number() over (partition by COL1 order by COL2) from table_name;

hive> select user_log_acct, male, female, dt, row_number() over (partition by user_log_acct order by dt desc) from sim_user_gender;


*yx_100019279   0.5     0.5     2015-08-16      1
*yx_100019279   0.5     0.5     2015-08-15      2
*yx_100019279   0.5     0.5     2015-08-14      3
*yx_100019279   0.5     0.5     2015-08-13      4
*yx_100019279   0.5     0.5     2015-08-11      5
*yx_100019279   0.5     0.5     2015-08-10      6
*yx_100019279   0.5     0.5     2015-08-09      7
*yx_100019279   0.5     0.5     2015-08-08      8
*yx_100019279   0.5     0.5     2015-08-07      9
*yx_100019279   0.5     0.5     2015-08-06      10


This partitions the rows by user_log_acct and sorts each partition by dt descending; the function's value is each row's sequence number after sorting within its group (contiguous and unique within the group).
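Building on this, a sketch of the common "latest row per key" pattern (same table and columns as above): keep only the rows numbered 1.

select user_log_acct, male, female, dt
from (
  select user_log_acct, male, female, dt,
         row_number() over (partition by user_log_acct order by dt desc) as rn
  from sim_user_gender
) t
where rn = 1;  -- the most recent dt per user_log_acct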






Look up the names for the two ids in table temp_user_simularity.

Source table temp_user_simularity: (user1, user2, ratings)

Source table dim_user2: (usrid, name)

Target table temp_sim_user_map: (user1Name, user2Name, user1ID, user2ID)

Approach: join dim_user2 once for the user1Name column and once more for the user2Name column.

create table tmp.temp_sim_user_map as
select usr1.name as user1, usr2.name as user2,
       usr1.usrid as user1ID, usr2.usrid as user2ID
from tmp.temp_user_simularity a
left join dim_user2 usr1 on usr1.usrid = a.user1
left join dim_user2 usr2 on usr2.usrid = a.user2;



Exporting Hive partition data to a txt file, then adding it to another Hive cluster:

(1) Export:


./bin/hive -e "select * from testdb_001.employees" > data/ImportEmployees.txt




(2) Create the table:




CREATE EXTERNAL TABLE `app.app_user_profile_1_da`(
  `user_log_acct` string COMMENT 'user id',
  `cate` string COMMENT 'cid3 and cid3 name',
  `features` array<string> COMMENT 'feature tags')
PARTITIONED BY (
  `dt` string,
  `type` string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
TBLPROPERTIES (
  'transient_lastDdlTime'='1453368848');
 
 
  
 
 
(3) Add the partition:

alter table app_user_profile_1_da add partition (dt='2016-05-24', type='view-shop');

Load the data (the local path below is a placeholder for the exported file):

load data local inpath '/path/to/exported.txt' into table app_user_profile_1_da partition (dt='2016-05-24', type='view-shop');
  
 
  
 
   
 
 
 
  
 
 
distinct



select spu_name, count(distinct user_log_acct) from tmp.nonzhuxian_items group by spu_name;





Exporting a Hive table to a local csv (the default export delimiter is \t; if a string column contains \t or commas, replace them first):

hive -e "select regexp_replace(a,'\t+|,+','-'),b,countab,counta,confidenceab,supporta,lift from tmp.zhuxian_prefixt_ab" | awk -F '\t' '{print $1","$2","$3","$4","$5","$6","$7}' >> zhuxian_prefix_ab.csv




Number of Hive mappers:

    The main determining factors are: the total number of input files, the size of the input files, and the cluster's configured block size (currently 128M; it can be viewed in Hive with set dfs.block.size; and cannot be modified by users).

Examples:

    a)  If the input directory contains one file a of 780M, Hadoop splits file a into 7 blocks (six 128M blocks and one 12M block), producing 7 map tasks.

    b)  If the input directory contains three files a, b, c of 10M, 20M, and 130M, Hadoop produces 4 blocks (10M, 20M, 128M, 2M) and hence 4 map tasks.

            That is, a file larger than the block size (128M) is split; a file smaller than the block size is treated as a single block.

Does keeping every map's input close to a 128M block guarantee good performance?

Not necessarily. Take a 127M file: normally a single map handles it, but if the file has only one or two small columns yet tens of millions of rows, and the map-side logic is complex, a single map task will certainly be slow.

So, depending on the case, two remedies are needed: reducing the map count and increasing the map count (see the sketch below).
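A hedged sketch of the usual session knobs for both directions (values are illustrative; target_tab and source_tab are made-up names; the mapred.* names are the pre-YARN parameter names):

-- Reduce the map count by merging small files into larger splits:
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapred.max.split.size=256000000;           -- max bytes per split
set mapred.min.split.size.per.node=100000000;  -- below this, merge splits across a node
set mapred.min.split.size.per.rack=100000000;  -- below this, merge splits across a rack

-- Increase the map count by first spreading one file's rows over more files:
set mapred.reduce.tasks=10;
insert overwrite table target_tab
select * from source_tab distribute by rand();  -- 10 reducers write 10 files, giving downstream jobs more maps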


How Hive determines the number of reducers on its own:

The reducer count greatly affects execution efficiency. If it is not specified, Hive estimates one based on two settings:

hive.exec.reducers.bytes.per.reducer (data volume processed per reduce task; default 1000^3 = 1G)

hive.exec.reducers.max (maximum number of reduce tasks per job; default 999)

The formula for the reducer count is simple: N = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer). For example, 9G of reduce input gives N = min(999, 9) = 9 reducers.

That is, if the total reduce input (the map output) does not exceed 1G, there will be only one reduce task.
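A hedged sketch of overriding the estimate (values are illustrative):

set hive.exec.reducers.bytes.per.reducer=500000000;  -- halve the per-reducer volume, roughly doubling N
set mapred.reduce.tasks=15;                          -- or pin the reducer count directly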



Hive date functions: year, month, day, weekofyear. For example, filtering by month:

month(dt) in (3,5,7,9)
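A one-liner exercising each function (reusing the sim_user_gender table from earlier, whose dt is a 'yyyy-MM-dd' string):

select year(dt), month(dt), day(dt), weekofyear(dt) from sim_user_gender limit 1;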






Table app.userprofile is a large table containing 561,168,636 rows; gdm_m04_ord_det_sum contains 1,344,346 rows. To extract the userprofile rows whose ids appear in the latter:

select a.* from app.userprofile a
join (select user_log_acct from gdm.gdm_m04_ord_det_sum where month(dt)=7 group by user_log_acct) b
on a.user_log_acct=b.user_log_acct and a.dt='2016-10-12'



In practice, the version below is much faster (first write the subquery's result to a table, then join):






val glasses_users = hiveContext.sql("select user_log_acct from gdm.gdm_m04_ord_det_sum where item_third_cate_name like '%太阳镜%' and month(dt)=7 group by user_log_acct")

glasses_users.write.saveAsTable("tmp2.glasses_users")

val glasses_users_portrait = hiveContext.sql("select a.* from app.app_ba_userprofile_prop_api_mkt_sub a join tmp2.glasses_users b on a.user_log_acct=b.user_log_acct and a.dt='2016-10-12'")

glasses_users_portrait.write.saveAsTable("tmp2.glasses_user_portrait")





Use the small table to join the big table (put the small table on the left):


select b.* from tmp2.glasses_users a join app.app_ba_userprofile_prop_api_mkt_sub b on a.user_log_acct=b.user_log_acct and b.dt='2016-10-12'


Range (segment) queries:

select sum(case when cvl_glob_fmargval < 60 then 1 else 0 end) from userProfiles;  -- CASE form

select count(if(cvl_glob_fmargval<60,1,null)), count(if(cvl_glob_fmargval>=70 and cvl_glob_fmargval<80,1,null)) from userProfiles;  -- IF form