现有如此三份数据: 

 1、users.dat    数据格式为:  2::M::56::16::70072 

 对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String 

 对应字段中文解释:用户id,性别,年龄,职业,邮政编码 



 2、movies.dat 
 数据格式为: 2::Jumanji (1995)::Adventure|Children's|Fantasy 

 对应字段为:MovieID BigInt, Title String, Genres String 

 对应字段中文解释:电影ID,电影名字,电影类型 



 3、ratings.dat 
 数据格式为:  1::1193::5::978300760 

 对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String 

 对应字段中文解释:用户ID,电影ID,评分,评分时间戳 
题目要求:

 数据要求: 

 (1)写shell脚本清洗数据。(hive不支持解析多字节的分隔符,也就是说hive只能解析':', 不支持解析'::',所以用普通方式建表来使用是行不通的,要求对数据做一次简单清洗) 

 (2)使用Hive能解析的方式进行 





 Hive不支持多字节的分隔符。(Hive不支持 unicode 编码 > 128的字符 ) 



 补充一个知识点: 

 1、列分隔符的默认值是 

ctrl + a       \x01 

 2、集合的默认分隔符(array类型中的元素值之间的分隔符 和 map类型中 kv和kv之间的分隔符, struct类型各元素之间的分隔符) 

ctrl + b 
  \x02 

 3、map类型中的k和v之间的分隔符: 

ctrl + c 
  \x03 



 Hive要求: 

 (1)正确建表,导入数据(三张表,三份数据),并验证是否正确 



 drop table if exists users; 

 create table users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) 

 row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 

 with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s') 

 stored as textfile; 

 load data local inpath '/home/hadoop/users.dat' INTO TABLE users; 

 select * from users limit 5; 



 drop table if exists movies; 

 create table movies(MovieID BigInt, Title String, Genres String) 

 row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 

 with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s') 

 stored as textfile; 

 load data local inpath '/home/hadoop/movies.dat' INTO TABLE movies; 

 select * from movies limit 5; 



 drop table if exists ratings; 

 create table ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String) 

 row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 

 with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s') 

 stored as textfile; 

 load data local inpath '/home/hadoop/ratings.dat' INTO TABLE ratings; 

 select * from ratings limit 5; 


 (2)求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数) 


两个表链接分组注意: 

 create table result2 as  

 select r.movieid,m.title,count(*) total  

 from ratings r join movies m on r.movieid=m.movieid  

 group by r.movieid,m.title  

 order by total desc limit 10; 


 (3)分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)(Top10) 


思路: 

where gender = 'M' 

group by movieid, title 

avg(rating) as avgrate 

order by avgrate desc limit 10; 


答案: 

 create table result3 as  

 select r.movieid,m.title,avg(r.rating) as avgrate   

 from ratings r join movies m on r.movieid=m.movieid join users u on r.userid=u.userid  

 where u.gender='M'  

 group by r.movieid,m.title  

 order by avgrate desc limit 10; 


create table result2_1 as select a.movieid as movieid, b.title as title, avg(rating) as avgrate, count(*) as total from ratings a join movies b on a.movieid = b.movieid join users c on a.userid = c.userid  

where c.gender = 'M'  

group by a.movieid, b.title  

having total > 50  

order by avgrate desc limit 10 


 (4)求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好)的平均影评(年龄段,影评分) 


注意:如果一个需求中,出现了“每”“各”“分别”,那么就一定要进行分组 


select distinct age from users; 


\\如果需要划分年龄段:  case 。。。 when。。。。 

 

group by age 

where movieid = 2116 

avg(rating) as avgrate 

from ratings a join users b on a.userid = b.userid 

select age, avg(rating) as avgrate 

 

答案: 

 create table result4 as  

 select u.age,avg(r.rating) avgrate 

 from users u join ratings r on u.userid=r.userid  

 where r.movieid=2116  

 group by u.age;  


 (5)求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分) 


1、求出最喜欢看电影的女性 

 select  a.userid, count(*) as total   

 from ratings a join users b on a.userid = b.userid  

 where b.gender = "F"  

 group by a.userid  

 order by total desc limit 1;  


结果: 

1150    1302 


2、求出这个女性评分最高的10部电影 

 select rr.movieid from  

 (select movieid,rating from ratings  where userid=1150 order by rating desc limit 10) rr; 

结果: 

951 

3671 

3307 

1230 

904 

162 

3675 

1966 

3163 

2330 

 

3、求出这10部电影的平均影评分 

 create table result5 as  

 select r.movieid,avg(r.rating) as avgrate  

 from ratings r left semi join (select rr.movieid from  

 (select movieid,rating from ratings  where userid=1150 order by rating desc limit 10) rr) a  

 on r.movieid=a.movieid 

 group by r.movieid; 


 select * from result5; 


 结果:  162     4.063136456211812 

904     4.476190476190476 

951     4.249370277078086 

1230    4.14167916041979 

1966    3.6464646464646466 

2330    4.163043478260869 

3163    3.7039473684210527 

3307    4.387453874538745 

3671    4.047363717605005 

3675    3.8265682656826567 


 (6)求好片(评分>=4.0)最多的那个年份的最好看的10部电影(movieid,  title) 


比如:2012年上映了20部电影,结果只有9部好片。 那么最后的结果是输出9部电影还是10部电影?--》10部。 


意义:求出电影大年(好片最多的年)中的最好看的10部电影 


 1、求出好片最多的年份 



1、构造出年份字段 

2、按照年份(和电影)分组,按照电影的评分排降序,取前1 

3、电影的评分 


one:求每个电影的评分(按照movieid分组) 


// 字段:movieid,  year,  avgrate 

 

 create table result6_1 as  

 select r.movieid,substr(m.title,-5,4) as year,avg(r.rating) as avgrate  

 from ratings r join movies m on r.movieid=m.movieid 

 group by r.movieid,m.title; 


two:求每年好电影的个数(按照年份分组),取top1就是好片最多的年份 


 create table result6_2 as  

 select year,count(*) as total from result6_1  

 where avgrate>=4.0 

 group by year 

 order by total desc limit 1; 

 结果: 1998    27 


 2、求出这一年最好看的10部电影 

 create table result6_3 as  

 select movieid,avgrate from result6_1 

 where year =1998 order by avgrate desc limit 10; 


结果: 

1830    5.0 

2503    4.666666666666667 

2309    4.5 

2930    4.4 

2028    4.337353938937053 

2360    4.3076923076923075 

2839    4.3 

2357    4.283720930232558 

3077    4.2272727272727275 

2329    4.2265625 


 ***************************************************************************************************** 

 group by 字段的用法: 不能是select后面字段的别名 

 order by 字段的用法: 不能是select后面不出现的字段, 可以使用别名 

 补充一点: 

如果使用子查询,一定不要忘记给子查询的结果表取一个别名 

 ****************************************************************************************************** 


 (7)求1997年上映的电影中,评分最高的10部Comedy类电影 


重点问题:判断一个电影是不是comedy电影 


方式:hive内置函数instr可以用来判断,一个字段串在不在另外一个字符串中 

************** 

测试例子: 

SELECT instr(lcase('Comedy|Horror|Thriller'), 'comedy'); 

lcase:将字符串全部变为小写。comedy在字段串中时,结果不为0; 

 

分析: 

where year = 1997   (result6_1) 

where instr(lcase(genres), 'comedy') != 0 

order by avgrate desc limit 10; 

select movieid, avgrate 

from result6_1 a join moveis b on a.movieid = b.movieid 


 答案: 

 create table result7 as  

 select r.movieid,r.avgrate from result6_1 r join movies m on r.movieid=m.movieid  

 where r.year=1997 and instr(lcase(m.genres),'comedy') !=0  

 order by r.avgrate desc limit 10; 

结果: 

2324    4.329861111111111 

1827    4.0 

1871    4.0 

2444    4.0 

1784    3.9501404494382024 

2618    3.891304347826087 

1641    3.872393661384487 

1564    3.8333333333333335 

1734    3.825870646766169 

1500    3.813380281690141 

 

 (8)该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)(topn) 


核心思路: 按照电影类型分组,按照评价排序,取每组前 5 


前提:求出每种类型的所有电影 


测试例子: 

select explode(split("Horror|Sci-Fi|Thriller","\\|")); 

内置函数split为:按照指定分隔符,切分字符串 

结果: 
 Horror 

Sci-Fi 

Thriller 


需要转换数据: 

3826::Hollow Man (2000)::Horror|Sci-Fi|Thriller 

为: 

3826::Hollow Man (2000)::Horror 

3826::Hollow Man (2000)::Sci-Fi 

3826::Hollow Man (2000)::Thriller 

 

 答案: 

 

 第一步,每种类型的所有电影 

create table result8_1 as  

select movieid, title,  type.movietype    

************************************************************ 

from movies lateral view explode(split(genres,"\\|")) type as movietype; 

( 
 虚拟视图) 

 create table result8_1 as  

 select movieid,title, lv_type.mtype  

 from movies lateral view explode(split(genres,"\\|")) lv_type as mtype; 


result8_1 : movieid, title, movietype 

result6_1 : movieid, year, avgrate 


 第二步, 得让result8_1 和 result6_1链接 


 XXXXX 
 XXXXXXXXXXXXXXXXXXXX 

select a.movietype, a.movieid, b.avgrate  

from result8_1 a join result6_1 b  

on a.movieid = b.movieid  

group by a.movietype, a.movieid;   XXXXXXXXX 

 ******************************************************************** 

得利用新技能实现:  窗口函数:  row_number()  


具体用法: 


row_number() over (distribute by movietype sort by avgrate desc) as row_numer 


row_numer的含义: 就是该条记录在每一组当中的排序的序号 

 ************************************************** 

a 
 jfdls 
 1 

a 
 jhfk 
 2 

a 
 lsdkfl 
 3 

b 
 aa 
 1 

b 
 2304 
 2 

b 
 jlkfjsl 
 3 


分组排序,    distribute by movietype  sort by avgrate desc      

cluster by 


 最终的HQL语句: 


 create table result8_2 as  

 select * from 
带序号的字段 


 (select a.mtype,a.movieid,b.avgrate,row_number() over(distribute by mtype sort by avgrate desc) as rn 

 from result8_1 a join result6_1 b 

 on a.movieid=b.movieid) ab where ab.rn<=5;  


 扩展: 

 create table result8_3 as select a.movietype, a.movieid, b.avgrate, row_number() over (distribute by movietype sort by avgrate desc) as row_numer 

 from result8_1 a join result6_1 b  

 on a.movieid = b.movieid; 


 (9)各年评分最高的电影类型(年份,类型,影评分) 


result6_1 : movieid, year, avgrate 各个电影的评分 

result8_1 : movieid, title, mtype 各种类型的电影 


 create table result9_1 as  

 select a.year,b.mtype,avg(a.avgrate) rate 

 from result6_1 a join result8_1 b on a.movieid = b.movieid  

 group by year,mtype order by year,rate; 


 select * from (select *,row_number() over(distribute by year sort by rate desc) as rn from result9_1) ab 

 where ab.rn=1; 


 (10)每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分) 


 每个地区 每个电影 评分 

 create table result10_1 as  

 select u.zipcode,m.movieid,avg(r.rating) rate   

 from users u join ratings r on u.userid=r.userid join movies m on r.movieid=m.movieid  

 group by zipcode,m.movieid; 


 select * from (select *,row_number() over(distribute by zipcode sort by rate desc) as rn from result10_1) ab  

 where ab.rn=1; 

............ 

99203   150     5.0     1 

99205   587     5.0     1 

99217   2571    5.0     1 

99224   2133    5.0     1 

99352   1380    5.0     1 

99353   2393    5.0     1 

99504   3097    5.0     1 

99508   1196    5.0     1 

99516   2028    5.0     1 

99701   3615    5.0     1 

99703   1485    5.0     1 

99709   2371    5.0     1 

99801   3342    5.0     1 

99826   2858    5.0     1 

99945   3114    5.0     1