现有如此三份数据:
1、users.dat 数据格式为: 2::M::56::16::70072
对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释:用户id,性别,年龄,职业,邮政编码
2、movies.dat
数据格式为: 2::Jumanji (1995)::Adventure|Children's|Fantasy
对应字段为:MovieID BigInt, Title String, Genres String
对应字段中文解释:电影ID,电影名字,电影类型
3、ratings.dat
数据格式为: 1::1193::5::978300760
对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释:用户ID,电影ID,评分,评分时间戳
数据:
链接:https://pan.baidu.com/s/1VL7Khp9TbEQ2kqDLNeInHA 密码:kfp5
数据要求:
(1)写shell脚本清洗数据。(hive不支持解析多字节的分隔符,也就是说hive只能解析':', 不支持解析'::',所以用普通方式建表来使用是行不通的,要求对数据做一次简单清洗)
(2)使用Hive能解析的方式进行
Hive要求:正确建表,导入数据(三张表,三份数据),并验证是否正确
create table users(userid BigInt, gender String, age Int, occupation String, zipcode String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
load data local inpath '/home/hadoop/moviedata/users.dat' INTO TABLE users;
select * from users limit 5;
create table movies(movieid BigInt, name String, type String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;
load data local inpath '/home/hadoop/moviedata/movies.dat' INTO TABLE movies;
select * from movies limit 5;
create table ratings(userid BigInt, movieid BigInt, rate Double, ts String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
load data local inpath '/home/hadoop/moviedata/ratings.dat' INTO TABLE ratings;
select * from ratings limit 5;
问题1:求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
思路:
要求的字段: 电影名称 movies.name 评分次数 ratings count
核心:按照电影进行分组,求出每组的总评分次数。
select a.movieid, b.name, count(a.userid) as total
from ratings a join movies b on a.movieid = b.movieid
group by a.movieid, b.name
order by total desc
limit 10;
问题2:分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
思路:
要查询的字段: 性别 users.gender 电影名 movies.name 影评分 ratings.rate avg( )
分别求男性和女性评分最高的10部电影
select m.gender,m.name,avg(r.rate) as avgrte,count(r.rate) as count
from ratings r
join users u on r.userid = u.userid
join movies m on r.movieid = m.movieid
where u.gender = 'F'
group by m.movieid,m.name
having count >= 50
order by avgrate desc limit 10;
select m.gender,m.name,avg(r.rate) as avgrte,count(r.rate) as count
from ratings r
join users u on r.userid = u.userid
join movies m on r.movieid = m.movieid
where u.gender = 'M'
group by m.movieid,m.name
having count >= 50
order by avgrate desc limit 10;
问题3:求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)
思路:
要查询的字段:年龄段 users.age 影评分 ratings.rate
select b.age as age , avg(a.rate) as avgrate
from ratings a join users b on a.userid = b.userid
where a.movieid = 2116
group by b.age;
问题4:求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)思路
思路:
要查询的字段: 观影者:最喜欢看电影的那个女性;电影名称 movies.name ; 影评分 ratings.rate
核心要点:1、求出最喜欢看电影的女性的ID;
2、求出该女性最喜欢看的10部电影;
3、求出这10部电影的平均影评分;
第一个步骤:要查询的字段users.userid count(ratings.rate)
核心要点:按照userid分组 构建中间表供后续使用。
create table answer4_1 as
select b.userid, count(a.movieid) as total
from ratings a join users b on a.userid = b.userid
where b.gender = "F"
group by b.userid
order by total desc limit 1;
第二个步骤:要查询的字段movies.movieid
核心条件是userid = answer4_1.userid 将查询结果作为中间表 answer4_2
create table answer4_2 as
select a.movieid,a.rate from ratings a join answer4_1 t on a.userid = t.userid
order by a.rate desc limit 10;
第三个步骤:要查询的字段 电影ID answer4_2 .movieid 影评平均分 avg(ratings.rate)
create table answer4_3 as
select a.movieid, c.name, avg(a.rate) as avgrate
from ratings a join answer5_2 b on a.movieid = b.movieid
join movies c on c.movieid = a.movieid
group by a.movieid, c.name;
问题5:求好片(评分>=4.0)最多的那个年份的最好看的10部电影
首先构建一个movie_avgrate_year表,该表中存储所有的movieid, avg(rate), year
create table movie_rate_year as
select a.movieid as movieid, avg(a.rate) as avgrate, substring(b.name, -5, 4) as year
from ratings a join movies b on a.movieid = b.movieid
group by a.movieid, substring(b.name, -5, 4);
第一步:求出好片最多的那个年份;
核心要点:按照好片计数:count(movieid) as total where avgrate >= 4.0
按照年份分组:group by year
按照 total 排序降序
取好片最多的年份,就是total 最大的那个year limit 1 获取这个电影大年 : max_year
create table max_year as
select year, count(movieid) as total
from movie_rate_year
where avgrate >= 4
group by year
order by total desc
limit 1;
第二步:求出电影大年中最好看的10部电影
核心要点:
过滤条件:where year = max_year.year
按照avgrate排降序
limit 10
select a.movieid,a.avgrate
from movie_rate_year a
join max_year b on a.year = b.year
order by a.avgrate desc limit 10;
需求中的异常:
1、如果说好片做多的那个年份中,上映了20部电影,但是只有8部电影是好片。 1998
最终怎么返回结果?
按照我们当前的需求:就返回10部
2、如果好片中,第10,11,12都是4.08分
最终怎么返回? 就返回10部
排名:
98 97 97 96 96 95
1 2 3 4 5 6
1 2 2 3 3 4
1 2 2 4 4 6
这里的排名如果有需要可以参考下面这三个函数的用法:
RANK()
DENSE_RANK()
ROW_NUMBER()