文章目录
- 数据描述
- 数据要求
- 题目
- 1. 正确建表,导入数据(三张表,三份数据),并验证是否正确
- 2. 求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
- 3. 分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
- 4. 求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)
- 5. 求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)
- 6. 求好片(评分>=4.0)最多的那个年份(上映)的最好看的10部电影
- 7. 求1997年上映的电影中,评分最高的10部Comedy类电影
- 8. 该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)
- 9. 各年评分最高的电影类型(年份,类型,影评分)
- 10. 每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)
数据描述
- users.dat 用户表
数据格式为: 2::M::56::16::70072
对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
对应字段中文解释:用户id,性别,年龄,职业,邮政编码 - movies.dat 电影表
数据格式为: 2::Jumanji (1995)::Adventure|Children’s|Fantasy
对应字段为:MovieID BigInt, Title String, Genres String
对应字段中文解释:电影ID,电影名字,电影类型 - ratings.dat 评分表
数据格式为: 1::1193::5::978300760
对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
对应字段中文解释:用户ID,电影ID,评分,评分时间戳
数据要求
(1)写shell脚本清洗数据。(hive不支持解析多字节的分隔符,也就是说hive只能解析’:’, 不支持解析’::’,所以用普通方式建表来使用是行不通的,要求对数据做一次简单清洗)
(2)使用Hive能解析的方式进行
要求:外部表
hdfs: /user/data/yingping/movies ratings users
题目
-- 设置本地模式 及 显示表头
hive> set hive.exec.mode.local.auto=true;
hive> set hive.cli.print.header=true;
-- hdfs 数据目录创建
hive> dfs -mkdir -p /user/data/yingping/movies;
hive> dfs -mkdir -p /user/data/yingping/ratings;
hive> dfs -mkdir -p /user/data/yingping/users;
-- hdfs 数据目录 上传数据
hive> dfs -put users.dat /user/data/yingping/users;
hive> dfs -put movies.dat /user/data/yingping/movies;
hive> dfs -put ratings.dat /user/data/yingping/ratings;
-- 创建yingping数据库
hive> create database yingping;
hive> use yingping;
1. 正确建表,导入数据(三张表,三份数据),并验证是否正确
-- 创建表 movies
create external table if not exists movies(
MovieID BigInt, Title String, Genres String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s') stored as textfile
location '/user/data/yingping/movies';
hive> select * from movies limit 2;
OK
movies.movieid movies.title movies.genres
1 Toy Story (1995) Animation|Children's|Comedy
2 Jumanji (1995) Adventure|Children's|Fantasy
Time taken: 1.142 seconds, Fetched: 2 row(s)
-- 创建表 ratings
create external table if not exists ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s') stored as textfile
location '/user/data/yingping/ratings';
hive> select * from ratings limit 2;
OK
ratings.userid ratings.movieid ratings.rating ratings.timestamped
1 1193 5.0 978300760
1 661 3.0 978302109
Time taken: 0.145 seconds, Fetched: 2 row(s)
-- 创建表 users
create external table if not exists users(userid bigint,gender string,age int,occupation string,zipcode string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s') stored as textfile location '/user/data/yingping/users';
hive> select * from users limit 2;
OK
users.userid users.gender users.age users.occupation users.zipcode
1 F 1 10 48067
2 M 56 16 70072
Time taken: 0.137 seconds, Fetched: 2 row(s)
2. 求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
----分析
--也就是要先求出,每一部电影总评分次数 排序
分组:电影id title ->movies表
求:count() ->ratings表
排序:count() desc limit 10
select
a.Title,count(b.Rating) totalRate
from movies a join ratings b
on a.MovieID=b.MovieID
group by a.MovieID,a.Title
order by totalRate desc limit 10;
--结果
a.title totalrate
American Beauty (1999) 3428
Star Wars: Episode IV - A New Hope (1977) 2991
Star Wars: Episode V - The Empire Strikes Back (1980) 2990
Star Wars: Episode VI - Return of the Jedi (1983) 2883
Jurassic Park (1993) 2672
Saving Private Ryan (1998) 2653
Terminator 2: Judgment Day (1991) 2649
Matrix, The (1999) 2590
Back to the Future (1985) 2583
Silence of the Lambs, The (1991) 2578
3. 分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
方法一:分开求
----分析
--1.求男性
过滤条件:男性 ->users表
每一部电影的平均评分
分组:电影Id 电影名 ->movies表
求:avg(Rating) -> ratings表
排序:avg(Rating) desc limit 10
select
c.Gender Gender,
a.Title Title,
avg(b.Rating) avgrate
from movies a join ratings b on a.MovieID=b.MovieId
join users c on b.UserID=c.UserID
where c.Gender='M'
group by a.MovieID,a.Title,c.Gender
order by avgrate desc limit 10;
--结果
gender title avgrate
M Small Wonders (1996) 5.0
M Smashing Time (1967) 5.0
M Baby, The (1973) 5.0
M Gate of Heavenly Peace, The (1995) 5.0
M Schlafes Bruder (Brother of Sleep) (1995) 5.0
M Dangerous Game (1993) 5.0
M Follow the Bitch (1998) 5.0
M Bells, The (1926) 5.0
M Lured (1947) 5.0
M Angela (1995) 5.0
--2.求女性
select
c.Gender Gender,a.Title Title,avg(b.Rating) avgrate
from
movies a join ratings b
on a.MovieID=b.MovieID join users c on b.UserID=c.UserID
where c.Gender='F'
group by a.MovieID,a.Title,c.Gender
order by avgrate desc limit 10;
--结果
gender title avgrate
F For the Moment (1994) 5.0
F Gate of Heavenly Peace, The (1995) 5.0
F Skipped Parts (2000) 5.0
F Raw Deal (1948) 5.0
F Clean Slate (Coup de Torchon) (1981) 5.0
F Coldblooded (1995) 5.0
F Country Life (1994) 5.0
F Message to Love: The Isle of Wight Festival (1996) 5.0
F Bittersweet Motel (2000) 5.0
F Lamerica (1994) 5.0
方法二:分析函数
----分析
--每个性别中评分最高的10部电影
--每个性别中每部电影的评分
分组:Gender MovieID Title
求:avg
select
c.Gender Gender,a.Title Title,avg(b.Rating) avgrate
from movies a join ratings b on a.MovieID=b.MovieID
join users c on b.UserID=c.UserID
group by c.Gender,a.MovieID,a.Title;
--分组求topN
局部排序
分组:Gender
排序: avgrate
select * from
(select
Gender,Title,avgrate,row_number() over(distribute by Gender sort by avgrate desc) index
from
(select
c.Gender Gender,a.Title Title,avg(b.Rating) avgrate
from movies a join ratings b on a.MovieID=b.MovieID
join users c on b.UserID=c.UserID
group by c.Gender,a.MovieID,a.Title
) d) e where index<=10;
--结果
e.gender e.title e.avgrate e.index
F Brother, Can You Spare a Dime? (1975) 5.0 1
F Woman of Paris, A (1923) 5.0 2
F Ayn Rand: A Sense of Life (1997) 5.0 3
F One Little Indian (1973) 5.0 4
F Song of Freedom (1936) 5.0 5
F Belly (1998) 5.0 6
F Gambler, The (A J�t�kos) (1997) 5.0 7
F Clean Slate (Coup de Torchon) (1981) 5.0 8
F Country Life (1994) 5.0 9
F Other Side of Sunday, The (S�ndagsengler) (1996) 5.0 10
M Bells, The (1926) 5.0 1
M Dangerous Game (1993) 5.0 2
M Follow the Bitch (1998) 5.0 3
M Small Wonders (1996) 5.0 4
M Schlafes Bruder (Brother of Sleep) (1995) 5.0 5
M Angela (1995) 5.0 6
M Baby, The (1973) 5.0 7
M Ulysses (Ulisse) (1954) 5.0 8
M Gate of Heavenly Peace, The (1995) 5.0 9
M Smashing Time (1967) 5.0 10
4. 求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)
--分析
过滤:movieID=2116 -> ratings表
分组:年龄 -> users表
求:avg ->ratings表
select
a.Age,avg(b.Rating) avgrate
from users a join ratings b
on a.UserID=b.UserID
where b.MovieID=2116
group by a.Age;
--结果
a.age avgrate
1 3.2941176470588234
18 3.3580246913580245
25 3.436548223350254
35 3.2278481012658227
45 2.8275862068965516
50 3.32
56 3.5
5. 求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)
首先这个题目可能有点歧义:
1.求,这个女性评分最高的10部电影的总平均分
2. 求,这个女生评分最高的10部电影的,每一电影的平均分
--如果是2.这个女生评分最高的10部电影的,每一电影的平均分
----分析
--1) 先找到这位女性
过虑:Gender="F" ->users表
分组:UserId ->ratings表
求:count() ->ratings表
排序:count() desc limit 1
create table answer05_step01 as
select
a.UserID UserID,count(a.Rating) totalRate
from ratings a join users b
on a.UserID=b.UserID
where b.Gender="F"
group by a.UserID
order by totalRate desc limit 1;
--结果
userid totalrate
1150 1302
--2) 然后这个女性最喜欢的10部电影
--这个女生评分最高的10部电影
过滤:女性 1150 ->answer05_step01表
排序:评分 desc limit10 ->ratings表
create table answer05_step02 as
select
a.UserID UserID,a.MovieID MovieID,a.Rating Rating
from ratings a join answer05_step01 b on
a.UserID=b.UserID
order by Rating desc limit 10;
----****注意 where子查询运用一个查询结果问题***
--结果(这位女性评分最高的10部电影)
hive> select * from answer05_step02;
OK
answer05_step02.userid answer05_step02.movieid answer05_step02.rating
1150 745 5.0
1150 1279 5.0
1150 1236 5.0
1150 904 5.0
1150 750 5.0
1150 2997 5.0
1150 2064 5.0
1150 905 5.0
1150 1094 5.0
1150 1256 5.0
Time taken: 0.128 seconds, Fetched: 10 row(s)
--3) 她喜欢的10部电影的平均影评 (每一部电影的平均影评)
分组:MovieID ->answer05_step02表
求 avg -> ratings表
select
a.MovieID MovieID,avg(b.Rating) avgrate
from answer05_step02 a left join ratings b
on a.MovieID=b.MovieID
group by a.MovieID;
--结果
movieid avgrate
745 4.52054794520548
750 4.4498902706656915
904 4.476190476190476
905 4.280748663101604
1094 3.7314890154597236
1236 4.188888888888889
1256 4.21043771043771
1279 3.747422680412371
2064 4.0739348370927315
2997 4.125390450691656
----上面3步联合查询就是
select
a.MovieID MovieID,avg(b.Rating) avgrate
from
(
select
a.UserID UserID,a.MovieID MovieID,a.Rating Rating
from ratings a join
(
select
a.UserID UserID,count(a.Rating) totalRate
from ratings a join users b
on a.UserID=b.UserID
where b.Gender="F"
group by a.UserID
order by totalRate desc limit 1
) b on
a.UserID=b.UserID
order by Rating desc limit 10
)
a left join ratings b
on a.MovieID=b.MovieID
group by a.MovieID;
--结果
movieid avgrate
162 4.063136456211812
904 4.476190476190476
951 4.249370277078086
1230 4.14167916041979
1966 3.6464646464646466
2330 4.163043478260869
3163 3.7039473684210527
3307 4.387453874538745
3671 4.047363717605005
3675 3.8265682656826567
---联合查询时候,会替换上边儿分步表,那我们实验一下,表别名重复影响不影响?
--实验发现他们局部表名,不影响
select
a.MovieID MovieID,avg(b.Rating) avgrate
from
(
select
c.UserID UserID,c.MovieID MovieID,c.Rating Rating
from ratings c join
(
select
e.UserID UserID,count(e.Rating) totalRate
from ratings e join users f on e.UserID=f.UserID
where f.Gender="F"
group by e.UserID
order by totalRate desc limit 1
) d
on
c.UserID=d.UserID
order by Rating desc limit 10
) a left join ratings b
on a.MovieID=b.MovieID
group by a.MovieID;
--结果
movieid avgrate
162 4.063136456211812
904 4.476190476190476
951 4.249370277078086
1230 4.14167916041979
1966 3.6464646464646466
2330 4.163043478260869
3163 3.7039473684210527
3307 4.387453874538745
3671 4.047363717605005
3675 3.8265682656826567
---但是,我发现联合查询与分步查询的结果不一致???
6. 求好片(评分>=4.0)最多的那个年份(上映)的最好看的10部电影
----分析
--好片最多的那个年份?
--每一部电影的平均评分
--每一年评分>=4.0的电影部数 按这个部数排序 desc limit1
--1.每一年每一部电影的平均评分
分组:年份 电影 -> movies表
求: avg(Ratings) ->ratings表
--电影表数据是这样的,电影名称截取年份
movies.movieid movies.title movies.genres
1 Toy Story (1995) Animation|Children's|Comedy
create table year_movie_avgrate as
select
substr(a.Title,-5,4) year,a.MovieID MovieID,a.Title Title,avg(b.Rating) avgrate
from movies a join ratings b
on a.MovieID=b.MovieID
group by a.MovieID,a.Title,substr(a.Title,-5,4);
--year_movie_avgrate 存入数据样例
year movieid title avgrate
1995 1 Toy Story (1995) 4.146846413095811
1995 2 Jumanji (1995) 3.20114122681883
1995 3 Grumpier Old Men (1995) 3.01673640167364
--2. 好片最多的那个年份
分组:年份
过滤 >=4.0
求:count()
排序:count() desc limit 1
select
year,count(*) totalMovie
from year_movie_avgrate
where avgrate>=4.0
group by year
order by totalMovie desc limit 1;
--结果
year totalmovie
1998 27
--3. 求这一年最好看的10部电影
过滤:1998 ->year_movie_avgrate表
排序:评分 倒序 limit 10
select
b.MovieID MovieID,b.Title Title,b.avgrate avgrate
from (select
year,count(*) totalMovie
from year_movie_avgrate
where avgrate>=4.0
group by year
order by totalMovie desc limit 1) a
left join year_movie_avgrate b
on a.year=b.year
order by avgrate desc limit 10;
--结果
movieid title avgrate
1830 Follow the Bitch (1998) 5.0
2503 Apple, The (Sib) (1998) 4.666666666666667
2309 Inheritors, The (Die Siebtelbauern) (1998) 4.5
2930 Return with Honor (1998) 4.4
2028 Saving Private Ryan (1998) 4.337353938937053
2360 Celebration, The (Festen) (1998) 4.3076923076923075
2839 West Beirut (West Beyrouth) (1998) 4.3
2357 Central Station (Central do Brasil) (1998) 4.283720930232558
3077 42 Up (1998) 4.2272727272727275
2329 American History X (1998) 4.2265625
7. 求1997年上映的电影中,评分最高的10部Comedy类电影
----分析
year_movie_avgrate表: year movieid title avgrate
过滤:1997& Comedy -> movies表
排序:评分 desc limit 10
select
a.year year,a.MovieID MovieID,a.Title Title,a.avgrate avgrate,b.Genres Genres
from year_movie_avgrate a join movies b
on a.MovieID=b.MovieID
where a.year="1997" and instr(lower(b.Genres),"comedy")>0
order by avgrate desc limit 10;
--结果
year movieid title avgrate genres
1997 2324 Life Is Beautiful (La Vita � bella) (1997) 4.329861111111111 Comedy|Drama
1997 2444 24 7: Twenty Four Seven (1997) 4.0 Comedy|Drama
1997 1827 Big One, The (1997) 4.0 Comedy|Documentary
1997 1871 Friend of the Deceased, A (1997) 4.0 Comedy|Drama
1997 1784 As Good As It Gets (1997) 3.9501404494382024 Comedy|Drama
1997 2618 Castle, The (1997) 3.891304347826087 Comedy
1997 1641 Full Monty, The (1997) 3.872393661384487 Comedy
1997 1564 Roseanna's Grave (For Roseanna) (1997) 3.8333333333333335 Comedy|Romance
1997 1734 My Life in Pink (Ma vie en rose) (1997) 3.825870646766169 Comedy|Drama
1997 1500 Grosse Pointe Blank (1997) 3.813380281690141 Comedy|Crime
8. 该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)
----分析
分组:类型 MovieID ->movies表
求:avg ->ratings表
--1. 炸裂,电影类型
select
MovieID,Title,tps.tp type
from movies
lateral view explode(split(Genres,"\\|")) tps as tp;
-- 数据
movieid title type
1 Toy Story (1995) Animation
1 Toy Story (1995) Children's
1 Toy Story (1995) Comedy
2 Jumanji (1995) Adventure
2 Jumanji (1995) Children's
--2. 关联求 每种类型 每一部电影的平均分
select
lower(a.type) type,a.MovieID MovieID,a.Title Title,avg(b.Rating) avgrate
from (
select
MovieID,Title,tps.tp type
from movies
lateral view explode(split(Genres,"\\|")) tps as tp
) a join ratings b
on a.MovieID=b.MovieID
group by lower(a.type),a.MovieID,a.Title;
--结果
type movieid title avgrate
action 6 Heat (1995) 3.8787234042553194
action 9 Sudden Death (1995) 2.656862745098039
action 10 GoldenEye (1995) 3.5405405405405403
action 15 Cutthroat Island (1995) 2.458904109589041
action 20 Money Train (1995) 2.5375
--3. 分组求topN
分组:type
排序:avgrate desc
select * from(
select
c.type type,c.MovieID MovieID,c.Title Title,c.avgrate avgrate,row_number() over(partition by c.type order by c.avgrate desc) index
from(
select
lower(a.type) type,a.MovieID MovieID,a.Title Title,avg(b.Rating) avgrate
from (
select
MovieID,Title,tps.tp type
from movies
lateral view explode(split(Genres,"\\|")) tps as tp
) a join ratings b
on a.MovieID=b.MovieID
group by lower(a.type),a.MovieID,a.Title
) c
)d where index<=5;
--结果:
d.type d.movieid d.title d.avgrate d.index
action 2905 Sanjuro (1962) 4.608695652173913 1
action 2019 Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954) 4.560509554140127 2
action 858 Godfather, The (1972) 4.524966261808367 3
action 1198 Raiders of the Lost Ark (1981) 4.477724741447892 4
action 260 Star Wars: Episode IV - A New Hope (1977) 4.453694416583082 5
adventure 3172 Ulysses (Ulisse) (1954) 5.0 1
adventure 2905 Sanjuro (1962) 4.608695652173913 2
adventure 1198 Raiders of the Lost Ark (1981) 4.477724741447892 3
adventure 260 Star Wars: Episode IV - A New Hope (1977) 4.453694416583082 4
adventure 1204 Lawrence of Arabia (1962) 4.401925391095066 5
9. 各年评分最高的电影类型(年份,类型,影评分)
----分析
--每一年 每一类电影的平均分
分组:年 电影类型 ->movies表
求:avg ->ratings表
--1. 炸裂 类型
select
MovieID,substr(Title,-5,4) year,tps.tp type
from movies lateral view explode(split(Genres,"\\|")) tps as tp;
--
movieid year type
1 1995 Animation
1 1995 Children's
1 1995 Comedy
2 1995 Adventure
2 1995 Children's
--2. 关联 求每一年 每一类电影的平均分
select
a.year year,lower(a.type) type,avg(b.Rating) avgrate
from
(select
MovieID,substr(Title,-5,4) year,tps.tp type
from movies lateral view explode(split(Genres,"\\|")) tps as tp) a
join ratings b on a.MovieID=b.MovieID
group by a.year,lower(a.type);
--
year type avgrate
1919 action 2.5
1919 adventure 2.6666666666666665
1919 comedy 3.6315789473684212
1919 drama 2.5714285714285716
1920 comedy 3.6666666666666665
1921 action 3.7903225806451615
1922 horror 3.991596638655462
--3. 分组topN
局部:分组 年
排序: avgrate desc
select * from
(select
year,type,avgrate,row_number() over(distribute by year sort by avgrate desc) index
from (
select
a.year year,lower(a.type) type,avg(b.Rating) avgrate
from
(select
MovieID,substr(Title,-5,4) year,tps.tp type
from movies lateral view explode(split(Genres,"\\|")) tps as tp) a
join ratings b on a.MovieID=b.MovieID
group by a.year,lower(a.type)
) c) d where index=1;
-----
d.year d.type d.avgrate d.index
1919 comedy 3.6315789473684212 1
1920 comedy 3.6666666666666665 1
1921 action 3.7903225806451615 1
1922 horror 3.991596638655462 1
1923 comedy 3.4444444444444446 1
1925 comedy 4.124223602484472 1
1926 crime 4.5 1
1927 comedy 4.368932038834951 1
1928 comedy 3.888888888888889 1
10. 每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)
---分析
--每个地区 每一部电影的平均评分
分组:地区(邮编)->users表 电影id->ratings表 电影名->movies表
求:平均分 ->ratings表
select
a.Zipcode zipcode,b.MovieID MovieID,avg(b.Rating) avgrate
from users a join ratings b on a.UserID=b.UserID join
movies c on b.MovieID=c.MovieID
group by a.zipcode,b.MovieID,c.Title;
--
zipcode movieid avgrate
00231 3 4.0
00231 7 5.0
00231 11 4.0
00231 17 3.0
00231 39 4.0
00231 46 3.0
--每个地区评分最高的 分组取TopN
分组:地区
排序: 评分 desc
select * from
(select
zipcode,movieid,avgrate,row_number() over(partition by zipcode order by avgrate desc) index
from (
select
a.Zipcode zipcode,b.MovieID MovieID,avg(b.Rating) avgrate
from users a join ratings b on a.UserID=b.UserID join
movies c on b.MovieID=c.MovieID
group by a.zipcode,b.MovieID,c.Title
) d ) e where index=1;
----
e.zipcode e.movieid e.avgrate e.index
00231 2006 5.0 1
00606 3937 5.0 1
00681 3793 5.0 1
00693 2916 5.0 1
00918 2706 5.0 1