文章目录

  • 数据描述
  • 数据要求
  • 题目
  • 1. 正确建表,导入数据(三张表,三份数据),并验证是否正确
  • 2. 求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)
  • 3. 分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)
  • 4. 求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)
  • 5. 求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)
  • 6. 求好片(评分>=4.0)最多的那个年份(上映)的最好看的10部电影
  • 7. 求1997年上映的电影中,评分最高的10部Comedy类电影
  • 8. 该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)
  • 9. 各年评分最高的电影类型(年份,类型,影评分)
  • 10. 每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)


数据描述

  1. users.dat 用户表
    数据格式为: 2::M::56::16::70072
    对应字段为:UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
    对应字段中文解释:用户id,性别,年龄,职业,邮政编码
  2. movies.dat 电影表
    数据格式为: 2::Jumanji (1995)::Adventure|Children’s|Fantasy
    对应字段为:MovieID BigInt, Title String, Genres String
    对应字段中文解释:电影ID,电影名字,电影类型
  3. ratings.dat 评分表
    数据格式为: 1::1193::5::978300760
    对应字段为:UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
    对应字段中文解释:用户ID,电影ID,评分,评分时间戳

数据要求

(1)写shell脚本清洗数据。(hive不支持解析多字节的分隔符,也就是说hive只能解析’:’, 不支持解析’::’,所以用普通方式建表来使用是行不通的,要求对数据做一次简单清洗)

(2)使用Hive能解析的方式进行
要求:外部表
hdfs: /user/data/yingping/movies ratings users

题目

-- 设置本地模式 及 显示表头
hive> set hive.exec.mode.local.auto=true;
hive> set hive.cli.print.header=true;
-- hdfs 数据目录创建
hive> dfs -mkdir -p /user/data/yingping/movies;
hive> dfs -mkdir -p /user/data/yingping/ratings;
hive> dfs -mkdir -p /user/data/yingping/users;
-- hdfs 数据目录 上传数据
hive> dfs -put users.dat /user/data/yingping/users;
hive> dfs -put movies.dat /user/data/yingping/movies;
hive> dfs -put ratings.dat /user/data/yingping/ratings;
-- 创建yingping数据库
hive> create database yingping;
hive> use yingping;

1. 正确建表,导入数据(三张表,三份数据),并验证是否正确

-- 创建表 movies
create external table if not exists movies(
MovieID BigInt, Title String, Genres String) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s') stored as textfile 
location '/user/data/yingping/movies';

hive> select * from movies limit 2;
OK
movies.movieid	movies.title	movies.genres
1	Toy Story (1995)	Animation|Children's|Comedy
2	Jumanji (1995)	Adventure|Children's|Fantasy
Time taken: 1.142 seconds, Fetched: 2 row(s)

-- 创建表 ratings
create external table if not exists ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String) 
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s') stored as textfile 
location '/user/data/yingping/ratings';

hive> select * from ratings limit 2;
OK
ratings.userid	ratings.movieid	ratings.rating	ratings.timestamped
1	1193	5.0	978300760
1	661	3.0	978302109
Time taken: 0.145 seconds, Fetched: 2 row(s)

-- 创建表 users
create external table if not exists users(userid bigint,gender string,age int,occupation string,zipcode string) row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe' 
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s') stored as textfile location '/user/data/yingping/users';

hive> select * from users limit 2;
OK
users.userid	users.gender	users.age	users.occupation	users.zipcode
1	F	1	10	48067
2	M	56	16	70072
Time taken: 0.137 seconds, Fetched: 2 row(s)

2. 求被评分次数最多的10部电影,并给出评分次数(电影名,评分次数)

----分析
--也就是要先求出,每一部电影总评分次数 排序
分组:电影id title ->movies表
求:count() ->ratings表
排序:count() desc limit 10

select 
a.Title,count(b.Rating) totalRate
from movies a join ratings b
on a.MovieID=b.MovieID
group by a.MovieID,a.Title
order by totalRate desc limit 10;

--结果
a.title	totalrate
American Beauty (1999)	3428
Star Wars: Episode IV - A New Hope (1977)	2991
Star Wars: Episode V - The Empire Strikes Back (1980)	2990
Star Wars: Episode VI - Return of the Jedi (1983)	2883
Jurassic Park (1993)	2672
Saving Private Ryan (1998)	2653
Terminator 2: Judgment Day (1991)	2649
Matrix, The (1999)	2590
Back to the Future (1985)	2583
Silence of the Lambs, The (1991)	2578

3. 分别求男性,女性当中评分最高的10部电影(性别,电影名,影评分)

方法一:分开求

----分析
--1.求男性
过滤条件:男性 ->users表
每一部电影的平均评分
分组:电影Id 电影名 ->movies表
求:avg(Rating) -> ratings表
排序:avg(Rating) desc limit 10

select
c.Gender Gender,
a.Title Title,
avg(b.Rating) avgrate 
from movies a join ratings b on a.MovieID=b.MovieId
join users c on b.UserID=c.UserID
where c.Gender='M'
group by a.MovieID,a.Title,c.Gender
order by avgrate desc limit 10;

--结果
gender	title	avgrate
M	Small Wonders (1996)	5.0
M	Smashing Time (1967)	5.0
M	Baby, The (1973)	5.0
M	Gate of Heavenly Peace, The (1995)	5.0
M	Schlafes Bruder (Brother of Sleep) (1995)	5.0
M	Dangerous Game (1993)	5.0
M	Follow the Bitch (1998)	5.0
M	Bells, The (1926)	5.0
M	Lured (1947)	5.0
M	Angela (1995)	5.0

--2.求女性
select 
c.Gender Gender,a.Title Title,avg(b.Rating) avgrate 
from 
movies a join ratings b 
on a.MovieID=b.MovieID join users c on b.UserID=c.UserID 
where c.Gender='F' 
group by a.MovieID,a.Title,c.Gender 
order by avgrate desc limit 10;

--结果
gender	title	avgrate
F	For the Moment (1994)	5.0
F	Gate of Heavenly Peace, The (1995)	5.0
F	Skipped Parts (2000)	5.0
F	Raw Deal (1948)	5.0
F	Clean Slate (Coup de Torchon) (1981)	5.0
F	Coldblooded (1995)	5.0
F	Country Life (1994)	5.0
F	Message to Love: The Isle of Wight Festival (1996)	5.0
F	Bittersweet Motel (2000)	5.0
F	Lamerica (1994)	5.0

方法二:分析函数

----分析
--每个性别中评分最高的10部电影

--每个性别中每部电影的评分
分组:Gender MovieID Title
求:avg
select 
c.Gender Gender,a.Title Title,avg(b.Rating) avgrate 
from movies a join ratings b on a.MovieID=b.MovieID 
join users c on b.UserID=c.UserID 
group by c.Gender,a.MovieID,a.Title;

--分组求topN
局部排序
分组:Gender
排序: avgrate
select * from 
(select 
Gender,Title,avgrate,row_number() over(distribute by Gender sort by avgrate desc) index 
from 
(select 
c.Gender Gender,a.Title Title,avg(b.Rating) avgrate 
from movies a join ratings b on a.MovieID=b.MovieID 
join users c on b.UserID=c.UserID 
group by c.Gender,a.MovieID,a.Title
) d) e where index<=10;

--结果
e.gender	e.title	e.avgrate	e.index
F	Brother, Can You Spare a Dime? (1975)	5.0	1
F	Woman of Paris, A (1923)	5.0	2
F	Ayn Rand: A Sense of Life (1997)	5.0	3
F	One Little Indian (1973)	5.0	4
F	Song of Freedom (1936)	5.0	5
F	Belly (1998)	5.0	6
F	Gambler, The (A J�t�kos) (1997)	5.0	7
F	Clean Slate (Coup de Torchon) (1981)	5.0	8
F	Country Life (1994)	5.0	9
F	Other Side of Sunday, The (S�ndagsengler) (1996)	5.0	10
M	Bells, The (1926)	5.0	1
M	Dangerous Game (1993)	5.0	2
M	Follow the Bitch (1998)	5.0	3
M	Small Wonders (1996)	5.0	4
M	Schlafes Bruder (Brother of Sleep) (1995)	5.0	5
M	Angela (1995)	5.0	6
M	Baby, The (1973)	5.0	7
M	Ulysses (Ulisse) (1954)	5.0	8
M	Gate of Heavenly Peace, The (1995)	5.0	9
M	Smashing Time (1967)	5.0	10

4. 求movieid = 2116这部电影各年龄段(因为年龄就只有7个,就按这个7个分就好了)的平均影评(年龄段,影评分)

--分析
过滤:movieID=2116 -> ratings表
分组:年龄 -> users表
求:avg ->ratings表

select
a.Age,avg(b.Rating) avgrate
from users a join ratings b
on a.UserID=b.UserID
where b.MovieID=2116
group by a.Age;

--结果
a.age	avgrate
1	3.2941176470588234
18	3.3580246913580245
25	3.436548223350254
35	3.2278481012658227
45	2.8275862068965516
50	3.32
56	3.5

5. 求最喜欢看电影(影评次数最多)的那位女性评最高分的10部电影的平均影评分(观影者,电影名,影评分)

首先这个题目可能有点歧义:
1.求,这个女性评分最高的10部电影的总平均分
2. 求,这个女生评分最高的10部电影的,每一电影的平均分

--如果是2.这个女生评分最高的10部电影的,每一电影的平均分
----分析
--1) 先找到这位女性
过虑:Gender="F" ->users表
分组:UserId ->ratings表
求:count() ->ratings表
排序:count() desc limit 1

create table answer05_step01 as 
select
a.UserID UserID,count(a.Rating) totalRate
from ratings a join users b 
on a.UserID=b.UserID
where b.Gender="F"
group by a.UserID
order by totalRate desc limit 1;

--结果
userid	totalrate
1150	1302

--2) 然后这个女性最喜欢的10部电影
--这个女生评分最高的10部电影
过滤:女性 1150 ->answer05_step01表
排序:评分 desc limit10 ->ratings表
create table answer05_step02 as 
select 
a.UserID UserID,a.MovieID MovieID,a.Rating Rating 
from ratings a join answer05_step01 b on 
a.UserID=b.UserID 
order by Rating desc limit 10;
----****注意  where子查询运用一个查询结果问题***

--结果(这位女性评分最高的10部电影)
hive> select * from answer05_step02;
OK
answer05_step02.userid	answer05_step02.movieid	answer05_step02.rating
1150	745		5.0
1150	1279	5.0
1150	1236	5.0
1150	904		5.0
1150	750		5.0
1150	2997	5.0
1150	2064	5.0
1150	905		5.0
1150	1094	5.0
1150	1256	5.0
Time taken: 0.128 seconds, Fetched: 10 row(s)

--3) 她喜欢的10部电影的平均影评 (每一部电影的平均影评)
分组:MovieID ->answer05_step02表
求 avg -> ratings表
select 
a.MovieID MovieID,avg(b.Rating) avgrate 
from answer05_step02 a left join ratings b 
on a.MovieID=b.MovieID 
group by a.MovieID;
--结果
movieid	avgrate
745		4.52054794520548
750		4.4498902706656915
904		4.476190476190476
905		4.280748663101604
1094	3.7314890154597236
1236	4.188888888888889
1256	4.21043771043771
1279	3.747422680412371
2064	4.0739348370927315
2997	4.125390450691656

----上面3步联合查询就是
select 
a.MovieID MovieID,avg(b.Rating) avgrate 
from 
(
select 
a.UserID UserID,a.MovieID MovieID,a.Rating Rating 
from ratings a join 
(
select
a.UserID UserID,count(a.Rating) totalRate
from ratings a join users b 
on a.UserID=b.UserID
where b.Gender="F"
group by a.UserID
order by totalRate desc limit 1
) b on 
a.UserID=b.UserID 
order by Rating desc limit 10
) 
a left join ratings b 
on a.MovieID=b.MovieID 
group by a.MovieID;

--结果
movieid	avgrate
162		4.063136456211812
904		4.476190476190476
951		4.249370277078086
1230	4.14167916041979
1966	3.6464646464646466
2330	4.163043478260869
3163	3.7039473684210527
3307	4.387453874538745
3671	4.047363717605005
3675	3.8265682656826567


---联合查询时候,会替换上边儿分步表,那我们实验一下,表别名重复影响不影响?
--实验发现他们局部表名,不影响
select 
a.MovieID MovieID,avg(b.Rating) avgrate 
from 
(
select 
c.UserID UserID,c.MovieID MovieID,c.Rating Rating 
from ratings c join 
(
select 
e.UserID UserID,count(e.Rating) totalRate 
from ratings e join users f on e.UserID=f.UserID 
where f.Gender="F" 
group by e.UserID 
order by totalRate desc limit 1
) d 
on 
c.UserID=d.UserID 
order by Rating desc limit 10
) a left join ratings b 
on a.MovieID=b.MovieID 
group by a.MovieID;

--结果
movieid	avgrate
162	4.063136456211812
904	4.476190476190476
951	4.249370277078086
1230	4.14167916041979
1966	3.6464646464646466
2330	4.163043478260869
3163	3.7039473684210527
3307	4.387453874538745
3671	4.047363717605005
3675	3.8265682656826567


---但是,我发现联合查询与分步查询的结果不一致???

6. 求好片(评分>=4.0)最多的那个年份(上映)的最好看的10部电影

----分析
--好片最多的那个年份?
--每一部电影的平均评分
--每一年评分>=4.0的电影部数   按这个部数排序 desc limit1

--1.每一年每一部电影的平均评分
分组:年份	电影 -> movies表
求: avg(Ratings) ->ratings表
--电影表数据是这样的,电影名称截取年份
movies.movieid	movies.title	movies.genres
1	Toy Story (1995)	Animation|Children's|Comedy

create table year_movie_avgrate as 
select
substr(a.Title,-5,4) year,a.MovieID MovieID,a.Title Title,avg(b.Rating) avgrate 
from movies a join ratings b
on a.MovieID=b.MovieID
group by a.MovieID,a.Title,substr(a.Title,-5,4);
--year_movie_avgrate 存入数据样例
year	movieid	title	avgrate
1995	1	Toy Story (1995)	4.146846413095811
1995	2	Jumanji (1995)	3.20114122681883
1995	3	Grumpier Old Men (1995)	3.01673640167364


--2. 好片最多的那个年份
分组:年份
过滤 >=4.0
求:count()
排序:count() desc limit 1

select 
year,count(*) totalMovie 
from year_movie_avgrate 
where avgrate>=4.0 
group by year 
order by totalMovie desc limit 1;
--结果
year	totalmovie
1998	27

--3. 求这一年最好看的10部电影
过滤:1998 ->year_movie_avgrate表
排序:评分 倒序 limit 10
select 
b.MovieID MovieID,b.Title Title,b.avgrate avgrate 
from (select 
year,count(*) totalMovie 
from year_movie_avgrate 
where avgrate>=4.0 
group by year 
order by totalMovie desc limit 1) a 
left join year_movie_avgrate b 
on a.year=b.year 
order by avgrate desc limit 10;
--结果
movieid	title	avgrate
1830	Follow the Bitch (1998)	5.0
2503	Apple, The (Sib) (1998)	4.666666666666667
2309	Inheritors, The (Die Siebtelbauern) (1998)	4.5
2930	Return with Honor (1998)	4.4
2028	Saving Private Ryan (1998)	4.337353938937053
2360	Celebration, The (Festen) (1998)	4.3076923076923075
2839	West Beirut (West Beyrouth) (1998)	4.3
2357	Central Station (Central do Brasil) (1998)	4.283720930232558
3077	42 Up (1998)	4.2272727272727275
2329	American History X (1998)	4.2265625

7. 求1997年上映的电影中,评分最高的10部Comedy类电影

----分析
year_movie_avgrate表:  year movieid  title avgrate
过滤:1997& Comedy  -> movies表
排序:评分	desc limit 10

select 
a.year year,a.MovieID MovieID,a.Title Title,a.avgrate avgrate,b.Genres Genres 
from year_movie_avgrate a join movies b 
on a.MovieID=b.MovieID 
where a.year="1997" and instr(lower(b.Genres),"comedy")>0 
order by avgrate desc limit 10;

--结果
year	movieid	title	avgrate	genres
1997	2324	Life Is Beautiful (La Vita � bella) (1997)	4.329861111111111	Comedy|Drama
1997	2444	24 7: Twenty Four Seven (1997)	4.0	Comedy|Drama
1997	1827	Big One, The (1997)	4.0	Comedy|Documentary
1997	1871	Friend of the Deceased, A (1997)	4.0	Comedy|Drama
1997	1784	As Good As It Gets (1997)	3.9501404494382024	Comedy|Drama
1997	2618	Castle, The (1997)	3.891304347826087	Comedy
1997	1641	Full Monty, The (1997)	3.872393661384487	Comedy
1997	1564	Roseanna's Grave (For Roseanna) (1997)	3.8333333333333335	Comedy|Romance
1997	1734	My Life in Pink (Ma vie en rose) (1997)	3.825870646766169	Comedy|Drama
1997	1500	Grosse Pointe Blank (1997)	3.813380281690141	Comedy|Crime

8. 该影评库中各种类型电影中评价最高的5部电影(类型,电影名,平均影评分)

----分析
分组:类型 MovieID ->movies表
求:avg ->ratings表
--1. 炸裂,电影类型
select 
MovieID,Title,tps.tp type 
from movies 
lateral view explode(split(Genres,"\\|")) tps as tp; 
-- 数据
movieid	title	type
1	Toy Story (1995)	Animation
1	Toy Story (1995)	Children's
1	Toy Story (1995)	Comedy
2	Jumanji (1995)	Adventure
2	Jumanji (1995)	Children's

--2. 关联求 每种类型 每一部电影的平均分
select 
lower(a.type) type,a.MovieID MovieID,a.Title Title,avg(b.Rating) avgrate 
from (
select 
MovieID,Title,tps.tp type 
from movies 
lateral view explode(split(Genres,"\\|")) tps as tp
) a join ratings b
on a.MovieID=b.MovieID
group by lower(a.type),a.MovieID,a.Title;

--结果
type	movieid	title	avgrate
action	6	Heat (1995)	3.8787234042553194
action	9	Sudden Death (1995)	2.656862745098039
action	10	GoldenEye (1995)	3.5405405405405403
action	15	Cutthroat Island (1995)	2.458904109589041
action	20	Money Train (1995)	2.5375

--3. 分组求topN
分组:type
排序:avgrate desc  

select * from(
select 
c.type type,c.MovieID MovieID,c.Title Title,c.avgrate avgrate,row_number() over(partition by c.type order by c.avgrate desc) index 
from(
select 
lower(a.type) type,a.MovieID MovieID,a.Title Title,avg(b.Rating) avgrate 
from (
select 
MovieID,Title,tps.tp type 
from movies 
lateral view explode(split(Genres,"\\|")) tps as tp
) a join ratings b
on a.MovieID=b.MovieID
group by lower(a.type),a.MovieID,a.Title
) c
)d where index<=5;

--结果:
d.type	d.movieid	d.title	d.avgrate	d.index
action	2905	Sanjuro (1962)	4.608695652173913	1
action	2019	Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)	4.560509554140127	2
action	858	Godfather, The (1972)	4.524966261808367	3
action	1198	Raiders of the Lost Ark (1981)	4.477724741447892	4
action	260	Star Wars: Episode IV - A New Hope (1977)	4.453694416583082	5
adventure	3172	Ulysses (Ulisse) (1954)	5.0	1
adventure	2905	Sanjuro (1962)	4.608695652173913	2
adventure	1198	Raiders of the Lost Ark (1981)	4.477724741447892	3
adventure	260	Star Wars: Episode IV - A New Hope (1977)	4.453694416583082	4
adventure	1204	Lawrence of Arabia (1962)	4.401925391095066	5

9. 各年评分最高的电影类型(年份,类型,影评分)

----分析
--每一年 每一类电影的平均分
分组:年 电影类型 ->movies表
求:avg ->ratings表
--1. 炸裂 类型
select 
MovieID,substr(Title,-5,4) year,tps.tp type 
from movies lateral view explode(split(Genres,"\\|")) tps as tp;
--
movieid	year	type
1	1995	Animation
1	1995	Children's
1	1995	Comedy
2	1995	Adventure
2	1995	Children's

--2. 关联 求每一年 每一类电影的平均分
select 
a.year year,lower(a.type) type,avg(b.Rating) avgrate 
from 
(select 
MovieID,substr(Title,-5,4) year,tps.tp type 
from movies lateral view explode(split(Genres,"\\|")) tps as tp) a 
join ratings b on a.MovieID=b.MovieID 
group by a.year,lower(a.type);

--
year	type	avgrate
1919	action	2.5
1919	adventure	2.6666666666666665
1919	comedy	3.6315789473684212
1919	drama	2.5714285714285716
1920	comedy	3.6666666666666665
1921	action	3.7903225806451615
1922	horror	3.991596638655462

--3. 分组topN
局部:分组   年
排序:  avgrate  desc 
select * from 
(select 
year,type,avgrate,row_number() over(distribute by year sort by avgrate desc) index 
from (
select 
a.year year,lower(a.type) type,avg(b.Rating) avgrate 
from 
(select 
MovieID,substr(Title,-5,4) year,tps.tp type 
from movies lateral view explode(split(Genres,"\\|")) tps as tp) a 
join ratings b on a.MovieID=b.MovieID 
group by a.year,lower(a.type)
) c) d where index=1;
-----
d.year	d.type	d.avgrate	d.index
1919	comedy	3.6315789473684212	1
1920	comedy	3.6666666666666665	1
1921	action	3.7903225806451615	1
1922	horror	3.991596638655462	1
1923	comedy	3.4444444444444446	1
1925	comedy	4.124223602484472	1
1926	crime	4.5	1
1927	comedy	4.368932038834951	1
1928	comedy	3.888888888888889	1

10. 每个地区最高评分的电影名,把结果存入HDFS(地区,电影名,影评分)

---分析
--每个地区 每一部电影的平均评分
分组:地区(邮编)->users表 电影id->ratings表 电影名->movies表
求:平均分 ->ratings表

select 
a.Zipcode zipcode,b.MovieID MovieID,avg(b.Rating) avgrate 
from users a join ratings b  on a.UserID=b.UserID join 
movies c on b.MovieID=c.MovieID 
group by a.zipcode,b.MovieID,c.Title;

--
zipcode	movieid	avgrate
00231	3	4.0
00231	7	5.0
00231	11	4.0
00231	17	3.0
00231	39	4.0
00231	46	3.0

--每个地区评分最高的 分组取TopN
分组:地区
排序: 评分 desc
select * from 
(select 
zipcode,movieid,avgrate,row_number() over(partition by zipcode order by avgrate desc) index 
from (
select 
a.Zipcode zipcode,b.MovieID MovieID,avg(b.Rating) avgrate 
from users a join ratings b  on a.UserID=b.UserID join 
movies c on b.MovieID=c.MovieID 
group by a.zipcode,b.MovieID,c.Title
) d ) e where index=1;

----
e.zipcode	e.movieid	e.avgrate	e.index
00231	2006	5.0	1
00606	3937	5.0	1
00681	3793	5.0	1
00693	2916	5.0	1
00918	2706	5.0	1