接上一篇ETL流程化处理数据,本篇基于ETL生成的数据进行清洗和分析…清洗不太会,没有洗衣粉,将就看HSQL吧,主要是开窗函数和优先级之间的关系,剩下的就是语法,就是干…MD
数据格式请看上一篇
两个表 一个是use用户信息表,一个是content评论表,两个维度表
Table格式在这里
这个是Content表,words字段是评论区的内容,大部分都是以@分割的评论或转发内容,lotime字段是时间,其他字段不重要,city和location一个是评论数一个是转发数,这里是谁不重要,只是映射关键是技术的掌握,当然除了在表中搜索还可以split分割匹配size转发数量和评论数量,高级一点别太low~
这个是use用户表关键的是两个数据,1,id,inner join关联时用,remark分区名称,微博可能叫话题把,俺也不懂
this1表部分数据
this2表部分数据
大致了解数据字典的内容后我们开始做题吧,
先安利一个自己的脚本,hive的操作实在是太让人眼花了,干了6个小时眼睛都快瞎了…
看代码
#!/bin/sh
sql=$1
#2.open databelse
echo '++++++start hive database wangtianxin++++++'
echo '++++++write sql1++++++'
hive -e "use 你的数据库;$sql;" > e1.txt
clear
cat e1.txt
echo '----------end--------'
使用教程
回车
结果,结束后可以干别的,不用看hive折磨人的界面
不过没有解决shell的特殊字符拼接问题,可能要用commen的那个jar包,反编译特殊字符还是太麻烦了,不过简单的语句别搞split还是可以让你爽的哈哈哈哈
微博数据分析
1、(基础分析题)各频道参与博主的人数分布排行,求top5
select num,remark,rank from(select num,remark,dense_rank() over(order by num desc) as rank from(select count(1) num,remark from this2 where remark!=‘null’ group by remark)as aaa) as a where rank <=5;
2、(基础分析题)各频道评论量排行,求top5
select remark,num,num1 from(select remark,num,dense_rank() over(order by num) as num1 from(select remark,sum(num1) as num from (select remark,num1 from (select uid, count(*) num1 from this1 group by uid) as a inner join this2 on a.uid = this2.id )a group by remark) a) a where num1<=5;
3、(基础分析题)各频道评论转发量排行,求top5
函数测试
String测试
数组测试
select remark,rk from (select remark ,rank() over(order by a desc)as rk from(select remark,sum(a) as a from (select a,remark from(select uid,sum(city+location) as a from this1 group by uid)as a inner join this2 on this2.id=a.uid) as a group by remark)as a)as a where rk<=5;
table1 select uid,sum(size(split(words, ‘//@’))-1) as nu from this1 group by uid
table2 select remark ,nu from as table1 inner join this2 on table1.uid=this2.id
tbale3 select remark,sum(nu) as n from table2 group by remark
table3 select remark,n,rank() over(order by remark desc) as rk from table2
table4 select remark ,rk ,num from table3 where rk<=5select remark ,nu from (select uid,sum(size(split(words, ‘//@’))-1) as nu from this1 group by uid) as table1 inner join this2 on table1.uid=this2.id
select remark,sum(nu) as n from (select remark ,nu from (select uid,sum(size(split(words, ‘//@’))-1) as nu from this1 group by uid) as table1 inner join this2 on table1.uid=this2.id) as table2 group by remark
select remark,n,rank() over(order by remark desc) as rk from (select remark,sum(nu) as n from (select remark ,nu from (select uid,sum(size(split(words, ‘//@’))-1) as nu from this1 group by uid) as table1 inner join this2 on table1.uid=this2.id) as table2 group by remark
) as aselect remark ,rk from (select remark,n,rank() over(order by remark desc) as rk from (select remark,sum(nu) as n from (select remark ,nu from (select uid,sum(size(split(words, ‘//@’))-1) as nu from this1 group by uid) as table1 inner join this2 on table1.uid=this2.id) as table2 group by remark) as a)aa where rk<=5
4、(基础分析题)各频道博文评论的回复(再回复)量排行,求top5
select remark ,rk from (select remark,n,rank() over(order by remark desc) as rk from (select remark,sum(nu) as n from (select remark ,nu from (select uid,sum(size(split(words, ‘回复@’))-1) as nu from this1 group by uid) as table1 inner join this2 on table1.uid=this2.id) as table2 group by remark) as a)aa where rk<=5
5、(综合分析题)各频道活跃度排行
***先定义合理的计算口径,然后写对应的SQL语句
step1 : 候选影响因子集合找到
博主参与量,评论量,评论转发量,回复量(评论内容质量)
step2 : 拟定一个计算公式
影响因子的权重:基准为1=评论数量即为1,
有博主参与权重1.1,
有转发的权重:2
回复量的权重:1.5
计算口径
number:参与话题人数
t1:评论量
t2:回复量
select remark,count(uid) as number,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 from this1 inner join this2 on this2.id=this1.uid group by remark
select remark ,number,rank() over(order by number1.1+t11.5+t2*2 desc) as rk from (select remark,count(uid) as number,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 from this1 inner join this2 on this2.id=this1.uid group by remark) as a
6、(综合分析题)各频道的博主的影响力排行,求top5
***先定义合理的计算口径,然后写对应的SQL语句
Select id,remark,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 from this1 inner join this2 on this2.id=this1.uid group by id,remark
Select id,remark,sum(t1+t2) over(partition by id) as af from (Select id,remark,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 from this1 inner join this2 on this2.id=this1.uid group by id,remark)a
Select id,remark,sum(t1+t2) over(partition by id) as af from (Select id,remark,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 from this1 inner join this2 on this2.id=this1.uid group by id,remark)a
Select remark,id,af,rank() over(partition by remark order by af) from (Select id,remark,sum(t1+t2) over(partition by id) as af from (Select id,remark,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 from this1 inner join this2 on this2.id=this1.uid group by id,remark)a)a;
Select remark,id,af,rank() over(partition by remark order by af desc) as rk from (Select id,remark,sum(t1+t2) over(partition by id) as af from (Select id,remark,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 from this1 inner join this2 on this2.id=this1.uid group by id,remark)a)a;
Select remark,id,af,rk from(Select remark,id,af,rank() over(partition by remark order by af desc) as rk from (Select id,remark,sum(t1+t2) over(partition by id) as af from (Select id,remark,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 from this1 inner join this2 on this2.id=this1.uid group by id,remark)a)a)as a where rk <=5
7、(综合分析题)各频道的博文的影响力排行,求top5
***先定义合理的计算口径,然后写对应的SQL语句
Select sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 , uid from this1 group by words,uid
假设数据干净,没有脏数据重复博文,和多个评论分区,开始计算博文权重
Select uid,t1,t2,remark from(Select sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 , uid from this1 group by words,uid)as a inner join this2 on a.uid=this2.id
Select remark,uid,sum(t1+t2) over(partition by words)as rf from (Select words,uid,t1,t2,remark from(Select words,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 , uid from this1 group by words,uid)as a inner join this2 on a.uid=this2.id)as a
Select remark, uid,rf,dense_rank() over(partition by remark order by rf desc) as rka from(Select words, remark,uid,sum(t1+t2) over(partition by words)as rf from (Select words,uid,t1,t2,remark from(Select words,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 , uid from this1 group by words,uid)as a inner join this2 on a.uid=this2.id)as a)as a
Select remark,uid,rf,rka from(Select remark, uid,rf,rank() over(partition by remark order by rf desc) as rka from(Select words, remark,uid,sum(t1+t2) over(partition by words)as rf from (Select words,uid,t1,t2,remark from(Select words,sum(size(split(words, ‘//@’))-1) as t1, sum(size(split(words, ‘回复@’))-1) as t2 , uid from this1 group by words,uid)as a inner join this2 on a.uid=this2.id)as a)as a)as a where rka <=5
8、(综合分析题)各频道的博主的行为特征分析之每天的活跃时间段分析,
求各频道每天最活跃时间段top3,以小时为单位即可。
***先定义合理的计算口径,然后写对应的SQL语句
select uid,sum(size(split(words, ‘//@’))-1) over(partition by uid) as t1, sum(size(split(words, ‘回复@’))-1) over(partition by uid) as t2 ,hour(lotime) as time from this1 ;
Select remark, uid, t1, t2, hotime from(select uid,sum(size(split(words, ‘//@’))-1) over(partition by uid) as t1, sum(size(split(words, ‘回复@’))-1) over(partition by uid) as t2 ,hour(lotime) as hotime from this1 )as a inner join this2 on a.uid=this2.id
用id分组不对,要用时间分组
Select remark, uid, t1, t2, hotime from(select uid,sum(size(split(words, ‘//@’))-1) over(partition by lotime) as t1, sum(size(split(words, ‘回复@’))-1) over(partition by lotime) as t2 ,hour(lotime) as hotime from this1 )as a inner join this2 on a.uid=this2.id
这样就对了然后在分组就行
Select remark, uid, t1, t2, hotime,words from(select words,uid,sum(size(split(words, ‘//@’))-1) over(partition by lotime) as t1, sum(size(split(words, ‘回复@’))-1) over(partition by lotime) as t2 ,hour(lotime) as hotime from this1 )as a inner join this2 on a.uid=this2.id
缺少了remark判定每一行的唯一条件,只能补上,眼瞎了快!
Select remark , uid,sum(t1+t2) over(partition by words),hotime from(Select remark, uid, t1, t2, hotime,words from(select words,uid,sum(size(split(words, ‘//@’))-1) over(partition by lotime) as t1, sum(size(split(words, ‘回复@’))-1) over(partition by lotime) as t2 ,hour(lotime) as hotime from this1 )as a inner join this2 on a.uid=this2.id) as a
Select remark,uid,rf,hotime,rank() over(order by remark,hotime) from(Select remark , uid,sum(t1+t2) over(partition by words) as rf,hotime from(Select remark, uid, t1, t2, hotime,words from(select words,uid,sum(size(split(words, ‘//@’))-1) over(partition by lotime) as t1, sum(size(split(words, ‘回复@’))-1) over(partition by lotime) as t2 ,hour(lotime) as hotime from this1 )as a inner join this2 on a.uid=this2.id) as a)as a
按照分区和时间排序
按照时间分组将rf合并
Select remark,uid,sum(rf) over(partition by remark,hotime) as tm,hotime,rk1 from (Select remark,uid,rf,hotime,rank() over(order by remark,hotime) as rk1 from(Select remark , uid,sum(t1+t2) over(partition by words) as rf,hotime from(Select remark, uid, t1, t2, hotime,words from(select words,uid,sum(size(split(words, ‘//@’))-1) over(partition by lotime) as t1, sum(size(split(words, ‘回复@’))-1) over(partition by lotime) as t2 ,hour(lotime) as hotime from this1 )as a inner join this2 on a.uid=this2.id) as a)as a)as a
按照时间rank
Select hotime ,remark,tm,rank() over(partition by remark order by tm desc) from(Select distinct hotime ,remark,tm from(Select remark,uid,sum(rf) over(partition by remark,hotime) as tm,hotime,rk1 from (Select remark,uid,rf,hotime,rank() over(order by remark,hotime) as rk1 from(Select remark , uid,sum(t1+t2) over(partition by words) as rf,hotime from(Select remark, uid, t1, t2, hotime,words from(select words,uid,sum(size(split(words, ‘//@’))-1) over(partition by lotime) as t1, sum(size(split(words, ‘回复@’))-1) over(partition by lotime) as t2 ,hour(lotime) as hotime from this1 )as a inner join this2 on a.uid=this2.id) as a)as a)as a)as a)as a
选出前五
Select hotime ,remark,tm, rk from(Select hotime ,remark,tm,rank() over(partition by remark order by tm desc) as rk from(Select distinct hotime ,remark,tm from(Select remark,uid,sum(rf) over(partition by remark,hotime) as tm,hotime,rk1 from (Select remark,uid,rf,hotime,rank() over(order by remark,hotime) as rk1 from(Select remark , uid,sum(t1+t2) over(partition by words) as rf,hotime from(Select remark, uid, t1, t2, hotime,words from(select words,uid,sum(size(split(words, ‘//@’))-1) over(partition by lotime) as t1, sum(size(split(words, ‘回复@’))-1) over(partition by lotime) as t2 ,hour(lotime) as hotime from this1 )as a inner join this2 on a.uid=this2.id) as a)as a)as a)as a)as a)as a where rk <=5
看体育的都是饭点,看公安的都是白天,估计是犯事了,看育儿的都是上午,估计是孩子醒了,看风水的都是早九点到下午3点,中午开始算命?,看军事的都是在中午,估计是中午才出当天最新的军事新闻