Data Cleaning

  • Create the project
  • 1. Step 1: import the JSON
  • 2. Upload the CSV files to HDFS
  • 3. Create the HDFS directories
  • 4. Upload the tables
  • 5. Clean up the database
  • 6. Modify the interpreter
  • 7. Create the raw data tables and upload the CSV files
  • Data problem analysis
  • Table processing
  • Problem 1: rows 8001–8100 contain duplicates
  • Problem 2: filter out store_review rows with no review score
  • Problem 3: encrypting credit_no
  • Problem 4: partitioning the transaction data by date (YYYY-MM)


CSV processing, after the tables have been recreated and cleaned:

Problem 2: filter out store_review rows with no review score

Solution: remove the useless rows and reload the remaining data into a new table.
Code implementation
First, create a new table to receive the cleaned data:

Create the table store_review:

%hive
create table if not exists store_review(
 transaction_id string,
 store_id string,
 review_score string
)
row format delimited fields terminated by ','
stored as textfile

Remove the useless data
Approach:

select * from ext_store_review s
where exists (
  select * from ext_transaction_details t
  where cast(s.transaction_id as int) = cast(t.transaction_id as int)
    and cast(s.store_id as int) = cast(t.store_id as int)
)
and s.review_score != ''

This filters out exactly the rows we need: reviews that have a score and whose transaction_id and store_id both appear in the two tables.
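The EXISTS clause above is effectively a semi-join: a review row is kept only if a matching (transaction_id, store_id) pair exists in the transaction table and its score is non-empty. A minimal Python sketch of the same logic (the sample rows are made up for illustration):

```python
# Semi-join filter: keep a review only if its (transaction_id, store_id)
# pair appears in the transaction table and review_score is non-empty.
transactions = [
    {"transaction_id": "1", "store_id": "10"},
    {"transaction_id": "2", "store_id": "20"},
]
reviews = [
    {"transaction_id": "1", "store_id": "10", "review_score": "5"},
    {"transaction_id": "2", "store_id": "20", "review_score": ""},   # no score -> dropped
    {"transaction_id": "3", "store_id": "30", "review_score": "4"},  # no matching transaction -> dropped
]

# Set of known (transaction_id, store_id) pairs, cast to int
# just as the HQL does with cast(... as int).
known = {(int(t["transaction_id"]), int(t["store_id"])) for t in transactions}

cleaned = [
    r for r in reviews
    if r["review_score"] != ""
    and (int(r["transaction_id"]), int(r["store_id"])) in known
]
```

Only the first review survives; the other two are dropped for the reasons noted in the comments.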



There is more than one way to write this HQL statement, but the first version is the optimized one:

select * from (
  select * from ext_store_review s
  where exists (
    select * from ext_transaction_details t
    where s.transaction_id = t.transaction_id
  )
) as a
where a.review_score > 0

Next, load the data into the new table:

insert overwrite table store_review
select * from ext_store_review s
where exists (
  select * from ext_transaction_details t
  where cast(s.transaction_id as int) = cast(t.transaction_id as int)
    and cast(s.store_id as int) = cast(t.store_id as int)
)
and s.review_score != ''

Problem 3: encrypting credit_no

Solution: one-way MD5 hashing. If the value needs to be recoverable, a two-way scheme such as Base64 encoding/decoding can be used instead (note that Base64 is only an encoding, not real encryption).
Code implementation

%hive
create table if not exists customer_details as
select customer_id, first_name, last_name, email, gender, address,
       country, job, credit_type, md5(credit_no) as credit_no
from ext_customer_details
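Hive's md5() built-in returns the 32-character hex digest of the input string, and Python's standard hashlib produces the same value, so the masking can be checked outside Hive. A small sketch (the sample card number is made up):

```python
import hashlib

def mask_credit_no(credit_no: str) -> str:
    """One-way MD5 hash; same hex output as Hive's md5() built-in."""
    return hashlib.md5(credit_no.encode("utf-8")).hexdigest()

# 32 hex characters; the original number cannot be recovered from it.
print(mask_credit_no("6222000011112222"))
```

Because MD5 is one-way, equality checks on the hashed column still work (the same input always hashes to the same digest), but the raw card number is no longer stored.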

Problem 4: partitioning the transaction data by date (YYYY-MM)

Solution: enable dynamic partitioning and partition by year and month.
Code implementation

set hive.exec.dynamic.partition=true;           -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict; -- allow all partition keys to be dynamic (the insert below has no static partition)

insert overwrite table transaction_details partition(year, month)
select (sq - 1) * 10000 + transaction_id as transaction_id,  -- shift duplicate ids into a new range
       customer_id,
       store_id,
       price,
       product,
       date,
       time,
       year(date) as year,
       month(date) as month
from (
  select *, row_number() over (partition by transaction_id) as sq
  from ext_transaction_details
) t1
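The subquery numbers the copies of each duplicate transaction_id with row_number(), and (sq - 1) * 10000 + transaction_id shifts every extra copy into a new id range, which resolves the duplicated rows from problem 1; year(date) and month(date) then supply the dynamic partition keys. A Python sketch of the same renumbering and partitioning (the sample rows are made up):

```python
from collections import defaultdict

rows = [
    {"transaction_id": 8001, "date": "2017-05-12"},
    {"transaction_id": 8001, "date": "2017-06-03"},  # duplicate id
    {"transaction_id": 42,   "date": "2017-05-20"},
]

# row_number() over (partition by transaction_id): number copies 1, 2, ...
seen = defaultdict(int)
# dynamic partitioning: route each row to its (year, month) bucket
partitions = defaultdict(list)

for r in rows:
    seen[r["transaction_id"]] += 1
    sq = seen[r["transaction_id"]]
    # (sq - 1) * 10000 + transaction_id: first copy keeps its id,
    # the second copy of 8001 becomes 18001, and so on.
    new_id = (sq - 1) * 10000 + r["transaction_id"]
    year, month = r["date"].split("-")[:2]
    partitions[(year, month)].append({**r, "transaction_id": new_id})
```

One caveat this sketch makes visible: row_number() without an order by clause numbers the copies in an arbitrary order, so which duplicate keeps the original id is not deterministic; that matches the HQL above, which also omits order by.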