Data Cleaning

  • Create the project
  • 1. Step 1: import the JSON
  • 2. Upload the CSV files to HDFS
  • 3. Create the HDFS directories
  • 4. Upload the tables
  • 5. Clean up the database
  • 6. Modify the interpreter
  • 7. Create the raw data tables and upload the CSV files
  • Data problem analysis
  • Table processing
  • Problem 1: rows 8001–8100 contain duplicates
  • Problem 2: filter out store_review rows with no review score
  • Problem 3: encrypting credit_no
  • Problem 4: partitioning the transaction data by date (YYYY-MM)


CSV processing, after the tables have been recreated and cleaned:

Problem 2: filter out store_review rows with no review score

Solution: remove the useless rows and reload the remaining data into a new table.
Code implementation
First, create a new table to receive the cleaned data:

Create the table store_review:

%hive
create table if not exists store_review(
 transaction_id string,
 store_id string,
 review_score string
)
row format delimited fields terminated by ','
stored as textfile

Remove the useless data
Approach:

select * from ext_store_review s
where exists (
  select * from ext_transaction_details t
  where cast(s.transaction_id as int) = cast(t.transaction_id as int)
    and cast(s.store_id as int) = cast(t.store_id as int)
)
and s.review_score != ''

This filters out exactly the rows we need: reviews that have a score and whose transaction_id and store_id both appear in the two tables.
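The EXISTS clause above is effectively a semi-join: a review row is kept only if a matching (transaction_id, store_id) pair exists in the transaction table and its score is non-empty. A minimal Python sketch of the same logic (the sample rows are made up for illustration):

```python
# Semi-join filter: keep a review only if its (transaction_id, store_id)
# pair appears in the transaction table and review_score is non-empty.
transactions = [
    {"transaction_id": "1", "store_id": "10"},
    {"transaction_id": "2", "store_id": "20"},
]
reviews = [
    {"transaction_id": "1", "store_id": "10", "review_score": "5"},
    {"transaction_id": "2", "store_id": "20", "review_score": ""},   # no score -> dropped
    {"transaction_id": "3", "store_id": "30", "review_score": "4"},  # no matching transaction -> dropped
]

# Set of known (transaction_id, store_id) pairs, cast to int
# just as the HQL does with cast(... as int).
known = {(int(t["transaction_id"]), int(t["store_id"])) for t in transactions}

cleaned = [
    r for r in reviews
    if r["review_score"] != ""
    and (int(r["transaction_id"]), int(r["store_id"])) in known
]
```

Only the first review survives; the other two are dropped for the reasons noted in the comments.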



There is more than one way to write this HQL statement, but the first version is the optimized one:

select * from (
  select * from ext_store_review s
  where exists (
    select * from ext_transaction_details t
    where s.transaction_id = t.transaction_id
  )
) as a
where a.review_score > 0

Next, load the data into the new table:

insert overwrite table store_review
select * from ext_store_review s
where exists (
  select * from ext_transaction_details t
  where cast(s.transaction_id as int) = cast(t.transaction_id as int)
    and cast(s.store_id as int) = cast(t.store_id as int)
)
and s.review_score != ''

Problem 3: encrypting credit_no

Solution: one-way MD5 hashing. If the value needs to be recoverable, a two-way scheme such as Base64 encoding/decoding can be used instead (note that Base64 is only an encoding, not real encryption).
Code implementation

%hive
create table if not exists customer_details as
select customer_id, first_name, last_name, email, gender, address,
       country, job, credit_type, md5(credit_no) as credit_no
from ext_customer_details
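Hive's md5() built-in returns the 32-character hex digest of the input string, and Python's standard hashlib produces the same value, so the masking can be checked outside Hive. A small sketch (the sample card number is made up):

```python
import hashlib

def mask_credit_no(credit_no: str) -> str:
    """One-way MD5 hash; same hex output as Hive's md5() built-in."""
    return hashlib.md5(credit_no.encode("utf-8")).hexdigest()

# 32 hex characters; the original number cannot be recovered from it.
print(mask_credit_no("6222000011112222"))
```

Because MD5 is one-way, equality checks on the hashed column still work (the same input always hashes to the same digest), but the raw card number is no longer stored.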

Problem 4: partitioning the transaction data by date (YYYY-MM)

Solution: enable dynamic partitioning and partition by year and month.
Code implementation

set hive.exec.dynamic.partition=true;           -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode=nonstrict; -- allow all partition keys to be dynamic (the insert below has no static partition)

insert overwrite table transaction_details partition(year, month)
select (sq - 1) * 10000 + transaction_id as transaction_id,  -- shift duplicate ids into a new range
       customer_id,
       store_id,
       price,
       product,
       date,
       time,
       year(date) as year,
       month(date) as month
from (
  select *, row_number() over (partition by transaction_id) as sq
  from ext_transaction_details
) t1
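The subquery numbers the copies of each duplicate transaction_id with row_number(), and (sq - 1) * 10000 + transaction_id shifts every extra copy into a new id range, which resolves the duplicated rows from problem 1; year(date) and month(date) then supply the dynamic partition keys. A Python sketch of the same renumbering and partitioning (the sample rows are made up):

```python
from collections import defaultdict

rows = [
    {"transaction_id": 8001, "date": "2017-05-12"},
    {"transaction_id": 8001, "date": "2017-06-03"},  # duplicate id
    {"transaction_id": 42,   "date": "2017-05-20"},
]

# row_number() over (partition by transaction_id): number copies 1, 2, ...
seen = defaultdict(int)
# dynamic partitioning: route each row to its (year, month) bucket
partitions = defaultdict(list)

for r in rows:
    seen[r["transaction_id"]] += 1
    sq = seen[r["transaction_id"]]
    # (sq - 1) * 10000 + transaction_id: first copy keeps its id,
    # the second copy of 8001 becomes 18001, and so on.
    new_id = (sq - 1) * 10000 + r["transaction_id"]
    year, month = r["date"].split("-")[:2]
    partitions[(year, month)].append({**r, "transaction_id": new_id})
```

One caveat this sketch makes visible: row_number() without an order by clause numbers the copies in an arbitrary order, so which duplicate keeps the original id is not deterministic; that matches the HQL above, which also omits order by.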