88.1 演示环境介绍
- CM和CDH版本:5.13.1
- 已安装Impala在集群中
88.2 操作演示
拉链表设计
- 表USER,用于存储用户最新的全量信息,如下图:
- 建表:
create table user(
id bigint,
username string,
birthday timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS parquet;
- 初始数据:
INSERT INTO user values
(10001, 'fayson', '1989-08-28'),
(10002, 'zhangsan', '1979-07-28'),
(10003, 'lisi', '1980-06-18'),
(10004, 'wangwu', '1977-01-20');
- 拉链表USER_HIS,如下图:
- 建表:
- 使用分区表,是为了能够实现拉链数据的更新和删除
create table user_his(
id bigint,
username string,
birthday timestamp,
start_dt timestamp
) partitioned by (end_dt string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS parquet;
- 使用上面的表创建USER和USER_HIS表,并初始化USER表数据
拉链实现
- USER_HIS表中创建一个’9999-12-31’的分区用于存储所有用户开链数据
ALTER TABLE user_his ADD PARTITION (end_dt='9999-12-31');
- 首次USER_HIS表中无任何数据,通过USER表数据初始化拉链表USER_HIS表数据,插入所有用户的开链数据
INSERT overwrite TABLE user_his PARTITION (end_dt = '9999-12-31')
SELECT id,
username,
birthday,
from_timestamp(adddate(now(), -3), 'yyyy-MM-dd')
FROM USER;
- USER表的username修改为如下
- 为了与拉链表对比用户数据的变更
INSERT overwrite TABLE USER
SELECT id,
concat(username,'1'),
birthday
FROM USER;
- 拉链表上创建”2018-01-16”的分区
--ALTER TABLE user_his ADD PARTITION (end_dt= from_timestamp(now(), 'yyyy-MM-dd'));
ALTER TABLE user_his ADD PARTITION (end_dt= "2018-01-16");
- 将修改的USER表用户数据与USER_HIS表中开链数据比对,将可以闭链的数据插入”2018-01-16”分区
INSERT overwrite TABLE user_his PARTITION (end_dt = "2018-01-16")
SELECT b.id,
b.username,
b.birthday,
b.start_dt
FROM USER a
LEFT JOIN user_his b ON a.id=b.id
WHERE b.end_dt = '9999-12-31'
AND (a.username != b.username
OR a.birthday != b.birthday);
- 在用户表中新增一条用户信息,模拟用户表数据不存在拉链表的开链数据中
INSERT INTO user VALUES (10005, 'zhaoda', '1976-02-09');
- 更新拉链表USER_HIS的开链数据
- 包含已更新的用户、未更新用户和新增用户
INSERT overwrite TABLE user_his PARTITION(end_dt = '9999-12-31')
SELECT a.id,
a.username,
a.birthday,
b.end_dt AS start_dt
FROM USER a
LEFT JOIN user_his b ON a.id = b.id
WHERE b.end_dt = "2018-01-16"
union all
SELECT b.id,
b.username,
b.birthday,
b.start_dt
FROM user_his b
WHERE NOT EXISTS
(SELECT id
FROM user_his c
WHERE c.end_dt = "2018-01-16" and b.id = c.id)
AND b.end_dt = '9999-12-31'
union ALL
SELECT a.id,
a.username,
a.birthday,
"2018-01-16" AS start_dt
FROM USER a
WHERE NOT EXISTS
(SELECT 1
FROM user_his b
WHERE end_dt = '9999-12-31'
AND a.id = b.id);
- 模拟更新部分用户信息,验证拉链业务是否正常
- USER表数据
INSERT INTO user values
(10001, 'fayson2', '1989-09-27'),
(10002, 'zhangsan2', '1979-07-28'),
(10003, 'lisi1', '1980-06-18'),
(10004, 'wangwu1', '1977-01-20'),
(10005, 'zhaoda', '1976-02-09');
- 创建USRE_HIS表“2018-01-17”分区
ALTER TABLE user_his ADD PARTITION (end_dt= "2018-01-17");
用户的闭链数据插入到“2018-01-17”分区
INSERT overwrite TABLE user_his PARTITION (end_dt = "2018-01-17")
SELECT b.id,
b.username,
b.birthday,
b.start_dt
FROM USER a
LEFT JOIN user_his b ON a.id=b.id
WHERE b.end_dt = '9999-12-31'
AND (a.username != b.username
OR a.birthday != b.birthday);
- 根据USER和USER_HIS中“2018-01-17”分区的闭链数据,更新所有用户开链数据:
- 含新增用户、闭链用户和开链用户
INSERT overwrite TABLE user_his PARTITION(end_dt = '9999-12-31')
SELECT a.id,
a.username,
a.birthday,
b.end_dt AS start_dt
FROM USER a
LEFT JOIN user_his b ON a.id = b.id
WHERE b.end_dt = "2018-01-17"
union all
SELECT b.id,
b.username,
b.birthday,
b.start_dt
FROM user_his b
WHERE NOT EXISTS
(SELECT id
FROM user_his c
WHERE c.end_dt = "2018-01-17" and b.id = c.id)
AND b.end_dt = '9999-12-31'
union ALL
SELECT a.id,
a.username,
a.birthday,
"2018-01-17" AS start_dt
FROM USER a
WHERE NOT EXISTS
(SELECT 1
FROM user_his b
WHERE end_dt = '9999-12-31'
AND a.id = b.id);
- 完整脚本
- 执行脚本的前置条件,拉链表已存在且已创建了开链分区,脚本中将分区替换为当前日期按照每天的一次的频率执行
use test_db;
--创建当天闭链分区
ALTER TABLE user_his ADD PARTITION(end_dt= from_timestamp(now(), 'yyyy-MM-dd'));
--将闭链数据插入当天闭链分区中
INSERT overwrite TABLE user_his PARTITION(end_dt = from_timestamp(now(), 'yyyy-MM-dd'))
SELECT b.id,
b.username,
b.birthday,
b.start_dt
FROM USER a
LEFT JOIN user_his b ON a.id=b.id
WHERE b.end_dt = '9999-12-31'
AND (a.username != b.username
OR a.birthday != b.birthday);
--更新拉链表数据开链数据(包含已更新的用户、未更新用户和新增用户)
INSERT overwrite TABLE user_his PARTITION(end_dt = '9999-12-31')
SELECT a.id,
a.username,
a.birthday,
b.end_dt AS start_dt
FROM USER a
LEFT JOIN user_his b ON a.id = b.id
WHERE b.end_dt = from_timestamp(now(), 'yyyy-MM-dd')
union all
SELECT b.id,
b.username,
b.birthday,
b.start_dt
FROM user_his b
WHERE NOT EXISTS
(SELECT id
FROM user_his c
WHERE c.end_dt = from_timestamp(now(), 'yyyy-MM-dd') and b.id = c.id)
AND b.end_dt = '9999-12-31'
union ALL
SELECT a.id,
a.username,
a.birthday,
from_timestamp(now(), 'yyyy-MM-dd') AS start_dt
FROM USER a
WHERE NOT EXISTS
(SELECT 1
FROM user_his b
WHERE end_dt = '9999-12-31'
AND a.id = b.id);