datax 同步数据到hive分区表支持orc格式码 oracle同步到hive

转载

小鱼儿 2024-03-11 23:54:53

文章标签 hive oracle big data 数据字段 文章分类 Hive 大数据

## Oracle中数据迁移到hive平台

序言：数据从Oracle平台导出为csv格式，上传至hdfs上，在hive上load至临时表，最后insert到目标表，完成数据Oracle中数据迁移到hive平台。（表名及字段等信息属于公司敏感信息，用别的信息代替展示）

#### 1、将数据从oracle导出为csv文件格式

oracle导出数据采用sqluldr 工具

```shell
sqluldr user/pasword query='select * from student where age = 10' file='/home/db01/tunan/student.csv' field=',' charset='ZHS16GBK' safe=yes
-- user:用户名
-- pasword：密码
-- query：sql语句（要导出那张表的哪些字段、过滤条件等信息）
-- file:指定路径及文件名称
-- field：指定字段信息的分割符
-- charset：指定字符集
```

当命令行显示output file /home/db01/tunan/student.csv closed at 100000000rows,size 20000MB.证明导出数据成功。

导出之前请检查空间是否够用。df (这里已采坑空间不足也不报错，知道存储下位置此时数据是不完整的。注意呀~)

当然导入少量数据的话也可以使用PL/SQL Developer 的Export Query Results.. （这种速度是真的慢！）

![image-20210901215714341](https://gitee.com/TuNan86/mapdeport2/raw/master/img/20210901215714.png)

#### 2、将csv文件上传至hdfs

csv文件小于4G可以将数据从linux sz到本地，拖动本地文件直接上传至 hdfs上

![image-20210901220616433](https://gitee.com/TuNan86/mapdeport2/raw/master/img/20210901220616.png)

也可将数据通 -put 到hdfs上

```shell
hdfs dfs -put -f student.csv /user/hive/..
-- /user/hive/.. :hdfs上的路径
```

3、在hive中创建临时表

```sql
CREATE TABLE 'tmp.student_tmp'(
'id' sting comment '主键',
...
)
COMMENT '学生表'
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde' -- 剔除引号
WITH SERDEPROPERTIES(
'field.delim'='\t',
'serialization.format'='\t'
)
stored as textfile -- 存格式为textfile 也可以指定为orc 格式

alter table tmp.student_tmp set tablproperties('skip.header.line.count'='1') --剔除首行（检查自己导出的csv文件是否有首行字段名，如果没有就不要加这一行，否则结果会少一行数据）
```

在hive中执行以上语句；

#### 4、将数据load到临时表中

```shell
load data inpath '/user/hive/....../student.csv' overwrite into table tmp.student_tmp
```

load结束检查数据是否load成功。打开PL/SQL Developer 执行SQL语句对照字段名对照数据，总条数等

```sql
select * from student -- 在PL/SQL Developer 执行
select * from tmp.student_tmp -- 在hive上执行
-- 对照字段对应情况，查看数据已经插入
select count(*) from student --在PL/SQL Developer 执行
select count(*) from tmp.student_tmp -- 在hive上执行
-- 对照两个平台上的数据总数是否一致，如果hive上少一条，检查首行是否重复剔除
```

#### 5、在hive平台中创建目标表

```sql
create table 'ods.student'( --临时表对应，如果目标表为分区表指定分区字段
'id' sting comment '主键',
...
)
COMMENT '学生表'
-- PARTITIONED BY ('age' string) -- 指定分区字段
row format serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
stored as textfile -- 存格式为textfile 也可以指定为orc 格式
LOCATION
'hdfs://ns1/user/hive/warehouse/ods.db/student'
```

执行该shell

#### 6、insert into 目标表 select 字段 from 临时表验证

```sql
--开启动态分区
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;

-- 将临时表的中数据插入到目标表中
insert into table ods.student partition(age) -- 如果目标表是分区表，加上开区动态分区字段，加上partition（分区字段）
select
id ,
....
from tmp.student_tmp
```

再次验证数据准确性

```sql
select * from student -- 在PL/SQL Developer 执行
select * from ods.student -- 在hive上执行
-- 对照字段对应情况，查看数据已经插入
select count(*) from student --在PL/SQL Developer 执行
select count(*) from ods.student -- 在hive上执行
-- 对照两个平台上的数据总数是否一致
```

#### 补充：迁移中可能遇到的问题：

1、使用PL/SQL Developer 将数据从oracle导出csv文件速度太慢

可以使用sqluldr工具。

2、数据条数不对

可能是导出时存储空间不足，只能导出一部分数据，导出前df一下预估一下空间是否充足

3、csv数据太大，上传到hdfs集群速度慢，长传中途失败

可以使用 scp [可选参数] file_source file_target

scp student.csv file_target

当然也可以使用split 命令将大文件拆分，按照数据量大小如：3G /按照条数如：1000000条

本文章为转载内容，我们尊重原作者对文章享有的著作权。如有内容错误或侵权问题，欢迎原作者联系我们进行内容更正或删除文章。