Abstract: Based on Flink 1.14.4 + Iceberg 0.13.2, this article uses the Flink DataStream API to operate on Iceberg tables. It covers, for both Hadoop and Hive catalog types, table creation, batch reads, streaming reads, appends, overwrites, schema changes, and small-file compaction, and shows how to convert both DataStream<Row> and DataStream<RowData> inputs for writing.
1. Official documentation
Official docs: https://iceberg.apache.org/docs/latest/flink/
1.1 Reading with DataStream
Iceberg now supports streaming and batch reads through the Java API.
Batch read
This example reads all records from an Iceberg table and prints them to stdout in a Flink batch job:
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
DataStream<RowData> batch = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(false)
.build();
// Print all records to stdout.
batch.print();
// Submit and execute this batch read job.
env.execute("Test Iceberg Batch Read");
1.2 Streaming read
This example reads the incremental records starting from snapshot id '3821550127947089987' and prints them to stdout in a Flink streaming job:
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
DataStream<RowData> stream = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(true)
.startSnapshotId(3821550127947089987L)
.build();
// Print all records to stdout.
stream.print();
// Submit and execute this streaming read job.
env.execute("Test Iceberg Streaming Read");
1.3 Appending data
Iceberg supports writing to an Iceberg table from different DataStream inputs.
Both DataStream<RowData> and DataStream<Row> are supported natively as sink input for an Iceberg table.
StreamExecutionEnvironment env = ...;
DataStream<RowData> input = ... ;
Configuration hadoopConf = new Configuration();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path", hadoopConf);
FlinkSink.forRowData(input)
.tableLoader(tableLoader)
.build();
env.execute("Test Iceberg DataStream");
The Iceberg API also allows users to write a generic DataStream<Row> to an Iceberg table; more examples can be found in the Iceberg unit tests.
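For reference, a minimal sketch of writing a DataStream<Row> with an explicit TableSchema (the column names, sample rows, and table path are placeholders, assuming a table with columns id BIGINT and data STRING already exists):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// A generic Row stream; the TableSchema must match the target Iceberg table's columns.
DataStream<Row> rows = env.fromCollection(Arrays.asList(Row.of(1L, "a"), Row.of(2L, "b")));
TableSchema schema = TableSchema.builder()
.field("id", DataTypes.BIGINT())
.field("data", DataTypes.STRING())
.build();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
FlinkSink.forRow(rows, schema)
.tableLoader(tableLoader)
.build();
env.execute("Test Iceberg DataStream<Row> Sink");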
1.4 Overwriting data
To dynamically overwrite the data in an existing Iceberg table, set the overwrite flag on the FlinkSink builder.
StreamExecutionEnvironment env = ...;
DataStream<RowData> input = ... ;
Configuration hadoopConf = new Configuration();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path", hadoopConf);
FlinkSink.forRowData(input)
.tableLoader(tableLoader)
.overwrite(true)
.build();
env.execute("Test Iceberg DataStream");
1.5 Rewrite files action
Iceberg provides an API that rewrites small files into large files by submitting a Flink batch job. The behavior of this Flink action is the same as Spark's rewriteDataFiles.
import org.apache.iceberg.flink.actions.Actions;
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
Table table = tableLoader.loadTable();
RewriteDataFilesActionResult result = Actions.forTable(table)
.rewriteDataFiles()
.execute();
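A target file size can also be set on the action. A minimal sketch, assuming a 128 MB target size (targetSizeInBytes is an optional tuning knob; the path is a placeholder):
import org.apache.iceberg.flink.actions.Actions;
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
tableLoader.open(); // some Iceberg versions require opening the loader before loadTable()
Table table = tableLoader.loadTable();
// Rewrite small data files into files of roughly 128 MB.
RewriteDataFilesActionResult result = Actions.forTable(table)
.rewriteDataFiles()
.targetSizeInBytes(128 * 1024 * 1024L)
.execute();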
2. Local examples
2.1 Batch read of a Hadoop-catalog table
Create the class work.jiang.iceberg.flinkstream.hadoopcatalog.ButchReadHadoopCatalogIcebergTable.
The code is as follows:
package work.jiang.iceberg.flinkstream.hadoopcatalog;
import work.jiang.iceberg.util.IcebergUtil;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
/**
 * Reads operate on an existing table; if the table does not exist, an error is thrown.
 */
public class ButchReadHadoopCatalogIcebergTable {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
// Batch read of a Hadoop-catalog table
DataStream<RowData> batch = IcebergUtil.butchReadHadoopTable(env,"hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1");
// Print all records to stdout.
batch.map(new MapFunction<RowData, String>() {
@Override
public String map(RowData value) throws Exception {
return value.getLong(0) +": " + value.getString(1);
}
}).print();
// Submit and execute this batch read job.
env.execute("Test Iceberg Batch Read");
}
}
2.2 Batch read of a Hive-catalog table
package work.jiang.iceberg.flinkstream.hivecatalog;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import work.jiang.iceberg.util.IcebergUtil;
public class ButchReadHiveCatalogIcebergTable {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
// Batch read of a Hive-catalog table
DataStream<RowData> batch = IcebergUtil.butchReadFormHiveCatalog(env,"iceberg_db_hive","sample");
// Print all records to stdout.
batch.map(new MapFunction<RowData, String>() {
@Override
public String map(RowData value) throws Exception {
return value.getLong(0) +": " + value.getString(1);
}
}).print();
// Submit and execute this batch read job.
env.execute("Test Iceberg Batch Read");
}
}
2.3 Streaming read of a Hadoop-catalog table
package work.jiang.iceberg.flinkstream.hadoopcatalog;
import work.jiang.iceberg.util.IcebergUtil;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
public class StreamReadHadoopCatalogIcebergTable {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
DataStream<RowData> stream = IcebergUtil.streamReadHadoopTable(env, "hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1",1359438792274225442L);
// Print all records to stdout.
// The start snapshot id comes from the snapshot file name under the table's metadata directory,
// e.g. snap-1359438792274225442-1-64b39fbc-9006-4d5f-80c5-b2387d613800.avro (snap-<SnapshotId>-*.avro).
stream.map(new MapFunction<RowData, String>() {
@Override
public String map(RowData value) throws Exception {
return value.getLong(0) +": "+ value.getString(1);
}
}).print();
// Submit and execute this streaming read job.
env.execute("Test Iceberg Streaming Read");
}
}
2.4 Streaming read of a Hive-catalog table
package work.jiang.iceberg.flinkstream.hivecatalog;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import work.jiang.iceberg.util.IcebergUtil;
public class StreamReadHiveCatalogIcebergTable {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
DataStream<RowData> stream = IcebergUtil.streamReadFormHiveCatalog(env,"iceberg_db_hive","sample");
// Print all records to stdout.
stream.map(new MapFunction<RowData, String>() {
@Override
public String map(RowData value) throws Exception {
return value.getLong(0) +": "+ value.getString(1);
}
}).print();
// Submit and execute this streaming read job.
env.execute("Test Iceberg Streaming Read");
}
}
Create the utility class work.jiang.iceberg.util.IcebergUtil:
package work.jiang.iceberg.util;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.table.data.RowData;
import org.apache.flink.types.Row;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFilesActionResult;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.flink.CatalogLoader;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.actions.Actions;
import org.apache.iceberg.flink.sink.FlinkSink;
import org.apache.iceberg.flink.source.FlinkSource;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;
import java.util.HashMap;
import java.util.Map;
public class IcebergUtil {
// Batch read
public static DataStream<RowData> butchReadHadoopTable(StreamExecutionEnvironment env, String path){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
DataStream<RowData> batch = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(false)
.build();
return batch;
}
// Streaming read
public static DataStream<RowData> streamReadHadoopTable(StreamExecutionEnvironment env, String path,Long startSnapshotId){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
DataStream<RowData> stream = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(true)
.startSnapshotId(startSnapshotId)
.build();
return stream;
}
// Batch read (Hive catalog)
public static DataStream<RowData> butchReadFormHiveCatalog(StreamExecutionEnvironment env,String databaseName,String tableName){
CatalogLoader loader = createHiveCatalog("hdfs://cdh01:8020/user/iceberg/hive_catalog");
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of(databaseName), tableName);
TableLoader tableLoader = TableLoader.fromCatalog(loader,tableIdentifier);
DataStream<RowData> batch = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(false)
.build();
return batch;
}
// Streaming read (Hive catalog)
public static DataStream<RowData> streamReadFormHiveCatalog(StreamExecutionEnvironment env,String databaseName,String tableName){
CatalogLoader loader = createHiveCatalog("hdfs://cdh01:8020/user/iceberg/hive_catalog");
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of(databaseName), tableName);
TableLoader tableLoader = TableLoader.fromCatalog(loader,tableIdentifier);
DataStream<RowData> batch = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(true)
.build();
return batch;
}
}
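Note: the createHiveCatalog helper referenced above is added to this same class in section 2.6 below.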
2.5 Creating a Hadoop-catalog table
In the class work.jiang.iceberg.flinkstream.WriteIceberg,
add the method createTableByHadoopTables:
// Create an Iceberg table with HadoopTables; the table directory is created automatically. Verified.
public static void createTableByHadoopTables(){
org.apache.iceberg.Schema iceberg_schema = new Schema(
Types.NestedField.optional(1, "id", Types.LongType.get()),
Types.NestedField.optional(2, "order_id", Types.LongType.get()),
Types.NestedField.optional(3, "product_id", Types.LongType.get()),
Types.NestedField.optional(4, "product_price", Types.StringType.get()),
Types.NestedField.optional(5, "product_quantity", Types.IntegerType.get()),
Types.NestedField.optional(6, "product_name", Types.StringType.get())
);
String location = "hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_3";
IcebergUtil.createTableByHadoopTables(iceberg_schema,"id",location);
}
Add the method createTableByHadoopTables to work.jiang.iceberg.util.IcebergUtil:
// Create a table with a partition spec
public static void createTableByHadoopTables(org.apache.iceberg.Schema iceberg_schema,String partitionColumnName,String location){
// Set the partition column
PartitionSpec iceberg_partition = PartitionSpec.builderFor(iceberg_schema).identity(partitionColumnName).build();
HadoopTables hadoopTable = new HadoopTables(new Configuration());
if(!hadoopTable.exists(location)){
hadoopTable.create(iceberg_schema, iceberg_partition, location);
}
}
In the main method of work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Create the table
createTableByHadoopTables();
env.execute();
}
Then check the hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/ path: a t_iceberg_sample_3 directory has been created, and its metadata directory contains the table's metadata files, so the table was created successfully.
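To verify from code instead of browsing HDFS, a small sketch (assuming the same table location) that loads the table and prints its schema and partition spec:
// Quick sanity check: load the newly created table and print its schema and partition spec.
HadoopTables tables = new HadoopTables(new Configuration());
Table table = tables.load("hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_3");
System.out.println(table.schema());
System.out.println(table.spec());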
2.6 Creating a Hive-catalog table
Add the method createTableByHiveTables to work.jiang.iceberg.flinkstream.WriteIceberg:
// Create an Iceberg table through the Hive catalog; the table directory is created automatically. Verified.
public static void createTableByHiveTables(){
org.apache.iceberg.Schema iceberg_schema = new Schema(
Types.NestedField.optional(1, "id", Types.LongType.get()),
Types.NestedField.optional(2, "order_id", Types.LongType.get()),
Types.NestedField.optional(3, "product_id", Types.LongType.get()),
Types.NestedField.optional(4, "product_price", Types.StringType.get()),
Types.NestedField.optional(5, "product_quantity", Types.IntegerType.get()),
Types.NestedField.optional(6, "product_name", Types.StringType.get())
);
String cataloglocation= "hdfs://cdh01:8020/user/iceberg/hive_catalog";
IcebergUtil.createTableByHiveTables(iceberg_schema,"product_id",cataloglocation,"iceberg_db_hive","t_iceberg_sample_4");
}
Add the method createTableByHiveTables to work.jiang.iceberg.util.IcebergUtil:
// Create a partitioned table in the Hive catalog
public static void createTableByHiveTables(org.apache.iceberg.Schema iceberg_schema,String partitionColumnName,String cataloglocation,String databaseName,String tableName){
CatalogLoader hive_catalog = IcebergUtil.createHiveCatalog(cataloglocation);
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of(databaseName), tableName);
// Set the partition column
PartitionSpec iceberg_partition = PartitionSpec.builderFor(iceberg_schema).identity(partitionColumnName).build();
// hive_catalog.loadCatalog() can create, drop, and load Iceberg tables
if(!hive_catalog.loadCatalog().tableExists(tableIdentifier)){
Table tableSource = hive_catalog.loadCatalog().createTable(tableIdentifier,iceberg_schema,iceberg_partition);
}
}
// Create a Hive catalog loader
public static CatalogLoader createHiveCatalog(String catalogPath){
Map<String, String> properties = new HashMap<>();
properties.put("type", "iceberg");
properties.put("catalog-type", "hive");
properties.put("property-version", "1");
properties.put("warehouse", catalogPath);
CatalogLoader hive_catalog = CatalogLoader.hive("hive_catalog", new Configuration(), properties);
return hive_catalog;
}
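The sketch above does not set a metastore address, so CatalogLoader.hive relies on a hive-site.xml that is reachable through the classpath or the passed Configuration. Assuming the metastore listens on thrift://cdh01:9083 (a hypothetical address), the "uri" property can also be set explicitly:
// Variant of createHiveCatalog that passes the Hive metastore address explicitly.
Map<String, String> properties = new HashMap<>();
properties.put("warehouse", "hdfs://cdh01:8020/user/iceberg/hive_catalog");
properties.put("uri", "thrift://cdh01:9083"); // hypothetical metastore address
CatalogLoader hive_catalog = CatalogLoader.hive("hive_catalog", new Configuration(), properties);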
In the main method of work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
createTableByHiveTables();
env.execute();
}
2.7 Appending to a Hadoop-catalog table
2.7.1 Input type DataStream<Row>
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Write from a collection. Verified.
DataStream<Row> input = env.fromCollection(
Arrays.asList(Row.of(10L, "gg"), Row.of(11L, "hh")));
TableSchema schema = TableSchema.builder()
.field("id", DataTypes.BIGINT())
.field("data", DataTypes.STRING())
.build();
// Append
IcebergUtil.appendWriteHadoopTableFromDataStreamRow(input,schema,"hdfs:///user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1");
env.execute();
}
}
Add the method appendWriteHadoopTableFromDataStreamRow to work.jiang.iceberg.util.IcebergUtil:
// Write: append data
public static void appendWriteHadoopTableFromDataStreamRow(DataStream<Row> input, TableSchema schema, String path){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
FlinkSink.forRow(input,schema)
.tableLoader(tableLoader)
.build();
}
2.7.2 Input type DataStream<RowData>
Class work.jiang.iceberg.flinkstream.WriteIceberg:
// Write data into the table
public static void dataTest(StreamExecutionEnvironment env,TableLoader tableLoader){
ArrayList<String> list = new ArrayList<>();
Random random = new Random();
for (int i =1 ; i<= 10; i++){
long id = random.nextInt(100);
long orderId = random.nextInt(20);
long product_id = random.nextInt(20);
String product_price = "12";
int product_quantity = random.nextInt(10);
String product_name = "apple" + i;
OrderDetail order = new OrderDetail(id,orderId,product_id,product_price,product_quantity,product_name);
list.add(JSONObject.toJSONString(order));
}
DataStream<String> data = env.fromCollection(list);
DataStream<RowData> input = data.map(item -> {
JSONObject jsonData = JSONObject.parseObject(item);
// Number of fields in the row
GenericRowData rowData = new GenericRowData(6);
rowData.setField(0, jsonData.getLongValue("id"));
rowData.setField(1, jsonData.getLongValue("orderId"));
rowData.setField(2, jsonData.getLongValue("product_id"));
rowData.setField(3, StringData.fromString(jsonData.getString("product_price")));
rowData.setField(4, jsonData.getInteger("product_quantity"));
rowData.setField(5, StringData.fromString(jsonData.getString("product_name")));
return rowData;
});
// Append the data
IcebergUtil.appendWriteFromDataStreamRowData(input,tableLoader);
}
Class work.jiang.iceberg.util.IcebergUtil:
// Write: append data
public static void appendWriteFromDataStreamRowData(DataStream<RowData> input, TableLoader tableLoader){
FlinkSink.forRowData(input).tableLoader(tableLoader).build();
}
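The OrderDetail POJO serialized above is not shown in the article; a minimal sketch, assuming public fields whose names match the JSON keys read back in the map function:
// Minimal POJO for the JSON round-trip in dataTest(); fastjson serializes the public fields by name.
public class OrderDetail {
public long id;
public long orderId;
public long product_id;
public String product_price;
public int product_quantity;
public String product_name;
public OrderDetail() {}
public OrderDetail(long id, long orderId, long product_id, String product_price, int product_quantity, String product_name) {
this.id = id;
this.orderId = orderId;
this.product_id = product_id;
this.product_price = product_price;
this.product_quantity = product_quantity;
this.product_name = product_name;
}
}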
2.8 Overwriting a Hadoop-catalog table
2.8.1 Input type DataStream<Row>
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Write from a collection. Verified.
DataStream<Row> input = env.fromCollection(
Arrays.asList(Row.of(10L, "gg"), Row.of(11L, "hh")));
TableSchema schema = TableSchema.builder()
.field("id", DataTypes.BIGINT())
.field("data", DataTypes.STRING())
.build();
// Overwrite
IcebergUtil.overWriteHadoopTableFromDataStreamRow(input,schema,"hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1");
env.execute();
}
}
Class work.jiang.iceberg.util.IcebergUtil:
// Write: overwrite data
public static void overWriteHadoopTableFromDataStreamRow(DataStream<Row> input, TableSchema schema, String path){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
FlinkSink.forRow(input,schema).tableLoader(tableLoader).overwrite(true).build();
}
2.8.2 Input type DataStream<RowData>
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Write data into the table
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_3");
dataTest(env,tableLoader);
env.execute();
}
// Write data into the table
public static void dataTest(StreamExecutionEnvironment env,TableLoader tableLoader){
ArrayList<String> list = new ArrayList<>();
Random random = new Random();
for (int i =1 ; i<= 10; i++){
long id = random.nextInt(100);
long orderId = random.nextInt(20);
long product_id = random.nextInt(20);
String product_price = "12";
int product_quantity = random.nextInt(10);
String product_name = "apple" + i;
OrderDetail order = new OrderDetail(id,orderId,product_id,product_price,product_quantity,product_name);
list.add(JSONObject.toJSONString(order));
}
DataStream<String> data = env.fromCollection(list);
DataStream<RowData> input = data.map(item -> {
JSONObject jsonData = JSONObject.parseObject(item);
// Number of fields in the row
GenericRowData rowData = new GenericRowData(6);
rowData.setField(0, jsonData.getLongValue("id"));
rowData.setField(1, jsonData.getLongValue("orderId"));
rowData.setField(2, jsonData.getLongValue("product_id"));
rowData.setField(3, StringData.fromString(jsonData.getString("product_price")));
rowData.setField(4, jsonData.getInteger("product_quantity"));
rowData.setField(5, StringData.fromString(jsonData.getString("product_name")));
return rowData;
});
// Overwrite the data
IcebergUtil.overWriteFromDataStreamRowData(input,tableLoader);
}
Class work.jiang.iceberg.util.IcebergUtil:
// Write: overwrite data
public static void overWriteFromDataStreamRowData(DataStream<RowData> input, TableLoader tableLoader){
FlinkSink.forRowData(input).tableLoader(tableLoader).overwrite(true).build();
}
2.9 Appending to / overwriting a Hive-catalog table
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TableLoader tableLoader2 = IcebergUtil.getHiveCatalogTableLoader("hdfs://cdh01:8020/user/iceberg/hive_catalog","iceberg_db_hive","t_iceberg_sample_3");
dataTest(env,tableLoader2);
env.execute();
}
// Write data into the table
public static void dataTest(StreamExecutionEnvironment env,TableLoader tableLoader){
ArrayList<String> list = new ArrayList<>();
Random random = new Random();
for (int i =1 ; i<= 10; i++){
long id = random.nextInt(100);
long orderId = random.nextInt(20);
long product_id = random.nextInt(20);
String product_price = "12";
int product_quantity = random.nextInt(10);
String product_name = "apple" + i;
OrderDetail order = new OrderDetail(id,orderId,product_id,product_price,product_quantity,product_name);
list.add(JSONObject.toJSONString(order));
}
DataStream<String> data = env.fromCollection(list);
DataStream<RowData> input = data.map(item -> {
JSONObject jsonData = JSONObject.parseObject(item);
// Number of fields in the row
GenericRowData rowData = new GenericRowData(6);
rowData.setField(0, jsonData.getLongValue("id"));
rowData.setField(1, jsonData.getLongValue("orderId"));
rowData.setField(2, jsonData.getLongValue("product_id"));
rowData.setField(3, StringData.fromString(jsonData.getString("product_price")));
rowData.setField(4, jsonData.getInteger("product_quantity"));
rowData.setField(5, StringData.fromString(jsonData.getString("product_name")));
return rowData;
});
// Overwrite the data (use either overwrite or append, as needed)
IcebergUtil.overWriteFromDataStreamRowData(input,tableLoader);
// Append the data
IcebergUtil.appendWriteFromDataStreamRowData(input,tableLoader);
}
Class work.jiang.iceberg.util.IcebergUtil:
// Write: overwrite data
public static void overWriteFromDataStreamRowData(DataStream<RowData> input, TableLoader tableLoader){
FlinkSink.forRowData(input).tableLoader(tableLoader).overwrite(true).build();
}
// Write: append data
public static void appendWriteFromDataStreamRowData(DataStream<RowData> input, TableLoader tableLoader){
FlinkSink.forRowData(input).tableLoader(tableLoader).build();
}
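The getHiveCatalogTableLoader helper called in the main method above is not shown in the article; a minimal sketch, assuming it simply combines createHiveCatalog with TableLoader.fromCatalog:
// Build a TableLoader for a table managed by the Hive catalog.
public static TableLoader getHiveCatalogTableLoader(String catalogLocation, String databaseName, String tableName){
CatalogLoader catalogLoader = createHiveCatalog(catalogLocation);
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of(databaseName), tableName);
return TableLoader.fromCatalog(catalogLoader, tableIdentifier);
}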
2.11 Modifying the schema of a Hadoop-catalog table
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Change the table schema
IcebergUtil.changeSchemaHadoopTable();
DataStream<Row> input = env.fromCollection(
Arrays.asList(Row.of(10L, "gg","2022-07-06 16:10:11"), Row.of(11L, "hh","2022-07-06 16:10:12")));
TableSchema schema = TableSchema.builder()
.field("id", DataTypes.BIGINT())
.field("data", DataTypes.STRING()).field("updatetime", DataTypes.STRING())
.build();
IcebergUtil.appendWriteHadoopTableFromDataStreamRow(input,schema,"hdfs:///user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1");
env.execute();
}
Class work.jiang.iceberg.util.IcebergUtil:
public class IcebergUtil {
/**
 * Available table update operations:
 * updateSchema - update the table schema
 * updateProperties - update table properties
 * updateLocation - update the table's base location
 * newAppend - append data files
 * newFastAppend - append data files without compacting metadata
 * newOverwrite - append data files and delete overwritten files
 * newDelete - delete data files
 * newRewrite - rewrite data files; replaces existing files with new versions
 * newTransaction - create a new table-level transaction
 * rewriteManifests - rewrite manifest data by clustering files, to speed up scan planning
 * rollback - roll back the table state to a specific snapshot
 */
public static void changeSchemaHadoopTable(){
CatalogLoader hadoop_catalog = IcebergUtil.createHadoopCatalog("hdfs://cdh01:8020/user/iceberg/hadoop_catalog");
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of("iceberg_db"), "t_iceberg_sample_1");
// Load the table
Table tableSource = hadoop_catalog.loadCatalog().loadTable(tableIdentifier);
tableSource.updateSchema()
.addColumn("updatetime", Types.StringType.get())
.commit();
}
// Write: append data
public static void appendWriteHadoopTableFromDataStreamRow(DataStream<Row> input, TableSchema schema, String path){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
FlinkSink.forRow(input,schema)
.tableLoader(tableLoader)
.build();
}
}
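The createHadoopCatalog helper used in changeSchemaHadoopTable is also not shown in the article; a minimal sketch, assuming it mirrors createHiveCatalog but builds a Hadoop catalog rooted at the given warehouse path:
// Create a Hadoop catalog loader rooted at the given warehouse path.
public static CatalogLoader createHadoopCatalog(String warehousePath){
Map<String, String> properties = new HashMap<>();
properties.put("warehouse", warehousePath);
return CatalogLoader.hadoop("hadoop_catalog", new Configuration(), properties);
}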
2.12 Modifying the schema of a Hive-catalog table
Class work.jiang.iceberg.util.IcebergUtil:
Modifying the schema of a Hive-catalog table is similar to the Hadoop-catalog case; only the way the catalog is obtained differs.
public static void changeSchemaHiveTable(){
CatalogLoader hive_catalog = IcebergUtil.createHiveCatalog("hdfs://cdh01:8020/user/iceberg/hive_catalog");
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of("iceberg_db_hive"), "t_iceberg_sample_1");
// Load the table
Table tableSource = hive_catalog.loadCatalog().loadTable(tableIdentifier);
tableSource.updateSchema()
.addColumn("updatetime", Types.StringType.get())
.commit();
}
2.13 Compacting small files in a Hadoop-catalog table
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Compact small files
IcebergUtil.overWriteHadoopTableMiniFileToBigFile("hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_2");
env.execute();
}
Class work.jiang.iceberg.util.IcebergUtil:
// Rewrite files: Iceberg provides an API that rewrites small files into large files by submitting a Flink batch job
public static void overWriteHadoopTableMiniFileToBigFile(String location){
TableLoader tableLoader = TableLoader.fromHadoopTable(location);
Table table = tableLoader.loadTable();
RewriteDataFilesActionResult result = Actions.forTable(table)
.rewriteDataFiles()
.execute();
}
2.14 Compacting small files in a Hive-catalog table
This is similar to the Hadoop-catalog case; only the way the TableLoader is obtained differs.
// Rewrite files: Iceberg provides an API that rewrites small files into large files by submitting a Flink batch job
public static void overWriteHiveTableMiniFileToBigFile(String hivecatalogLocation,String databaseName,String tableName){
TableLoader tableLoader = getHiveCatalogTableLoader(hivecatalogLocation, databaseName,tableName);
Table table = tableLoader.loadTable();
RewriteDataFilesActionResult result = Actions.forTable(table)
.rewriteDataFiles()
.execute();
}