Abstract: Based on Flink 1.14.4 + Iceberg 0.13.2, this article uses the Flink DataStream API to operate on Iceberg tables. It covers, for both Hadoop and Hive catalog types, table creation, batch reads, streaming reads, appends, overwrites, schema changes, and small-file compaction, and shows how to convert both DataStream<Row> and DataStream<RowData> inputs for writing.
1. Official documentation
Official docs: https://iceberg.apache.org/docs/latest/flink/
1.1 Reading with DataStream
Iceberg now supports streaming and batch reads through the Java API.
Batch read
This example reads all records from an Iceberg table and prints them to stdout in a Flink batch job:
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
DataStream<RowData> batch = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(false)
.build();
// Print all records to stdout.
batch.print();
// Submit and execute this batch read job.
env.execute("Test Iceberg Batch Read");
1.2 Streaming read
This example reads the incremental records starting from snapshot id '3821550127947089987' and prints them to stdout in a Flink streaming job:
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
DataStream<RowData> stream = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(true)
.startSnapshotId(3821550127947089987L)
.build();
// Print all records to stdout.
stream.print();
// Submit and execute this streaming read job.
env.execute("Test Iceberg Streaming Read");
1.3 Appending data
Iceberg supports writing to an Iceberg table from different DataStream inputs.
Both DataStream<RowData> and DataStream<Row> are supported natively as sink input for an Iceberg table.
StreamExecutionEnvironment env = ...;
DataStream<RowData> input = ... ;
Configuration hadoopConf = new Configuration();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path", hadoopConf);
FlinkSink.forRowData(input)
.tableLoader(tableLoader)
.build();
env.execute("Test Iceberg DataStream");
The Iceberg API also allows users to write a generic DataStream<Row> to an Iceberg table; more examples can be found in the Iceberg unit tests.
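For reference, a minimal sketch of writing a DataStream<Row> with an explicit TableSchema (the column names, sample rows, and table path are placeholders, assuming a table with columns id BIGINT and data STRING already exists):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// A generic Row stream; the TableSchema must match the target Iceberg table's columns.
DataStream<Row> rows = env.fromCollection(Arrays.asList(Row.of(1L, "a"), Row.of(2L, "b")));
TableSchema schema = TableSchema.builder()
.field("id", DataTypes.BIGINT())
.field("data", DataTypes.STRING())
.build();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
FlinkSink.forRow(rows, schema)
.tableLoader(tableLoader)
.build();
env.execute("Test Iceberg DataStream<Row> Sink");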
1.4 Overwriting data
To dynamically overwrite the data in an existing Iceberg table, set the overwrite flag on the FlinkSink builder.
StreamExecutionEnvironment env = ...;
DataStream<RowData> input = ... ;
Configuration hadoopConf = new Configuration();
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path", hadoopConf);
FlinkSink.forRowData(input)
.tableLoader(tableLoader)
.overwrite(true)
.build();
env.execute("Test Iceberg DataStream");
1.5 Rewrite files action
Iceberg provides an API that rewrites small files into large files by submitting a Flink batch job. The behavior of this Flink action is the same as Spark's rewriteDataFiles.
import org.apache.iceberg.flink.actions.Actions;
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
Table table = tableLoader.loadTable();
RewriteDataFilesActionResult result = Actions.forTable(table)
.rewriteDataFiles()
.execute();
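A target file size can also be set on the action. A minimal sketch, assuming a 128 MB target size (targetSizeInBytes is an optional tuning knob; the path is a placeholder):
import org.apache.iceberg.flink.actions.Actions;
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/path");
tableLoader.open(); // some Iceberg versions require opening the loader before loadTable()
Table table = tableLoader.loadTable();
// Rewrite small data files into files of roughly 128 MB.
RewriteDataFilesActionResult result = Actions.forTable(table)
.rewriteDataFiles()
.targetSizeInBytes(128 * 1024 * 1024L)
.execute();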
2. Local examples
2.1 Batch read of a Hadoop-catalog table
Create the class work.jiang.iceberg.flinkstream.hadoopcatalog.ButchReadHadoopCatalogIcebergTable.
The code is as follows:
package work.jiang.iceberg.flinkstream.hadoopcatalog;
import work.jiang.iceberg.util.IcebergUtil;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
/**
 * Reads operate on an existing table; if the table does not exist, an error is thrown.
 */
public class ButchReadHadoopCatalogIcebergTable {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
// Batch read of a Hadoop-catalog table
DataStream<RowData> batch = IcebergUtil.butchReadHadoopTable(env,"hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1");
// Print all records to stdout.
batch.map(new MapFunction<RowData, String>() {
@Override
public String map(RowData value) throws Exception {
return value.getLong(0) +": " + value.getString(1);
}
}).print();
// Submit and execute this batch read job.
env.execute("Test Iceberg Batch Read");
}
}
2.2 Batch read of a Hive-catalog table
package work.jiang.iceberg.flinkstream.hivecatalog;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import work.jiang.iceberg.util.IcebergUtil;
public class ButchReadHiveCatalogIcebergTable {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
// Batch read of a Hive-catalog table
DataStream<RowData> batch = IcebergUtil.butchReadFormHiveCatalog(env,"iceberg_db_hive","sample");
// Print all records to stdout.
batch.map(new MapFunction<RowData, String>() {
@Override
public String map(RowData value) throws Exception {
return value.getLong(0) +": " + value.getString(1);
}
}).print();
// Submit and execute this batch read job.
env.execute("Test Iceberg Batch Read");
}
}
2.3 Streaming read of a Hadoop-catalog table
package work.jiang.iceberg.flinkstream.hadoopcatalog;
import work.jiang.iceberg.util.IcebergUtil;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
public class StreamReadHadoopCatalogIcebergTable {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
DataStream<RowData> stream = IcebergUtil.streamReadHadoopTable(env, "hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1",1359438792274225442L);
// Print all records to stdout.
// The start snapshot id comes from the snapshot file name under the table's metadata directory,
// e.g. snap-1359438792274225442-1-64b39fbc-9006-4d5f-80c5-b2387d613800.avro (snap-<SnapshotId>-*.avro).
stream.map(new MapFunction<RowData, String>() {
@Override
public String map(RowData value) throws Exception {
return value.getLong(0) +": "+ value.getString(1);
}
}).print();
// Submit and execute this streaming read job.
env.execute("Test Iceberg Streaming Read");
}
}
2.4 Streaming read of a Hive-catalog table
package work.jiang.iceberg.flinkstream.hivecatalog;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import work.jiang.iceberg.util.IcebergUtil;
public class StreamReadHiveCatalogIcebergTable {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
DataStream<RowData> stream = IcebergUtil.streamReadFormHiveCatalog(env,"iceberg_db_hive","sample");
// Print all records to stdout.
stream.map(new MapFunction<RowData, String>() {
@Override
public String map(RowData value) throws Exception {
return value.getLong(0) +": "+ value.getString(1);
}
}).print();
// Submit and execute this streaming read job.
env.execute("Test Iceberg Streaming Read");
}
}
Create the utility class work.jiang.iceberg.util.IcebergUtil:
package work.jiang.iceberg.util;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.TableSchema;
import org.apache.flink.table.data.RowData;
import org.apache.flink.types.Row;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.RewriteDataFilesActionResult;
import org.apache.iceberg.catalog.Namespace;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.flink.CatalogLoader;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.actions.Actions;
import org.apache.iceberg.flink.sink.FlinkSink;
import org.apache.iceberg.flink.source.FlinkSource;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.types.Types;
import java.util.HashMap;
import java.util.Map;
public class IcebergUtil {
// Batch read
public static DataStream<RowData> butchReadHadoopTable(StreamExecutionEnvironment env, String path){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
DataStream<RowData> batch = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(false)
.build();
return batch;
}
// Streaming read
public static DataStream<RowData> streamReadHadoopTable(StreamExecutionEnvironment env, String path,Long startSnapshotId){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
DataStream<RowData> stream = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(true)
.startSnapshotId(startSnapshotId)
.build();
return stream;
}
// Batch read (Hive catalog)
public static DataStream<RowData> butchReadFormHiveCatalog(StreamExecutionEnvironment env,String databaseName,String tableName){
CatalogLoader loader = createHiveCatalog("hdfs://cdh01:8020/user/iceberg/hive_catalog");
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of(databaseName), tableName);
TableLoader tableLoader = TableLoader.fromCatalog(loader,tableIdentifier);
DataStream<RowData> batch = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(false)
.build();
return batch;
}
// Streaming read (Hive catalog)
public static DataStream<RowData> streamReadFormHiveCatalog(StreamExecutionEnvironment env,String databaseName,String tableName){
CatalogLoader loader = createHiveCatalog("hdfs://cdh01:8020/user/iceberg/hive_catalog");
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of(databaseName), tableName);
TableLoader tableLoader = TableLoader.fromCatalog(loader,tableIdentifier);
DataStream<RowData> batch = FlinkSource.forRowData()
.env(env)
.tableLoader(tableLoader)
.streaming(true)
.build();
return batch;
}
}
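Note: the createHiveCatalog helper referenced above is added to this same class in section 2.6 below.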
2.5 Creating a Hadoop-catalog table
In the class work.jiang.iceberg.flinkstream.WriteIceberg,
add the method createTableByHadoopTables:
// Create an Iceberg table with HadoopTables; the table directory is created automatically. Verified.
public static void createTableByHadoopTables(){
org.apache.iceberg.Schema iceberg_schema = new Schema(
Types.NestedField.optional(1, "id", Types.LongType.get()),
Types.NestedField.optional(2, "order_id", Types.LongType.get()),
Types.NestedField.optional(3, "product_id", Types.LongType.get()),
Types.NestedField.optional(4, "product_price", Types.StringType.get()),
Types.NestedField.optional(5, "product_quantity", Types.IntegerType.get()),
Types.NestedField.optional(6, "product_name", Types.StringType.get())
);
String location = "hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_3";
IcebergUtil.createTableByHadoopTables(iceberg_schema,"id",location);
}
Add the method createTableByHadoopTables to work.jiang.iceberg.util.IcebergUtil:
// Create a table with a partition spec
public static void createTableByHadoopTables(org.apache.iceberg.Schema iceberg_schema,String partitionColumnName,String location){
// Set the partition column
PartitionSpec iceberg_partition = PartitionSpec.builderFor(iceberg_schema).identity(partitionColumnName).build();
HadoopTables hadoopTable = new HadoopTables(new Configuration());
if(!hadoopTable.exists(location)){
hadoopTable.create(iceberg_schema, iceberg_partition, location);
}
}
In the main method of work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Create the table
createTableByHadoopTables();
env.execute();
}
Then check the hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/ path: a t_iceberg_sample_3 directory has been created, and its metadata directory contains the table's metadata files, so the table was created successfully.
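To verify from code instead of browsing HDFS, a small sketch (assuming the same table location) that loads the table and prints its schema and partition spec:
// Quick sanity check: load the newly created table and print its schema and partition spec.
HadoopTables tables = new HadoopTables(new Configuration());
Table table = tables.load("hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_3");
System.out.println(table.schema());
System.out.println(table.spec());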
2.6 Creating a Hive-catalog table
Add the method createTableByHiveTables to work.jiang.iceberg.flinkstream.WriteIceberg:
// Create an Iceberg table through the Hive catalog; the table directory is created automatically. Verified.
public static void createTableByHiveTables(){
org.apache.iceberg.Schema iceberg_schema = new Schema(
Types.NestedField.optional(1, "id", Types.LongType.get()),
Types.NestedField.optional(2, "order_id", Types.LongType.get()),
Types.NestedField.optional(3, "product_id", Types.LongType.get()),
Types.NestedField.optional(4, "product_price", Types.StringType.get()),
Types.NestedField.optional(5, "product_quantity", Types.IntegerType.get()),
Types.NestedField.optional(6, "product_name", Types.StringType.get())
);
String cataloglocation= "hdfs://cdh01:8020/user/iceberg/hive_catalog";
IcebergUtil.createTableByHiveTables(iceberg_schema,"product_id",cataloglocation,"iceberg_db_hive","t_iceberg_sample_4");
}
Add the method createTableByHiveTables to work.jiang.iceberg.util.IcebergUtil:
// Create a partitioned table in the Hive catalog
public static void createTableByHiveTables(org.apache.iceberg.Schema iceberg_schema,String partitionColumnName,String cataloglocation,String databaseName,String tableName){
CatalogLoader hive_catalog = IcebergUtil.createHiveCatalog(cataloglocation);
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of(databaseName), tableName);
// Set the partition column
PartitionSpec iceberg_partition = PartitionSpec.builderFor(iceberg_schema).identity(partitionColumnName).build();
// hive_catalog.loadCatalog() can create, drop, and load Iceberg tables
if(!hive_catalog.loadCatalog().tableExists(tableIdentifier)){
Table tableSource = hive_catalog.loadCatalog().createTable(tableIdentifier,iceberg_schema,iceberg_partition);
}
}
// Create a Hive catalog loader
public static CatalogLoader createHiveCatalog(String catalogPath){
Map<String, String> properties = new HashMap<>();
properties.put("type", "iceberg");
properties.put("catalog-type", "hive");
properties.put("property-version", "1");
properties.put("warehouse", catalogPath);
CatalogLoader hive_catalog = CatalogLoader.hive("hive_catalog", new Configuration(), properties);
return hive_catalog;
}
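The sketch above does not set a metastore address, so CatalogLoader.hive relies on a hive-site.xml that is reachable through the classpath or the passed Configuration. Assuming the metastore listens on thrift://cdh01:9083 (a hypothetical address), the "uri" property can also be set explicitly:
// Variant of createHiveCatalog that passes the Hive metastore address explicitly.
Map<String, String> properties = new HashMap<>();
properties.put("warehouse", "hdfs://cdh01:8020/user/iceberg/hive_catalog");
properties.put("uri", "thrift://cdh01:9083"); // hypothetical metastore address
CatalogLoader hive_catalog = CatalogLoader.hive("hive_catalog", new Configuration(), properties);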
In the main method of work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
createTableByHiveTables();
env.execute();
}
2.7 Appending to a Hadoop-catalog table
2.7.1 Input type DataStream<Row>
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Write from a collection. Verified.
DataStream<Row> input = env.fromCollection(
Arrays.asList(Row.of(10L, "gg"), Row.of(11L, "hh")));
TableSchema schema = TableSchema.builder()
.field("id", DataTypes.BIGINT())
.field("data", DataTypes.STRING())
.build();
// Append
IcebergUtil.appendWriteHadoopTableFromDataStreamRow(input,schema,"hdfs:///user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1");
env.execute();
}
}
Add the method appendWriteHadoopTableFromDataStreamRow to work.jiang.iceberg.util.IcebergUtil:
// Write: append data
public static void appendWriteHadoopTableFromDataStreamRow(DataStream<Row> input, TableSchema schema, String path){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
FlinkSink.forRow(input,schema)
.tableLoader(tableLoader)
.build();
}
2.7.2 Input type DataStream<RowData>
Class work.jiang.iceberg.flinkstream.WriteIceberg:
// Write data into the table
public static void dataTest(StreamExecutionEnvironment env,TableLoader tableLoader){
ArrayList<String> list = new ArrayList<>();
Random random = new Random();
for (int i =1 ; i<= 10; i++){
long id = random.nextInt(100);
long orderId = random.nextInt(20);
long product_id = random.nextInt(20);
String product_price = "12";
int product_quantity = random.nextInt(10);
String product_name = "apple" + i;
OrderDetail order = new OrderDetail(id,orderId,product_id,product_price,product_quantity,product_name);
list.add(JSONObject.toJSONString(order));
}
DataStream<String> data = env.fromCollection(list);
DataStream<RowData> input = data.map(item -> {
JSONObject jsonData = JSONObject.parseObject(item);
// Number of fields in the row
GenericRowData rowData = new GenericRowData(6);
rowData.setField(0, jsonData.getLongValue("id"));
rowData.setField(1, jsonData.getLongValue("orderId"));
rowData.setField(2, jsonData.getLongValue("product_id"));
rowData.setField(3, StringData.fromString(jsonData.getString("product_price")));
rowData.setField(4, jsonData.getInteger("product_quantity"));
rowData.setField(5, StringData.fromString(jsonData.getString("product_name")));
return rowData;
});
// Append the data
IcebergUtil.appendWriteFromDataStreamRowData(input,tableLoader);
}
Class work.jiang.iceberg.util.IcebergUtil:
// Write: append data
public static void appendWriteFromDataStreamRowData(DataStream<RowData> input, TableLoader tableLoader){
FlinkSink.forRowData(input).tableLoader(tableLoader).build();
}
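The OrderDetail POJO serialized above is not shown in the article; a minimal sketch, assuming public fields whose names match the JSON keys read back in the map function:
// Minimal POJO for the JSON round-trip in dataTest(); fastjson serializes the public fields by name.
public class OrderDetail {
public long id;
public long orderId;
public long product_id;
public String product_price;
public int product_quantity;
public String product_name;
public OrderDetail() {}
public OrderDetail(long id, long orderId, long product_id, String product_price, int product_quantity, String product_name) {
this.id = id;
this.orderId = orderId;
this.product_id = product_id;
this.product_price = product_price;
this.product_quantity = product_quantity;
this.product_name = product_name;
}
}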
2.8 Overwriting a Hadoop-catalog table
2.8.1 Input type DataStream<Row>
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Write from a collection. Verified.
DataStream<Row> input = env.fromCollection(
Arrays.asList(Row.of(10L, "gg"), Row.of(11L, "hh")));
TableSchema schema = TableSchema.builder()
.field("id", DataTypes.BIGINT())
.field("data", DataTypes.STRING())
.build();
// Overwrite
IcebergUtil.overWriteHadoopTableFromDataStreamRow(input,schema,"hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1");
env.execute();
}
}
Class work.jiang.iceberg.util.IcebergUtil:
// Write: overwrite data
public static void overWriteHadoopTableFromDataStreamRow(DataStream<Row> input, TableSchema schema, String path){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
FlinkSink.forRow(input,schema).tableLoader(tableLoader).overwrite(true).build();
}
2.8.2 Input type DataStream<RowData>
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Write data into the table
TableLoader tableLoader = TableLoader.fromHadoopTable("hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_3");
dataTest(env,tableLoader);
env.execute();
}
// Write data into the table
public static void dataTest(StreamExecutionEnvironment env,TableLoader tableLoader){
ArrayList<String> list = new ArrayList<>();
Random random = new Random();
for (int i =1 ; i<= 10; i++){
long id = random.nextInt(100);
long orderId = random.nextInt(20);
long product_id = random.nextInt(20);
String product_price = "12";
int product_quantity = random.nextInt(10);
String product_name = "apple" + i;
OrderDetail order = new OrderDetail(id,orderId,product_id,product_price,product_quantity,product_name);
list.add(JSONObject.toJSONString(order));
}
DataStream<String> data = env.fromCollection(list);
DataStream<RowData> input = data.map(item -> {
JSONObject jsonData = JSONObject.parseObject(item);
// Number of fields in the row
GenericRowData rowData = new GenericRowData(6);
rowData.setField(0, jsonData.getLongValue("id"));
rowData.setField(1, jsonData.getLongValue("orderId"));
rowData.setField(2, jsonData.getLongValue("product_id"));
rowData.setField(3, StringData.fromString(jsonData.getString("product_price")));
rowData.setField(4, jsonData.getInteger("product_quantity"));
rowData.setField(5, StringData.fromString(jsonData.getString("product_name")));
return rowData;
});
// Overwrite the data
IcebergUtil.overWriteFromDataStreamRowData(input,tableLoader);
}
Class work.jiang.iceberg.util.IcebergUtil:
// Write: overwrite data
public static void overWriteFromDataStreamRowData(DataStream<RowData> input, TableLoader tableLoader){
FlinkSink.forRowData(input).tableLoader(tableLoader).overwrite(true).build();
}
2.9 Appending to / overwriting a Hive-catalog table
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TableLoader tableLoader2 = IcebergUtil.getHiveCatalogTableLoader("hdfs://cdh01:8020/user/iceberg/hive_catalog","iceberg_db_hive","t_iceberg_sample_3");
dataTest(env,tableLoader2);
env.execute();
}
// Write data into the table
public static void dataTest(StreamExecutionEnvironment env,TableLoader tableLoader){
ArrayList<String> list = new ArrayList<>();
Random random = new Random();
for (int i =1 ; i<= 10; i++){
long id = random.nextInt(100);
long orderId = random.nextInt(20);
long product_id = random.nextInt(20);
String product_price = "12";
int product_quantity = random.nextInt(10);
String product_name = "apple" + i;
OrderDetail order = new OrderDetail(id,orderId,product_id,product_price,product_quantity,product_name);
list.add(JSONObject.toJSONString(order));
}
DataStream<String> data = env.fromCollection(list);
DataStream<RowData> input = data.map(item -> {
JSONObject jsonData = JSONObject.parseObject(item);
// Number of fields in the row
GenericRowData rowData = new GenericRowData(6);
rowData.setField(0, jsonData.getLongValue("id"));
rowData.setField(1, jsonData.getLongValue("orderId"));
rowData.setField(2, jsonData.getLongValue("product_id"));
rowData.setField(3, StringData.fromString(jsonData.getString("product_price")));
rowData.setField(4, jsonData.getInteger("product_quantity"));
rowData.setField(5, StringData.fromString(jsonData.getString("product_name")));
return rowData;
});
// Overwrite the data (use either overwrite or append, as needed)
IcebergUtil.overWriteFromDataStreamRowData(input,tableLoader);
// Append the data
IcebergUtil.appendWriteFromDataStreamRowData(input,tableLoader);
}
Class work.jiang.iceberg.util.IcebergUtil:
// Write: overwrite data
public static void overWriteFromDataStreamRowData(DataStream<RowData> input, TableLoader tableLoader){
FlinkSink.forRowData(input).tableLoader(tableLoader).overwrite(true).build();
}
// Write: append data
public static void appendWriteFromDataStreamRowData(DataStream<RowData> input, TableLoader tableLoader){
FlinkSink.forRowData(input).tableLoader(tableLoader).build();
}
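The getHiveCatalogTableLoader helper called in the main method above is not shown in the article; a minimal sketch, assuming it simply combines createHiveCatalog with TableLoader.fromCatalog:
// Build a TableLoader for a table managed by the Hive catalog.
public static TableLoader getHiveCatalogTableLoader(String catalogLocation, String databaseName, String tableName){
CatalogLoader catalogLoader = createHiveCatalog(catalogLocation);
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of(databaseName), tableName);
return TableLoader.fromCatalog(catalogLoader, tableIdentifier);
}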
2.11 Modifying the schema of a Hadoop-catalog table
Class work.jiang.iceberg.flinkstream.WriteIceberg:
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Change the table schema
IcebergUtil.changeSchemaHadoopTable();
DataStream<Row> input = env.fromCollection(
Arrays.asList(Row.of(10L, "gg","2022-07-06 16:10:11"), Row.of(11L, "hh","2022-07-06 16:10:12")));
TableSchema schema = TableSchema.builder()
.field("id", DataTypes.BIGINT())
.field("data", DataTypes.STRING()).field("updatetime", DataTypes.STRING())
.build();
IcebergUtil.appendWriteHadoopTableFromDataStreamRow(input,schema,"hdfs:///user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_1");
env.execute();
}
Class work.jiang.iceberg.util.IcebergUtil:
public class IcebergUtil {
/**
 * Available table update operations:
 * updateSchema - update the table schema
 * updateProperties - update table properties
 * updateLocation - update the table's base location
 * newAppend - append data files
 * newFastAppend - append data files without compacting metadata
 * newOverwrite - append data files and delete overwritten files
 * newDelete - delete data files
 * newRewrite - rewrite data files; replaces existing files with new versions
 * newTransaction - create a new table-level transaction
 * rewriteManifests - rewrite manifest data by clustering files, to speed up scan planning
 * rollback - roll back the table state to a specific snapshot
 */
public static void changeSchemaHadoopTable(){
CatalogLoader hadoop_catalog = IcebergUtil.createHadoopCatalog("hdfs://cdh01:8020/user/iceberg/hadoop_catalog");
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of("iceberg_db"), "t_iceberg_sample_1");
// Load the table
Table tableSource = hadoop_catalog.loadCatalog().loadTable(tableIdentifier);
tableSource.updateSchema()
.addColumn("updatetime", Types.StringType.get())
.commit();
}
// Write: append data
public static void appendWriteHadoopTableFromDataStreamRow(DataStream<Row> input, TableSchema schema, String path){
TableLoader tableLoader = TableLoader.fromHadoopTable(path);
FlinkSink.forRow(input,schema)
.tableLoader(tableLoader)
.build();
}
}
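The createHadoopCatalog helper used in changeSchemaHadoopTable is also not shown in the article; a minimal sketch, assuming it mirrors createHiveCatalog but builds a Hadoop catalog rooted at the given warehouse path:
// Create a Hadoop catalog loader rooted at the given warehouse path.
public static CatalogLoader createHadoopCatalog(String warehousePath){
Map<String, String> properties = new HashMap<>();
properties.put("warehouse", warehousePath);
return CatalogLoader.hadoop("hadoop_catalog", new Configuration(), properties);
}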
2.12 Modifying the schema of a Hive-catalog table
Class work.jiang.iceberg.util.IcebergUtil:
Modifying the schema of a Hive-catalog table is similar to the Hadoop-catalog case; only the way the catalog is obtained differs.
public static void changeSchemaHiveTable(){
CatalogLoader hive_catalog = IcebergUtil.createHiveCatalog("hdfs://cdh01:8020/user/iceberg/hive_catalog");
// Build the table identifier
TableIdentifier tableIdentifier = TableIdentifier.of(Namespace.of("iceberg_db_hive"), "t_iceberg_sample_1");
// Load the table
Table tableSource = hive_catalog.loadCatalog().loadTable(tableIdentifier);
tableSource.updateSchema()
.addColumn("updatetime", Types.StringType.get())
.commit();
}
2.13 Compacting small files in a Hadoop-catalog table
public class WriteIceberg {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Compact small files
IcebergUtil.overWriteHadoopTableMiniFileToBigFile("hdfs://cdh01:8020/user/iceberg/hadoop_catalog/iceberg_db/t_iceberg_sample_2");
env.execute();
}
Class work.jiang.iceberg.util.IcebergUtil:
// Rewrite files: Iceberg provides an API that rewrites small files into large files by submitting a Flink batch job
public static void overWriteHadoopTableMiniFileToBigFile(String location){
TableLoader tableLoader = TableLoader.fromHadoopTable(location);
Table table = tableLoader.loadTable();
RewriteDataFilesActionResult result = Actions.forTable(table)
.rewriteDataFiles()
.execute();
}
2.14 Compacting small files in a Hive-catalog table
This is similar to the Hadoop-catalog case; only the way the TableLoader is obtained differs.
// Rewrite files: Iceberg provides an API that rewrites small files into large files by submitting a Flink batch job
public static void overWriteHiveTableMiniFileToBigFile(String hivecatalogLocation,String databaseName,String tableName){
TableLoader tableLoader = getHiveCatalogTableLoader(hivecatalogLocation, databaseName,tableName);
Table table = tableLoader.loadTable();
RewriteDataFilesActionResult result = Actions.forTable(table)
.rewriteDataFiles()
.execute();
}