Implementing ClickHouse batch writes in Java: the ClickHouse write process

Reposted

doscommand 2024-04-09 13:04:39


This article walks through the source code of the write path for ClickHouse materialized views, based on version v22.8.14.53-lts.

StorageMaterializedView

Let's start with the constructor of the materialized view:

StorageMaterializedView::StorageMaterializedView(
    const StorageID & table_id_,
    ContextPtr local_context,
    const ASTCreateQuery & query,
    const ColumnsDescription & columns_,
    bool attach_,
    const String & comment)
    : IStorage(table_id_), WithMutableContext(local_context->getGlobalContext())
{
    StorageInMemoryMetadata storage_metadata;
    storage_metadata.setColumns(columns_);

    ......

    if (!has_inner_table)
    {
        target_table_id = query.to_table_id;
    }
    else if (attach_)
    {
        /// If there is an ATTACH request, then the internal table must already be created.
        target_table_id = StorageID(getStorageID().database_name, generateInnerTableName(getStorageID()), query.to_inner_uuid);
    }
    else
    {
        /// We will create a query to create an internal table.
        auto create_context = Context::createCopy(local_context);
        auto manual_create_query = std::make_shared<ASTCreateQuery>();
        manual_create_query->setDatabase(getStorageID().database_name);
        manual_create_query->setTable(generateInnerTableName(getStorageID()));
        manual_create_query->uuid = query.to_inner_uuid;

        auto new_columns_list = std::make_shared<ASTColumns>();
        new_columns_list->set(new_columns_list->columns, query.columns_list->columns->ptr());

        manual_create_query->set(manual_create_query->columns_list, new_columns_list);
        manual_create_query->set(manual_create_query->storage, query.storage->ptr());

        InterpreterCreateQuery create_interpreter(manual_create_query, create_context);
        create_interpreter.setInternal(true);
        create_interpreter.execute();

        target_table_id = DatabaseCatalog::instance().getTable({manual_create_query->getDatabase(), manual_create_query->getTable()}, getContext())->getStorageID();
    }
}

The code above shows that a materialized view can be set up in several ways, which fall into three cases:

With an explicit target table:

create table src(id Int32) Engine=Memory();
create table dest(id Int32) Engine=Memory();

create materialized view mv to dest as select * from src;

With this form, target_table_id is set to the table_id of the dest table.

Without an explicit target table:

create table src(id Int32) Engine=Memory();

create materialized view mv Engine=Memory() as select * from src;

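With this form, a name beginning with .inner. is first generated for the target table from the materialized view's own StorageID (generateInnerTableName(getStorageID())), for example .inner.5ef4ec2c-efb1-4918-bf6c-59de2edb54cf. The inner table is then created with query.to_inner_uuid as its UUID, and the StorageID of the newly created table becomes target_table_id.

The third case is not really a creation syntax. It is the ATTACH path, taken when ClickHouse starts up or when a materialized view that was detached is attached again. In that case the inner table must already exist.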
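The three branches of the constructor can be summarised in a small standalone sketch. Everything here is a simplified stand-in invented for illustration: TableId, CreateViewQuery, innerTableName and resolveTarget are hypothetical names, not ClickHouse types, and the ".inner." + view name scheme is an assumption of the sketch.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins for StorageID and ASTCreateQuery.
struct TableId
{
    std::string database;
    std::string table;
    bool operator==(const TableId & other) const
    {
        return database == other.database && table == other.table;
    }
};

struct CreateViewQuery
{
    std::optional<TableId> to_table; // set when the statement has a TO clause
};

// Naming scheme assumed for this sketch only: ".inner." + view name.
std::string innerTableName(const TableId & view_id)
{
    return ".inner." + view_id.table;
}

enum class TargetKind { ExplicitTo, AttachExistingInner, CreateInner };

struct ResolvedTarget
{
    TargetKind kind;
    TableId id;
};

// Mirrors the three branches of the constructor: TO table, ATTACH, fresh CREATE.
ResolvedTarget resolveTarget(const TableId & view_id, const CreateViewQuery & query, bool attach)
{
    assert(!view_id.table.empty());

    if (query.to_table)
        return {TargetKind::ExplicitTo, *query.to_table};

    TableId inner{view_id.database, innerTableName(view_id)};
    if (attach)
        return {TargetKind::AttachExistingInner, inner}; // the inner table must already exist
    return {TargetKind::CreateInner, inner};             // the caller would run the CREATE here
}
```

In the real constructor the CreateInner branch also executes the generated CREATE query and then looks the table up again in the catalog; the sketch only captures which identifier is chosen.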

StorageMaterializedView::read

void StorageMaterializedView::read(
    QueryPlan & query_plan,
    const Names & column_names,
    const StorageSnapshotPtr & storage_snapshot,
    SelectQueryInfo & query_info,
    ContextPtr local_context,
    QueryProcessingStage::Enum processed_stage,
    const size_t max_block_size,
    const size_t num_streams)
{
    /// Get the target table instance
    auto storage = getTargetTable();
    auto lock = storage->lockForShare(local_context->getCurrentQueryId(), local_context->getSettingsRef().lock_acquire_timeout);
    auto target_metadata_snapshot = storage->getInMemoryMetadataPtr();
    auto target_storage_snapshot = storage->getStorageSnapshot(target_metadata_snapshot, local_context);

    if (query_info.order_optimizer)
        query_info.input_order_info = query_info.order_optimizer->getInputOrder(target_metadata_snapshot, local_context);

    storage->read(query_plan, column_names, target_storage_snapshot, query_info, local_context, processed_stage, max_block_size, num_streams);

    if (query_plan.isInitialized())
    {
        /// Header (block structure) that the materialized view exposes for this processing stage
        auto mv_header = getHeaderForProcessingStage(column_names, storage_snapshot, query_info, local_context, processed_stage);
        /// Header (block structure) actually produced by the query plan on the target table
        auto target_header = query_plan.getCurrentDataStream().header;

        /// Remove from target_header the columns that do not exist in the materialized view
        removeNonCommonColumns(mv_header, target_header);

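        /// A Distributed table may process the query only up to a given stage, so the header
        /// may not contain every column of the materialized view (for example with GROUP BY).
        /// Therefore remove from mv_header the columns the query does not need.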
        removeNonCommonColumns(target_header, mv_header);

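        /// When mv_header and target_header still differ in structure, an expression step is
        /// added to the pipeline to convert between them, e.g. Decimal(38, 6) -> Decimal(16, 6).
        /// The conversion matches columns by name and casts their types.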
        if (!blocksHaveEqualStructure(mv_header, target_header))
        {
            auto converting_actions = ActionsDAG::makeConvertingActions(target_header.getColumnsWithTypeAndName(),
                                                                        mv_header.getColumnsWithTypeAndName(),
                                                                        ActionsDAG::MatchColumnsMode::Name);
            auto converting_step = std::make_unique<ExpressionStep>(query_plan.getCurrentDataStream(), converting_actions);
            converting_step->setStepDescription("Convert target table structure to MaterializedView structure");
            query_plan.addStep(std::move(converting_step));
        }

        query_plan.addStorageHolder(storage);
        query_plan.addTableLock(std::move(lock));
    }
}

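The code above shows that a materialized view is only a logical description: the data lives in the target table, and a read actually operates on the target table. The read path may also have to reconcile the block headers of the two sides and add an expression step to convert between them.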
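The header reconciliation in read() can be illustrated with a minimal sketch. Column, Header, hasColumn and columnsNeedingCast are hypothetical helpers written for this example; removeNonCommonColumns here is a toy reimplementation that follows the argument order used in the listing (columns missing from the first argument are dropped from the second), not the ClickHouse function itself.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one column of a block header.
struct Column
{
    std::string name;
    std::string type;
};
using Header = std::vector<Column>;

bool hasColumn(const Header & header, const std::string & name)
{
    return std::any_of(header.begin(), header.end(),
                       [&](const Column & c) { return c.name == name; });
}

// Drop from `target` every column whose name is absent from `source`.
void removeNonCommonColumns(const Header & source, Header & target)
{
    target.erase(std::remove_if(target.begin(), target.end(),
                                [&](const Column & c) { return !hasColumn(source, c.name); }),
                 target.end());
}

// Names of columns present in both headers with different types:
// these are the ones a converting step would have to cast.
std::vector<std::string> columnsNeedingCast(const Header & mv_header, const Header & target_header)
{
    std::vector<std::string> result;
    for (const auto & mv_col : mv_header)
        for (const auto & target_col : target_header)
            if (mv_col.name == target_col.name && mv_col.type != target_col.type)
                result.push_back(mv_col.name);
    return result;
}

void exampleReconcile()
{
    Header mv = {{"id", "Int32"}, {"amount", "Decimal(16, 6)"}, {"only_mv", "String"}};
    Header target = {{"id", "Int32"}, {"amount", "Decimal(38, 6)"}, {"only_target", "String"}};
    removeNonCommonColumns(mv, target); // drops only_target
    removeNonCommonColumns(target, mv); // drops only_mv
    assert(mv.size() == 2 && target.size() == 2);
    assert((columnsNeedingCast(mv, target) == std::vector<std::string>{"amount"}));
}
```

After both calls the two headers contain the same column names, so any remaining difference is a type difference, which is the case the converting step exists for.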

StorageMaterializedView::write

SinkToStoragePtr StorageMaterializedView::write(const ASTPtr & query, const StorageMetadataPtr & /*metadata_snapshot*/, ContextPtr local_context)
{
    auto storage = getTargetTable();
    auto lock = storage->lockForShare(local_context->getCurrentQueryId(), local_context->getSettingsRef().lock_acquire_timeout);

    auto metadata_snapshot = storage->getInMemoryMetadataPtr();
    auto sink = storage->write(query, metadata_snapshot, local_context);

    sink->addTableLock(lock);
    return sink;
}

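Likewise, a write stores the data in the target table.

As is well known, writing to the source table triggers a write to the materialized view, which in turn writes the data to the target table. Let's look at how this is implemented. SQL execution always goes from IInterpreter to a concrete InterpreterXxx, which we won't repeat here. An insert ends up in InterpreterInsertQuery, so we start tracing from InterpreterInsertQuery::execute().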
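As a rough illustration of this delegation, the sketch below models a view that stores nothing itself and forwards both read and write to its target. Storage, MemoryTable and MaterializedViewFacade are hypothetical types made up for the example and are far simpler than ClickHouse's IStorage (no locks, metadata snapshots, or sinks).

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical minimal storage interface; not ClickHouse's IStorage.
struct Storage
{
    virtual ~Storage() = default;
    virtual void write(const std::vector<int> & block) = 0;
    virtual std::vector<int> read() const = 0;
};

struct MemoryTable : Storage
{
    std::vector<int> rows;

    void write(const std::vector<int> & block) override
    {
        rows.insert(rows.end(), block.begin(), block.end());
    }
    std::vector<int> read() const override { return rows; }
};

// The view holds no rows of its own and forwards both operations to its target.
struct MaterializedViewFacade : Storage
{
    std::shared_ptr<Storage> target;

    explicit MaterializedViewFacade(std::shared_ptr<Storage> target_) : target(std::move(target_)) {}

    void write(const std::vector<int> & block) override { target->write(block); }
    std::vector<int> read() const override { return target->read(); }
};

void exampleDelegation()
{
    auto dest = std::make_shared<MemoryTable>();
    MaterializedViewFacade mv(dest);
    mv.write({1, 2, 3});
    assert(dest->rows.size() == 3);    // the rows landed in the target table
    assert(mv.read() == dest->read()); // reading the view reads the target
}
```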

InterpreterInsertQuery::execute()

BlockIO InterpreterInsertQuery::execute()
{
    ......
    std::vector<Chain> out_chains;
    if (!distributed_pipeline || query.watch)
    {
        size_t out_streams_size = 1;
        ......
        for (size_t i = 0; i < out_streams_size; ++i)
        {
            auto out = buildChainImpl(table, metadata_snapshot, query_sample_block, nullptr, nullptr);
            out_chains.emplace_back(std::move(out));
        }
    }
    ......
}

execute() builds the output chains through buildChainImpl(). Unless the table disallows pushing to views, buildChainImpl() calls buildPushingToViewsChain(), which looks up the views that depend on the table.

buildPushingToViewsChain()

This function is very long, so only the parts relevant to this article are shown.

Chain buildPushingToViewsChain(
    const StoragePtr & storage,
    const StorageMetadataPtr & metadata_snapshot,
    ContextPtr context,
    const ASTPtr & query_ptr,
    bool no_destination,
    ThreadStatusesHolderPtr thread_status_holder,
    std::atomic_uint64_t * elapsed_counter_ms,
    const Block & live_view_header)
{
    ......
    
    auto table_id = storage->getStorageID();
    auto views = DatabaseCatalog::instance().getDependentViews(table_id);

    ......

    std::vector<Chain> chains;

    for (const auto & view_id : views)
    {
        auto view = DatabaseCatalog::instance().tryGetTable(view_id, context);
        
        ......

        if (auto * materialized_view = dynamic_cast<StorageMaterializedView *>(view.get()))
        {
            ......
            
            StoragePtr inner_table = materialized_view->getTargetTable();
            auto inner_table_id = inner_table->getStorageID();
            auto inner_metadata_snapshot = inner_table->getInMemoryMetadataPtr();
            query = view_metadata_snapshot->getSelectQuery().inner_query;
            target_name = inner_table_id.getFullTableName();

            Block header;

            /// Get list of columns we get from select query.
            if (select_context->getSettingsRef().allow_experimental_analyzer)
                header = InterpreterSelectQueryAnalyzer::getSampleBlock(query, select_context);
            else
                header = InterpreterSelectQuery(query, select_context, SelectQueryOptions().analyze()).getSampleBlock();

            /// Insert only columns returned by select.
            Names insert_columns;
            const auto & inner_table_columns = inner_metadata_snapshot->getColumns();
            for (const auto & column : header)
            {
                /// But skip columns which storage doesn't have.
                if (inner_table_columns.hasPhysical(column.name))
                    insert_columns.emplace_back(column.name);
            }

            InterpreterInsertQuery interpreter(nullptr, insert_context, false, false, false);
            out = interpreter.buildChain(inner_table, inner_metadata_snapshot, insert_columns, thread_status_holder, view_counter_ms);
            out.addStorageHolder(view);
            out.addStorageHolder(inner_table);
        }
        else if (auto * live_view = dynamic_cast<StorageLiveView *>(view.get()))
        {
            runtime_stats->type = QueryViewsLogElement::ViewType::LIVE;
            query = live_view->getInnerQuery(); // Used only to log in system.query_views_log
            out = buildPushingToViewsChain(
                view, view_metadata_snapshot, insert_context, ASTPtr(), true, thread_status_holder, view_counter_ms, storage_header);
        }
        else if (auto * window_view = dynamic_cast<StorageWindowView *>(view.get()))
        {
            runtime_stats->type = QueryViewsLogElement::ViewType::WINDOW;
            query = window_view->getMergeableQuery(); // Used only to log in system.query_views_log
            out = buildPushingToViewsChain(
                view, view_metadata_snapshot, insert_context, ASTPtr(), true, thread_status_holder, view_counter_ms);
        }
        else
            out = buildPushingToViewsChain(
                view, view_metadata_snapshot, insert_context, ASTPtr(), false, thread_status_holder, view_counter_ms);

        ......
}

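buildPushingToViewsChain() looks up the views that depend on the current table. The branches show three kinds of views: materialized views, live views, and window views; the final else covers any other dependent, which is treated as an ordinary table. The function is recursive. On the first call the current table is the source table, and for each dependent materialized view the if branch builds a chain for the view's target table with buildChain(). buildChain() goes through buildChainImpl(), which calls buildPushingToViewsChain() again for that target table, so views defined on the target table are picked up as well.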
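The recursion can be sketched with a toy catalog. This is not how ClickHouse builds its Chain objects: Block, View, Catalog and pushToTableAndViews are hypothetical names, the "SELECT" is just a function on a vector of integers, and the sketch executes the writes immediately instead of building a pipeline first. It also assumes the view graph has no cycles.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Block = std::vector<int>;

// Hypothetical description of a materialized view: a transform plus a target table.
struct View
{
    std::function<Block(const Block &)> select; // stands in for the view's SELECT
    std::string target;
};

struct Catalog
{
    std::map<std::string, Block> tables;                 // table name -> stored rows
    std::map<std::string, std::vector<View>> dependents; // source table -> views defined on it
};

// Write `block` into `table`, then push it through every dependent view.
// The recursive call handles views stacked on top of a view's target table.
void pushToTableAndViews(Catalog & catalog, const std::string & table, const Block & block)
{
    Block & rows = catalog.tables[table];
    rows.insert(rows.end(), block.begin(), block.end());

    auto it = catalog.dependents.find(table);
    if (it == catalog.dependents.end())
        return;

    for (const View & view : it->second)
    {
        assert(view.select && "a view needs a SELECT");
        pushToTableAndViews(catalog, view.target, view.select(block));
    }
}
```

With a view from src to dest1 and another from dest1 to dest2, a single call on src fills all three tables, which is the cascade behaviour the recursion is there to provide.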

buildChainImpl

buildChain() delegates to buildChainImpl(), which contains the actual implementation.

Chain InterpreterInsertQuery::buildChainImpl(
    const StoragePtr & table,
    const StorageMetadataPtr & metadata_snapshot,
    const Block & query_sample_block,
    ThreadStatusesHolderPtr thread_status_holder,
    std::atomic_uint64_t * elapsed_counter_ms)
{
    ......
    /// We create a pipeline of several streams, into which we will write data.
    Chain out;

    /// Keep a reference to the context to make sure it stays alive until the chain is executed and destroyed
    out.addInterpreterContext(context_ptr);

    /// NOTE: we explicitly ignore bound materialized views when inserting into Kafka Storage.
    ///       Otherwise we'll get duplicates when MV reads same rows again from Kafka.
    if (table->noPushingToViews() && !no_destination)
    {
        auto sink = table->write(query_ptr, metadata_snapshot, context_ptr);
        sink->setRuntimeData(thread_status, elapsed_counter_ms);
        out.addSource(std::move(sink));
    }
    else
    {
        out = buildPushingToViewsChain(table, metadata_snapshot, context_ptr, query_ptr, no_destination, thread_status_holder, elapsed_counter_ms);
    }

    ......
}

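Depending on whether the current table (or view) may push to dependent views, and on whether it has a destination, buildChainImpl() takes one of two paths. This is also what handles views cascaded on top of views: the chain nodes are built recursively and linked together.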

Chain InterpreterInsertQuery::buildChainImpl(
    const StoragePtr & table,
    const StorageMetadataPtr & metadata_snapshot,
    const Block & query_sample_block,
    ThreadStatusesHolderPtr thread_status_holder,
    std::atomic_uint64_t * elapsed_counter_ms)
{
    ...

    /// We create a pipeline of several streams, into which we will write data.
    Chain out;

    /// Keep a reference to the context to make sure it stays alive until the chain is executed and destroyed
    out.addInterpreterContext(context_ptr);

    /// NOTE: we explicitly ignore bound materialized views when inserting into Kafka Storage.
    ///       Otherwise we'll get duplicates when MV reads same rows again from Kafka.
    if (table->noPushingToViews() && !no_destination)  // noPushingToViews() is true for storages such as Kafka, where an insert must not be pushed to the attached materialized views
    {
        auto sink = table->write(query_ptr, metadata_snapshot, context_ptr);
        sink->setRuntimeData(thread_status, elapsed_counter_ms);
        out.addSource(std::move(sink));
    }
    else  // Build the chain that pushes the insert to the views; this is the key part
    {
        out = buildPushingToViewsChain(table, metadata_snapshot, context_ptr, query_ptr, no_destination, thread_status_holder, elapsed_counter_ms);
    }

    ...

    return out;
}
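The branch in buildChainImpl() can be condensed into a small decision function. TableInfo and planSinks are hypothetical names for this sketch. The handling of no_destination in the second path (the table itself is skipped and only the views receive the block) is an assumption about the parts of buildPushingToViewsChain() that are elided in the listing.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical summary of what the chain builder needs to know about a table.
struct TableInfo
{
    std::string name;
    bool no_pushing_to_views;                 // e.g. true for a Kafka-like storage
    std::vector<std::string> dependent_views; // views defined on this table
};

// Returns the names of the sinks an insert into `table` would end in.
std::vector<std::string> planSinks(const TableInfo & table, bool no_destination)
{
    assert(!table.name.empty());

    std::vector<std::string> sinks;
    if (table.no_pushing_to_views && !no_destination)
    {
        sinks.push_back(table.name); // direct write, bound views are ignored
        return sinks;
    }

    // Assumption: with no_destination set, the table itself is not written.
    if (!no_destination)
        sinks.push_back(table.name);
    for (const auto & view : table.dependent_views)
        sinks.push_back(view);
    return sinks;
}
```

For a table with no_pushing_to_views set, an ordinary insert yields a single sink, which matches the NOTE in the listing about avoiding duplicates when a materialized view reads the same rows from Kafka again.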

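Summary

So when data is written, the source table and its materialized views are wired into multiple output chains. A materialized view only processes the block being inserted and does not touch data already in the source table. The writes to the source table and to the target table form a single pipeline, and the insert only counts as successful once all of it has completed. The pipeline can run in parallel, which speeds up writes.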
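The point that a view only sees the block being inserted can be shown with a tiny model. SourceWithView and its viewSelect (standing in for a query like SELECT id * 10 FROM src) are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <vector>

using Block = std::vector<int>;

// Hypothetical pairing of a source table with one materialized view's target.
struct SourceWithView
{
    Block source_rows; // rows stored in the source table
    Block target_rows; // rows stored in the view's target table

    // Stand-in for the view query: multiply every inserted id by 10.
    static Block viewSelect(const Block & inserted)
    {
        Block out;
        for (int id : inserted)
            out.push_back(id * 10);
        return out;
    }

    // An insert stores the block in the source and feeds the same block to the view.
    // Rows already in source_rows are never re-read.
    void insert(const Block & block)
    {
        source_rows.insert(source_rows.end(), block.begin(), block.end());
        Block transformed = viewSelect(block);
        target_rows.insert(target_rows.end(), transformed.begin(), transformed.end());
    }
};

void exampleOnlyNewRows()
{
    SourceWithView t;
    t.source_rows = {1, 2}; // data that existed before the view was created
    t.insert({3});
    assert((t.source_rows == Block{1, 2, 3}));
    assert((t.target_rows == Block{30})); // only the new row reached the target
}
```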
