A brief look at dbt docs generate


rongfengliang 2024-05-07 10:51:13


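© Copyright belongs to the author: this is an original work by rongfengliang, published on the 51CTO blog. Please contact the author for permission to republish.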

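At its core, dbt docs generate collects the metadata of the dbt project (including project-level information) and of the related tables (the ones behind the dbt models), and then displays it through a bundled page that parses that metadata.
The current approach is static: the artifacts are generated first, then parsed and rendered entirely in the browser. That leaves several ways to present them: use dbt's docs serve command, or host the files on your own static web server (nginx or S3). A brief walkthrough follows.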
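
As a minimal sketch of the self-hosted option, the generated files can be served with Python's standard-library HTTP server. The helper name `make_docs_server` and the port are illustrative assumptions; the directory is the project's target path (`target` by default), which is where the generated index.html, manifest.json and catalog.json end up.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_docs_server(target_dir: str, port: int = 8080) -> ThreadingHTTPServer:
    """Build an HTTP server that serves the static files in target_dir.

    Hypothetical helper for illustration; not part of dbt.
    """
    # SimpleHTTPRequestHandler accepts a `directory` keyword (Python 3.7+),
    # so the process does not need to chdir into the target folder.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=target_dir)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted):
#   server = make_docs_server("target")
#   server.serve_forever()
```

Because everything is rendered in the browser, any static file server works the same way; nginx or an S3 bucket only needs to expose the same directory contents.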

Internal processing

The actual work is done by the GenerateTask class, which inherits from CompileTask.

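CLI decorators
As shown below, the command depends on profile, project, runtime_config, and manifest. Although write=False is passed, the manifest file is still written in practice, because the docs site depends on it: GenerateTask.run calls write_manifest itself when compilation is enabled.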

@requires.profile
@requires.project
@requires.runtime_config
@requires.manifest(write=False)
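
The order of the stacked decorators matters: the topmost one runs first, and each later one can build on what the earlier ones prepared. The following is a toy sketch of that pattern only. `requires_profile`, `requires_project` and the dictionary-based context are hypothetical stand-ins, not dbt's actual `requires` module.

```python
import functools
from typing import Any, Callable, Dict

Context = Dict[str, Any]
Command = Callable[[Context], Any]


def requires_profile(func: Command) -> Command:
    """Hypothetical decorator: load a 'profile' into the context."""
    @functools.wraps(func)
    def wrapper(ctx: Context) -> Any:
        ctx["profile"] = {"target": "dev"}  # placeholder value
        return func(ctx)
    return wrapper


def requires_project(func: Command) -> Command:
    """Hypothetical decorator: load a 'project', which needs the profile."""
    @functools.wraps(func)
    def wrapper(ctx: Context) -> Any:
        # Relies on the outer decorator having run first.
        if "profile" not in ctx:
            raise RuntimeError("profile must be loaded before project")
        ctx["project"] = {"name": "demo", "profile": ctx["profile"]}
        return func(ctx)
    return wrapper


@requires_profile   # outermost: its wrapper executes first
@requires_project
def docs_generate(ctx: Context) -> Any:
    # By the time the command body runs, both entries are available.
    return sorted(ctx)
```

Swapping the two decorators in this sketch raises the RuntimeError, which is why the dependency chain is written top to bottom.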

GenerateTask

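The main method of this class is run. It compiles the project, copies the static web assets, fetches the catalog (table metadata), and writes the manifest via write_manifest. The code is fairly clear: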

class GenerateTask(CompileTask):
    def run(self) -> CatalogArtifact:
        compile_results = None
        if self.args.compile:
            compile_results = CompileTask.run(self)
            if any(r.status == NodeStatus.Error for r in compile_results):
                fire_event(CannotGenerateDocs())
                return CatalogArtifact.from_results(
                    nodes={},
                    sources={},
                    generated_at=datetime.utcnow(),
                    errors=None,
                    compile_results=compile_results,
                )
 
        shutil.copyfile(
            DOCS_INDEX_FILE_PATH, os.path.join(self.config.project_target_path, "index.html")
        )
 
        for asset_path in self.config.asset_paths:
            to_asset_path = os.path.join(self.config.project_target_path, asset_path)
 
            if os.path.exists(to_asset_path):
                shutil.rmtree(to_asset_path)
 
            if os.path.exists(asset_path):
                shutil.copytree(asset_path, to_asset_path)
 
        if self.manifest is None:
            raise DbtInternalError("self.manifest was None in run!")
 
        adapter = get_adapter(self.config)
        with adapter.connection_named("generate_catalog"):
            fire_event(BuildingCatalog())
            catalog_table, exceptions = adapter.get_catalog(self.manifest)
 
        catalog_data: List[PrimitiveDict] = [
            dict(zip(catalog_table.column_names, map(dbt.utils._coerce_decimal, row)))
            for row in catalog_table
        ]
 
        catalog = Catalog(catalog_data)
 
        errors: Optional[List[str]] = None
        if exceptions:
            errors = [str(e) for e in exceptions]
 
        nodes, sources = catalog.make_unique_id_map(self.manifest)
        results = self.get_catalog_results(
            nodes=nodes,
            sources=sources,
            generated_at=datetime.utcnow(),
            compile_results=compile_results,
            errors=errors,
        )
 
        path = os.path.join(self.config.project_target_path, CATALOG_FILENAME)
        results.write(path)
        if self.args.compile:
            write_manifest(self.manifest, self.config.project_target_path)
 
        if exceptions:
            fire_event(WriteCatalogFailure(num_exceptions=len(exceptions)))
        fire_event(CatalogWritten(path=os.path.abspath(path)))
        return results
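
Once run has finished, the target directory holds the catalog file next to index.html and the manifest. A small sketch for inspecting it is shown here. It assumes the file is named `catalog.json` and has top-level `nodes` and `sources` mappings keyed by unique id, each entry carrying a `columns` mapping; that layout may differ between dbt versions. `summarize_catalog` is a hypothetical helper, not a dbt API.

```python
import json
import os
from typing import Dict


def summarize_catalog(target_dir: str) -> Dict[str, int]:
    """Return {unique_id: column_count} for catalog nodes and sources.

    Assumes the catalog layout described in the paragraph above.
    """
    path = os.path.join(target_dir, "catalog.json")
    with open(path, encoding="utf-8") as fh:
        catalog = json.load(fh)

    summary: Dict[str, int] = {}
    for section in ("nodes", "sources"):
        # Missing sections are treated as empty rather than as an error.
        for unique_id, entry in catalog.get(section, {}).items():
            summary[unique_id] = len(entry.get("columns", {}))
    return summary


# Usage:
#   for unique_id, n_columns in summarize_catalog("target").items():
#       print(unique_id, n_columns)
```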

get_catalog: building the catalog from the Manifest
The core idea is to collect the schema information from the Manifest, and then query the database to obtain the actual catalogs.

def get_catalog(self, manifest: Manifest) -> Tuple[agate.Table, List[Exception]]:
    schema_map = self._get_catalog_schemas(manifest)
 
    with executor(self.config) as tpe:
        futures: List[Future[agate.Table]] = []
        for info, schemas in schema_map.items():
            if len(schemas) == 0:
                continue
            name = ".".join([str(info.database), "information_schema"])
 
            fut = tpe.submit_connected(
                self, name, self._get_one_catalog, info, schemas, manifest
            )
            futures.append(fut)
 
        catalogs, exceptions = catch_as_completed(futures)
 
    return catalogs, exceptions
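
The fan-out above (one catalog query per database, results and exceptions collected separately) can be sketched with the standard library alone. This is a simplified analogue under stated assumptions, not dbt's implementation: `run_all` and `fetch` are hypothetical names, plain lists of dicts stand in for agate tables, and connection handling is left out.

```python
from concurrent.futures import Future, ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]


def run_all(
    fetch: Callable[[str], List[Row]], databases: Iterable[str]
) -> Tuple[List[Row], List[Exception]]:
    """Run fetch(db) for every database in parallel.

    Returns the merged rows plus the exceptions raised by failed queries,
    so one failing database does not discard the others' results.
    """
    rows: List[Row] = []
    errors: List[Exception] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures: List[Future] = [pool.submit(fetch, db) for db in databases]
        for fut in as_completed(futures):
            try:
                rows.extend(fut.result())
            except Exception as exc:  # collect instead of aborting
                errors.append(exc)
    return rows, errors
```

Returning the exceptions alongside the data mirrors what GenerateTask.run does with them: they are converted to strings and stored in the catalog's errors field instead of stopping the whole command.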