HDF5 VOL Connector to Apache Arrow


mb62de8abf75c00 2023-05-06 01:13:09


Excerpt from the paper HDF5 VOL Connector to Apache Arrow: http://cs.iit.edu/~scs/assets/files/ye2021hdf5vol_abstract.pdf


EXTENDED ABSTRACT

With ever-increasing dataset sizes and volumes, various file formats such as Apache Parquet [1], ORC [2], Avro [3] and Apache Arrow [4], [5] have been developed to store data efficiently. In recent years, Apache Arrow has become very popular in the Big Data analysis and cloud computing domains thanks to its standard columnar in-memory data representation and its efficient data processing and data transfer. The columnar data layout makes it possible to take advantage of SIMD (Single Instruction, Multiple Data) operations in modern processors. It also reduces copy-and-convert overhead when moving data from one system to another. Because Apache Arrow is considered efficient and integrating it can accelerate data analysis, there is a need to verify whether Apache Arrow can be used in High Performance Computing (HPC) systems.
However, most scientific applications currently prefer to use HDF5 [6], a widely used I/O middleware on HPC systems, to store and manage data. HDF5 is designed to store and manage high-volume, complex scientific data. Although it supports a variety of features, such as random access to individual objects, partial access to selected dataset regions, and internal data compression, it does not support columnar storage and is inefficient for column-wise access.
As mentioned above, Apache Arrow can create an efficient in-memory column store that can be used to manage streamed data. Accessing this data through HDF5 calls would allow applications to take advantage of transient, column-oriented data streams, such as real-time data from high-speed instruments and cameras. Moreover, bridging the gap between science applications and analytic tools that use HDF5 and Apache Arrow could bring new kinds of data together. Therefore, there is a need for a tool that supports accessing Apache Arrow data through native HDF5 calls without changing the applications. The objectives of this work are:

Design and implement an HDF5 VOL connector that allows applications to access Apache Arrow data through native HDF5 calls.
Explore its use for analyzing scientific data.

BACKGROUND

HDF5

HDF5 is a well-established and very flexible data model, parallel I/O library, and file format that can handle many storage options and needs. HDF5 is designed to store and manage high-volume, complex scientific data. It provides a rich set of predefined datatypes as well as an unlimited variety of complex user-defined datatypes. It also supports many powerful features for managing data, such as random access to individual objects, partial access to selected dataset regions, and internal data compression. Owing to its portability, efficiency, and the flexibility of its data model, HDF5 is widely used by a great number of scientific and industrial applications in the HPC community to store and manage the data they produce. By default, the HDF5 library uses its native file format when storing data and takes advantage of MPI-IO to perform parallel I/O.
With the emergence of new file formats and storage systems that do not comply with the POSIX I/O standard, there is a need to support them without modifying application code. To provide this capability and give developers more choices for storing data, the HDF5 library introduced the Virtual Object Layer (VOL), released in version 1.12 of the library. The VOL is a storage abstraction layer within the HDF5 library, implemented just below the public HDF5 API. It enables applications to store HDF5 data in many different storage systems (such as PDC, DAOS, and Hermes) by intercepting HDF5 I/O API calls and seamlessly rerouting them to the corresponding VOL connector back end, which translates these calls into the operations it needs to perform.
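The interception and rerouting idea can be pictured with a small model. The sketch below is only an illustration in Python, not the HDF5 C API: the class names `NativeConnector`, `ArrowConnector` and `VirtualObjectLayer`, and the `dataset_write` method, are hypothetical names chosen for this example. It shows that the application-facing call stays the same while the registered connector decides what actually happens.

```python
# Illustrative model of VOL-style dispatch; this is NOT the HDF5 C API.
# All class and method names here are hypothetical.

class NativeConnector:
    """Stands in for the default connector that writes the native file format."""
    def dataset_write(self, name, data):
        return f"native: wrote {len(data)} values to {name}"


class ArrowConnector:
    """Stands in for a connector that stores datasets as Arrow tables."""
    def dataset_write(self, name, data):
        return f"arrow: stored {len(data)} values from {name} as a table"


class VirtualObjectLayer:
    """Routes application-facing calls to whichever connector is registered."""
    def __init__(self, connector):
        self.connector = connector

    def dataset_write(self, name, data):
        # The application-facing call never changes; only the back end does.
        return self.connector.dataset_write(name, data)


def application(vol):
    # Application code is identical for every back end.
    return vol.dataset_write("/particles/x", [1.0, 2.0, 3.0])


print(application(VirtualObjectLayer(NativeConnector())))
print(application(VirtualObjectLayer(ArrowConnector())))
```

In HDF5 itself, a connector is typically selected without touching application logic, for example through the `HDF5_VOL_CONNECTOR` and `HDF5_PLUGIN_PATH` environment variables or by calling `H5Pset_vol` on a file access property list.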

Apache Arrow

Apache Arrow is an open-source, columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real time. It specifies a standard columnar in-memory format for representing structured, table-like datasets. In most cases, Apache Arrow acts as an interface between different programming languages and systems. With its columnar in-memory layout, it can process large amounts of data quickly by using SIMD (Single Instruction, Multiple Data) operations. Moreover, thanks to the standard columnar format, it reduces unnecessary copy-and-convert overhead when moving data from one system to another. Apache Arrow has a rich set of data types, including nested and user-defined types. It also provides an in-memory Plasma Object Store [7] that lets different applications share data within the same node, and it uses Arrow Flight [8], an RPC framework, for high-performance data services based on Arrow data. All of these advantages make Apache Arrow very popular in Big Data analysis.
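To make the difference between row-oriented and column-oriented layouts concrete, here is a minimal sketch using only the Python standard library. It does not use Arrow and does not execute SIMD instructions; the records and field names (`id`, `energy`) are made up for illustration. It only shows why reading one field is a strided walk in a row layout and a contiguous scan in a column layout.

```python
from array import array

# Three records with fields (id, energy); values are made up for illustration.
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Row-oriented layout: the fields of each record are interleaved in memory.
row_buffer = array("d", [value for record in rows for value in record])

# Column-oriented layout: each field is stored in its own contiguous buffer.
columns = {
    "id": array("d", [record[0] for record in rows]),
    "energy": array("d", [record[1] for record in rows]),
}

# Summing one field in the row layout needs a strided walk over the buffer.
energy_sum_rows = sum(row_buffer[1::2])

# In the column layout the same field is one contiguous run of values,
# which is the access pattern that vectorized (SIMD) loops prefer.
energy_sum_columns = sum(columns["energy"])

print(list(row_buffer))
print(energy_sum_rows, energy_sum_columns)
```

Both sums are equal; what differs is how far apart the touched values sit in memory, which is what matters for vectorized processing and for column-wise access in general.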

ARROW VOL CONNECTOR DESIGN AND IMPLEMENTATION


As introduced above, Apache Arrow can create an in-memory store and is considered efficient for storing and managing streamed data. Accessing this data through the HDF5 API would allow applications to take advantage of transient, column-oriented data streams, such as real-time data from high-speed scientific instruments and cameras. Therefore, there is a need to support accessing Arrow data through HDF5 calls. Figure 1 shows where the Arrow VOL connector sits within the VOL. It is a terminal VOL connector, located at the last layer of the VOL connector stack. The Arrow VOL connector intercepts the relevant HDF5 I/O API calls, translates them into Apache Arrow API calls, and saves the data as Apache Arrow tables.



Currently, our Arrow VOL connector implements only a subset of the HDF5 API. Apache Arrow itself is not an engine or a storage system; it is only a columnar data representation format. Therefore, in our implementation, HDF5 files and groups are mapped to directories and corresponding sub-directories, while HDF5 datasets are mapped to Apache Arrow tables. In addition, because Apache Arrow does not support multiple processes writing parts of a single table, each MPI process writes its specified subset region as an Arrow table to the back-end storage (including a parallel file system and the Apache Arrow Plasma in-memory object store) or to an Arrow Flight server. Figure 2 presents the internal workflow of write and read operations in the Arrow VOL connector.

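The mapping described above (file to directory, group to sub-directory, dataset to one table per MPI process) can be sketched as a path computation. This is a minimal illustration, not the connector's actual layout: the function name `arrow_table_path`, the example paths, and in particular the `.rank<N>.arrow` file naming are assumptions made for this example, since the text only states that each process writes its own table.

```python
from pathlib import PurePosixPath


def arrow_table_path(root, h5_file, h5_object, rank):
    """Map an HDF5 dataset path to a per-rank table location.

    The ".rank<N>.arrow" suffix is a hypothetical naming scheme; the
    source only states that each MPI process writes its own table.
    """
    parts = [p for p in h5_object.split("/") if p]
    *groups, dataset = parts
    # HDF5 file -> directory, each HDF5 group -> nested sub-directory.
    directory = PurePosixPath(root, h5_file, *groups)
    # HDF5 dataset -> one table per MPI rank.
    return str(directory / f"{dataset}.rank{rank}.arrow")


for rank in range(2):
    print(arrow_table_path("/scratch/out", "vpic.h5", "/Timestep_0/x", rank))
```

Giving every rank its own table sidesteps the restriction that several processes cannot write parts of one Arrow table, at the cost of having to reassemble the pieces when a dataset is read back.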

Write Operation: Once the Arrow VOL connector intercepts an H5Dwrite() request, it creates an internal column-major buffer and fills it with the row-major input data. This step requires a row-major to column-major conversion. Both the metadata and the internal column-major buffer are kept in memory.
Close Operation: When the Arrow VOL connector intercepts an H5Dclose() request, it creates the corresponding Arrow table from the internal buffer and then flushes the metadata and the Arrow table to the selected back-end storage.
Read Operation: When it intercepts an H5Dread() request, the Arrow VOL connector first checks whether the requested data exists in the internal column-major buffer. If it does, the connector fills the output buffer directly from the internal buffer. If it does not, the connector first loads the data from the back-end storage into an Arrow table and fills the internal buffer from it, and then fills the output buffer from the internal column-major buffer. As with the write operation, a conversion is needed, here from column-major to row-major, whether or not the requested data was already in the internal buffer.
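The three operations above can be summarized in a toy model. The sketch below is an assumption-laden simplification in plain Python, not the connector's code: the class `ArrowVolSketch` is hypothetical, a dict stands in for the back-end storage, and lists of columns stand in for Arrow tables. It only illustrates the buffering logic and the two layout conversions.

```python
class ArrowVolSketch:
    """Toy model of the write, close and read paths described above."""

    def __init__(self, backend):
        self.backend = backend      # dict standing in for back-end storage
        self.buffers = {}           # dataset name -> list of columns

    def write(self, name, rows):
        # Row-major input -> internal column-major buffer (kept in memory).
        self.buffers[name] = [list(col) for col in zip(*rows)]

    def close(self, name):
        # Build the "table" from the buffer and flush it to the back end.
        self.backend[name] = self.buffers.pop(name)

    def read(self, name):
        if name not in self.buffers:
            # Miss: load the table from the back end into the buffer first.
            self.buffers[name] = [list(col) for col in self.backend[name]]
        # Column-major buffer -> row-major output, on hits and misses alike.
        return [tuple(row) for row in zip(*self.buffers[name])]


storage = {}
vol = ArrowVolSketch(storage)
vol.write("particles", [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
print(vol.buffers["particles"])   # columns held in memory
vol.close("particles")
print(vol.read("particles"))      # rows restored after a buffer miss
```

Note that in this model, as in the description, nothing reaches the back end until the close step, and every read pays for a column-major to row-major conversion.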

INITIAL RESULTS


Testbed: All tests were conducted on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC), a Cray XC40 system with 1630 Intel Xeon Haswell nodes. Each node has 32 CPU cores and 128 GB of memory. The supporting storage system is Lustre, a widely used parallel file system. It has 248 object storage targets (OSTs) and is shared by all users. Software used: The Arrow VOL connector implementation depends on HDF5 (v1.13.0) and Apache Arrow (v4.0.1). The MPICH version used for parallel I/O is 3.3.1 and the GCC version is 7.3.0.
Analysis: In our experiments, we evaluated write performance with VPIC-IO, the I/O kernel of a plasma-physics application, and read performance with the BD-CATS I/O kernel, which is used for analyzing data produced by particle simulations. All experiments were executed on 4 nodes with 128 processes. In VPIC-IO, each MPI process writes a region with a given number of particles (1M, 2M, or 4M, for example), and each particle has 8 properties. The particles are organized as a 1D array. Figure 3(a) shows the write performance with and without arrow-vol. The raw write rate with arrow-vol is about 9 GB/s, while the raw write rate with native HDF5 is only hundreds of megabytes per second. One reason is the different file system stripe count configuration for native HDF5 and arrow-vol. The other is that native HDF5 needs to flush data to the file system when executing H5Dwrite() operations, whereas the data stays in memory when using the Arrow VOL connector. Figure 3(b) shows the read performance with and without arrow-vol. Here the performance without arrow-vol far exceeds that with arrow-vol, probably because the test scale is small and part of the data has been cached in memory.
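A rough size estimate helps put the remark about the small test scale in context. The calculation below is a back-of-the-envelope sketch under two assumptions that are not stated in the text: each particle property is a 4-byte value, and "1M" means 2**20 particles. The process count, node count, property count and per-node memory come from the testbed description, with 128 GB treated as 128 GiB for simplicity.

```python
PROCESSES = 128            # from the text: 4 nodes, 128 processes
NODES = 4                  # from the text
PROPERTIES = 8             # from the text: 8 properties per particle
NODE_MEMORY_GIB = 128      # from the text: 128 GB per node, treated as GiB
BYTES_PER_PROPERTY = 4     # assumption: 4-byte values


def total_gib(particles_per_process):
    """Total dataset size across all processes, in GiB."""
    total_bytes = (PROCESSES * particles_per_process
                   * PROPERTIES * BYTES_PER_PROPERTY)
    return total_bytes / 2**30


for millions in (1, 2, 4):
    # Assumption: "1M" means 2**20 particles.
    size = total_gib(millions * 2**20)
    share = size / (NODES * NODE_MEMORY_GIB)
    print(f"{millions}M particles/process: {size:.0f} GiB total, "
          f"{share:.1%} of aggregate node memory")
```

Under these assumptions the whole dataset is a small fraction of the memory available on the four nodes, which is consistent with the suggestion that caching may have influenced the measured read rates.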

CONCLUSIONS

In this work, we designed and implemented an HDF5 VOL connector to Apache Arrow that enables science applications to access Apache Arrow data through native HDF5 calls without changing the original code. We also presented initial write and read performance results comparing the Arrow VOL connector with native HDF5. Although the performance is not yet very good, there is still a lot of room for optimization. Most importantly, we have verified that Apache Arrow can be integrated into HPC systems, which lays the foundation for our future work, such as the integration of HPC and Big Data analysis.

[1] “Apache Parquet.” [Online]. Available: https://parquet.apache.org/
[2] “Apache ORC.” [Online]. Available: https://orc.apache.org/
[3] “Apache Avro.” [Online]. Available: https://avro.apache.org/docs/current/
[4] “Apache Arrow.” [Online]. Available: https://arrow.apache.org/
[5] J. Chakraborty, I. Jimenez, S. A. Rodriguez, A. Uta, J. LeFevre, and C. Maltzahn, “Towards an Arrow-native storage system,” arXiv preprint arXiv:2105.09894, 2021.
[6] “HDF5 library.” [Online]. Available: https://portal.hdfgroup.org/display/HDF5/HDF5
[7] “Apache Arrow Plasma in-memory object store.” [Online]. Available: https://arrow.apache.org/docs/python/plasma.html
[8] “Apache Arrow Flight RPC framework.” [Online]. Available: https://arrow.apache.org/docs/format/Flight.html