Apache Arrow 简介

原创

禅与计算机程序设计艺术 2022-06-07 22:57:00 ©著作权

文章标签 python java 大数据编程语言数据库 文章分类 大数据

©著作权归作者所有：来自51CTO博客作者禅与计算机程序设计艺术的原创作品，请联系作者获取转载授权，否则将追究法律责任

Apache Arrow 简介_数据库

arrow主要focus在帮助 data 序列化, 以便在各种system之间transfer.
arrorw还解决了类型共享计算格式不统一的问题，是高性能计算的基础.

背景

https://arrow.apache.org/

Apache Arrow 简介_python_02

由于历史原因，Snowflake一直使用了JSON作为结果集(ResultSet)的序列化方式，引起了许多问题。首先，JSON的序列化/反序列化的成本实在是太高了：许多cpu cycle都被浪费在了字符串和其他数据类型之间的转换。
不仅仅是cpu，内存的消耗也是十分巨大的，尤其像是Java这样的语言，对内存的压力非常大。其次，使用JSON进行序列化，会导致某些数据类型(浮点数)的精度丢失。

经过一系列的研究，我们最终决定采用Apache Arrow作为我们新的结果集序列化方式。这篇文章对arrow进行了一些简单的介绍，并且反思了arrow想解决的一些问题。

Apache Arrow是什么

数据格式：arrow 定义了一种在内存中表示tabular data的格式。这种格式特别为数据分析型操作(analytical operation)进行了优化。比如说列式格式(columnar format)，能充分利用现代cpu的优势，进行向量化计算(vectorization)。不仅如此，Arrow还定义了IPC格式，序列化内存中的数据，进行网络传输，或者把数据以文件的方式持久化。
开发库：arrow定义的格式是与语言无关的，所以任何语言都能实现Arrow定义的格式。arrow项目为几乎所有的主流编程语言提供了SDK

说到这里，大家大概都明白了arrow其实和protobuf很像，只不过protobuf是为了structured data提供内存表示方式和序列化方案。可是两者的使用场景却很不一样。protobuf主要是序列化structured data，有很多的键值对和非常深的nested structure。arrow序列化的对象主要还是表格状数据。

What is Arrow?

Format

Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. The Arrow memory format also supports zero-copy reads for lightning-fast data access without serialization overhead.

Learn more about the design or read the specification.

Libraries

Arrow's libraries implement the format and provide building blocks for a range of use cases, including high performance analytics. Many popular projects use Arrow to ship columnar data efficiently or as the basis for analytic engines.

Libraries are available for C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. See how to install and get started.

Ecosystem

Apache Arrow is software created by and for the developer community. We are dedicated to open, kind communication and consensus decisionmaking. Our committers come from a range of organizations and backgrounds, and we welcome all to participate with us.

Learn more about how you can ask questions and get involved in the Arrow project.

内存表示: Arrow Columnar Format 列式存储

The columnar format has some key features:

Data adjacency for sequential access (scans)

O(1) (constant-time) random access

SIMD and vectorization-friendly

Relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory

The Arrow columnar format provides analytical performance and data locality guarantees in exchange for comparatively more expensive mutation operations. This document is concerned only with in-memory data representation and serialization details; issues such as coordinating mutation of data structures are left to be handled by implementations.

arrow在内存中表示数据的最基本单元是array，它代表了一连串长度已知、类型相同的数据。而多个长度相同、类型相同或者不同的array就可以用来表示结果集(或者一部分的结果集)。

举一个简单的例子：一个如下图所示的结果集(或者table)

+------+------+                                 
|   C1 |   C2 |            [
|------+------|              DoubleArray: [ 1.11, 2.22, 3.33],
| 1.11 |  foo |     ->       StringArray: [ foo, bar, NULL]
| 2.22 |  bar |            ]
| 3.33 | NULL |   
+------+------+

就可以表示成一个大小为2的有序集合，集合中的array(DoubleArray 和 StringArray)长度为3。arrow限制了array的最大长度，当结果集(或者表)的大小超过了array的最大长度，就需要把结果集水平切分成多个有序集合。

接一下来我们具体来看一下array，arrow是这样定义一个array的：

逻辑类型(比如 int32 或者 timestamp)
一串buffer(用来存放具体的数据和表示NULL值)
array长度
array中NULL值的数量
dictionary (用于dictionary encoding，比较适用于有很多重复数据的array，相当于一个压缩算法，不是必需的)

举一个具体的例子，一个如下的Int32Array：

[1, null, 2, 4, 8]

会被表示成

* Length: 5, Null count: 1
* Validity bitmap buffer:

  |Byte 0 (validity bitmap) | Bytes 1-63            |
  |-------------------------|-----------------------|
  | 00011101                | 0 (padding)           |

* Value Buffer:

  |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
  |------------|-------------|-------------|-------------|-------------|-------------|
  | 1          | unspecified | 2           | 4           | 8           | unspecified |

总结一下：

具体的值存放在value buffer中，相对应的值如果是NULL的话，value buffer中的bytes可以是任意的值。
NULL值是用bitmap来表示，bit 0 表示NULL，bit 1表示非NULL值，value buffer中的值是有意义的。如果array中没有NULL值，这个bitmap也可以省略。
allocate memory on aligned addresses：每次分配内存的大小总是8或者64的倍数。注意我们仅仅需要一个字节就能表示这个array的NULL值，value buffer也仅仅需要20个字节，但是arrow为每个buffer都分配64个字节的内存大小。主要原因是便于编译器生成SIMD指令，进行向量化运算。网上有很多关于向量化运算的文章，有兴趣的小伙伴可以自行搜索一下。

Fixed-Size Primitive Type Array (e.g. Int32Array)是最简单的情况(这里也没有考虑dictionary encoding的情况)，StringArray 或者其他nested type array 的情况会更加复杂一些，具体的可以参见这里。

序列化与进程间通信(IPC)

之前已经提到了，多个长度相同的array组成的有序集合可以用来表示结果集的子集(或者部分的表)，arrow称这个有序集合为Record Batch。Record Batch也是序列化的基本单元。arrow定义了一个传输协议，能把多个record batch序列化成一个二进制的字节流，并且把这些字节流反序列化成record batch，从让数据能在不同的进程之间进行交换。

字节流由一连串的message组成，arrow定义了多种message type，主要是schema message和record batch message。一个schema message和多个record batch message就能完整的表示一个结果集(或者一个表)。message的format如下：

<continuation: 0xFFFFFFFF>
<metadata_size: int32>
<metadata_flatbuffer: bytes>
<padding>
<message body>

continuation indicator：8个字节，永远是0xFFFFFFFF，官方文档称是为了解决flatbuffer的alignment要求。
metadata_size：8个字节，保存了整个message的metadata序列化之后的字节数。
metadata_flatbufffer：metadata序列化之后的字节，arrow使用了flatbuffer对metadata进行了序列化，具体定义可在Message.fbs找到。
padding：padding data 使当前数据量是8的倍数
message body：schema message没有，record batch message 有。直接把内存中arrow array 的 value buffer 和 bitmap buffer 写入这里。

根据这些message，arrow定义了IPC Streaming Format，定义如下：

<SCHEMA MESSAGE>
<RECORD BATCH MESSAGE 0>
...
<RECORD BATCH MESSAGE n - 1>
<EOS [optional]: 0xFFFFFFFF 0x00000000>

由于所有record batch都有一样的schema，所以只需要序列化一个schema message。在反序列化的时候，根据schema message，就能重建所有的record batch。(这里并没有讨论dictionary encoding的情况)

反思

在传统的编程世界中，数据只存放与oltp database中(比如说MySQL)，application通过JDBC或者ODBC等标准接口和数据库进行交互。

然而虽然现在的互联网世界数据的爆炸，数据的使用场景也越来越复杂。arrow适用的场景可能有一下几个：

同一个系统，多个节点：由于云计算的普及，数据库上云也得到了越来越多的关注。在一个分布式数据库的实现中，可能会有许多的query executor节点并行产生结果集。arrow的格式可以让客户端并行读取各个节点产生的结果集。
多个系统可能会同时读取同一份数据：企业可能会需要data warehouse生成报表，需要spark做一些机器学习。为了能让不同的系统之间进行数据的交互，企业经常把数据以文件的形式存放于一些分布式的文件系统(AWS S3)之上。