In the previous article, we explored the history of Apache ORC, its current Hadoop compatibility, and the data types it supports. Today, let's look at how to actually use Apache ORC.

3. Using ORC in Hive

Hive is arguably the software with the best ORC integration. Let's look at how to use ORC in Hive, along with the related configuration.

Syntax in Hive

In Hive, creating a new table stored as ORC only requires adding "STORED AS ORC" to the CREATE TABLE statement. For example, the table istari below:

CREATE TABLE istari (
  name STRING,
  color STRING
) STORED AS ORC;

To switch an existing table or table partition to the ORC format, use ALTER TABLE:

ALTER TABLE istari SET FILEFORMAT ORC;

In addition, starting with Hive 0.14, users can manually merge small ORC files with the CONCATENATE command. The merge happens at the stripe level, so the files do not need to be decompressed and re-serialized. The syntax is:

ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
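For example, on a hypothetical partitioned variant of the table (the name istari_part and its color partition column are illustrative, not part of the example above), a single partition could be compacted like this:

ALTER TABLE istari_part PARTITION (color = 'grey') CONCATENATE;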

To inspect the metadata of an ORC file, use Hive's orcfiledump command:

% hive --orcfiledump <path_to_file>

Starting with Hive 1.1, the command also supports a -d flag, which dumps the file's data:

% hive --orcfiledump -d <path_to_file>

Hive configuration

In Hive, a number of table properties and session-level configuration properties affect ORC's behavior. They are listed below.

Table properties

Tables stored as ORC files use table properties to control their storage behavior. By defining or modifying these table properties, users can ensure that all clients store data with the same options.

| Property | Default | Description |
| --- | --- | --- |
| orc.compress | ZLIB | high-level compression codec: one of NONE, ZLIB, SNAPPY |
| orc.compress.size | 262,144 | number of bytes in each compression chunk |
| orc.stripe.size | 67,108,864 | memory buffer, in bytes, used for writing |
| orc.row.index.stride | 10,000 | number of rows between index entries |
| orc.create.index | true | whether to create row indexes |
| orc.bloom.filter.columns | "" | comma-separated list of column names for which bloom filters should be created |
| orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filters |

The following example creates an ORC table without high-level compression:

CREATE TABLE istari (
  name STRING,
  color STRING
) STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE");
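The bloom filter properties from the table above can be combined in the same TBLPROPERTIES clause. The following is a sketch; the choice of SNAPPY and of name as the bloom filter column is illustrative:

CREATE TABLE istari_bf (
  name STRING,
  color STRING
) STORED AS ORC TBLPROPERTIES (
  "orc.compress" = "SNAPPY",
  "orc.bloom.filter.columns" = "name",
  "orc.bloom.filter.fpp" = "0.05"
);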

Configuration properties

Hive's configuration also includes many properties related to the ORC file format. The table below lists them:

| Property | Default | Description |
| --- | --- | --- |
| hive.default.fileformat | TextFile | Default file format for new tables. If set to ORC, new tables will default to ORC. |
| hive.stats.gather.num.threads | 10 | Number of threads used by the partialscan/noscan analyze command for partitioned tables. Applies only to file formats that implement the StatsProvidingRecordReader interface (such as ORC). |
| hive.exec.orc.memory.pool | 0.5 | Maximum fraction of heap that can be used by ORC file writers. |
| hive.exec.orc.write.format | NULL | Version of the file to write. Possible values are 0.11 and 0.12. If not set, ORC uses the latest version. |
| hive.exec.orc.default.stripe.size | 67,108,864 | Default size of ORC writer buffers, in bytes. |
| hive.exec.orc.default.block.size | 268,435,456 | Default file system block size for ORC files. |
| hive.exec.orc.dictionary.key.size.threshold | 0.8 | If the number of keys in a dictionary exceeds this fraction of the total number of non-null rows, dictionary encoding is turned off. Use 1.0 to always use dictionary encoding. |
| hive.exec.orc.default.row.index.stride | 10,000 | Default number of rows between row index entries. |
| hive.exec.orc.default.buffer.size | 262,144 | Default ORC buffer size, in bytes. |
| hive.exec.orc.default.block.padding | true | Should ORC file writers pad stripes to minimize stripes that cross HDFS block boundaries? |
| hive.exec.orc.block.padding.tolerance | 0.05 | Tolerance for block padding as a decimal fraction of stripe size (the default value 0.05 is 5% of the stripe size). With the default 64 MB ORC stripes and 256 MB HDFS blocks, at most 3.2 MB is reserved for padding within each 256 MB block. In that case, if the space left in a block exceeds 3.2 MB, a new, smaller stripe is inserted to fit within that space. This ensures no written stripe crosses a block boundary and causes remote reads within a node-local task. |
| hive.exec.orc.default.compress | ZLIB | Default compression codec for ORC files. |
| hive.exec.orc.encoding.strategy | SPEED | Encoding strategy used while writing data. Changing this affects only the lightweight encoding for integers; it does not change the compression level of the higher-level codec (such as ZLIB). Possible values: SPEED, COMPRESSION. |
| hive.orc.splits.include.file.footer | false | If enabled, splits generated by ORC include metadata about the stripes in the file. This data is read remotely (from the client or HiveServer2 machine) and sent to all tasks. |
| hive.orc.cache.stripe.details.size | 10,000 | Cache size for keeping meta information about ORC splits cached in the client. |
| hive.orc.compute.splits.num.threads | 10 | Number of threads ORC should use to create splits in parallel. |
| hive.exec.orc.skip.corrupt.data | false | If the ORC reader encounters corrupt data, this value determines whether to skip it or throw an exception. The default behavior is to throw an exception. |
| hive.exec.orc.zerocopy | false | Use zero-copy reads with ORC (requires Hadoop 2.3 or later). |
| hive.merge.orcfile.stripe.level | true | When hive.merge.mapfiles, hive.merge.mapredfiles, or hive.merge.tezfiles is enabled while writing a table in ORC format, enabling this property performs a stripe-level fast merge of small ORC files. Note that enabling it will not honor the padding tolerance setting (hive.exec.orc.block.padding.tolerance). |
| hive.orc.row.index.stride.dictionary.check | true | If enabled, the dictionary check happens after the first row index stride (default 10,000 rows); otherwise it happens before the first stripe is written. In both cases, the decision whether to use a dictionary is retained thereafter. |
| hive.exec.orc.compression.strategy | SPEED | Compression strategy used while writing data. This changes the compression level of the higher-level codec. Possible values: SPEED, COMPRESSION. |
| orc.write.variable.length.blocks | false | Should the ORC writer use HDFS variable-length blocks, if available? If a new stripe would straddle a block, Hadoop is 2.7 or later, and this is enabled, the writer ends the block before the new stripe. |
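Most of these properties can also be overridden for the current session with SET before writing a table. A minimal sketch (the chosen values are illustrative):

SET hive.exec.orc.default.compress=SNAPPY;
SET hive.exec.orc.default.stripe.size=134217728;

CREATE TABLE istari_snappy (
  name STRING,
  color STRING
) STORED AS ORC;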

4. Using ORC in Python

In Python, you can work with ORC through the PyArrow package from the Apache Arrow project, or through the Dask package. Here is how to install each, followed by a usage example:

Installing and using PyArrow

Installation:

pip3 install pyarrow==7.0.0
pip3 install pandas

Example of reading and writing an ORC file:

In [1]: import pandas as pd

In [2]: import pyarrow as pa

In [3]: from pyarrow import orc

In [4]: orc.write_table(pa.table({"col1": [1, 2, 3]}), "test.orc")

In [5]: orc.read_table("test.orc").to_pandas()
Out[5]:
   col1
0     1
1     2
2     3

Installing and using Dask

Installation:

pip3 install "dask[dataframe]==2022.2.0"
pip3 install pandas

Reading and writing ORC files:

In [1]: import pandas as pd

In [2]: import dask.dataframe as dd

In [3]: pf = pd.DataFrame(data={"col1": [1, 2, 3]})

In [4]: dd.to_orc(dd.from_pandas(pf, npartitions=2), path="/tmp/orc")
Out[4]: (None,)

In [5]: dd.read_orc(path="/tmp/orc").compute()
Out[5]:
   col1
0     1
1     2
2     3

5. Using ORC in Spark

Apache Spark also integrates well with ORC. Let's look at how to use ORC in Spark and the related configuration.

Syntax in Spark

In Spark's CREATE TABLE statement you can save a few keystrokes: just append USING ORC at the end:

CREATE TABLE istari (
  name STRING,
  color STRING
) USING ORC;
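Writer options such as the compression codec can also be passed through an OPTIONS clause (the ORC table properties are covered in the next section). A sketch, assuming a Spark version whose ORC writer supports ZSTD:

CREATE TABLE istari_zstd (
  name STRING,
  color STRING
) USING ORC OPTIONS (orc.compress "ZSTD");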

To inspect the metadata of an ORC file, use the orc-tools command:

% orc-tools meta <path_to_file>

To display the data in an ORC file, use:

% orc-tools data <path_to_file>

Spark configuration

As in Hive, Spark has table properties and session-level configuration properties that affect ORC's behavior. They are listed below.

Table properties

Tables stored as ORC files use table properties to control their storage behavior. By defining or modifying these table properties, users can ensure that all clients store data with the same options.

| Property | Default | Description |
| --- | --- | --- |
| orc.compress | ZLIB | high-level compression codec: one of NONE, ZLIB, SNAPPY, ZSTD |
| orc.compress.size | 262,144 | number of bytes in each compression chunk |
| orc.stripe.size | 67,108,864 | memory buffer, in bytes, used for writing |
| orc.row.index.stride | 10,000 | number of rows between index entries |
| orc.create.index | true | whether to create row indexes |
| orc.bloom.filter.columns | "" | comma-separated list of column names for which bloom filters should be created |
| orc.bloom.filter.fpp | 0.05 | false positive probability for bloom filters |
| orc.key.provider | "hadoop" | key provider for column encryption |
| orc.encrypt | "" | list of keys and the columns to encrypt with them |
| orc.mask | "" | masks to apply to the encrypted columns |

Here is an example of creating an ORC table with these options (column encryption and masking):

CREATE TABLE encrypted (
  ssn STRING,
  email STRING,
  name STRING
)
USING ORC
OPTIONS (
  hadoop.security.key.provider.path "kms://http@localhost:9600/kms",
  orc.key.provider "hadoop",
  orc.encrypt "pii:ssn,email",
  orc.mask "nullify:ssn;sha256:email"
);

Configuration properties

Spark's configuration also includes several properties related to the ORC file format. The table below lists them:

| Property | Default | Description |
| --- | --- | --- |
| spark.sql.orc.impl | native | The ORC implementation to use: native (the native ORC support) or hive (the ORC library in Hive). |
| spark.sql.orc.enableVectorizedReader | true | Enables vectorized ORC decoding in the native implementation. |
| spark.sql.orc.mergeSchema | false | When true, the ORC data source merges schemas collected from all data files; otherwise the schema is picked from a random data file. |
| spark.sql.hive.convertMetastoreOrc | true | When true, Spark SQL uses its built-in ORC reader and writer for ORC tables created with HiveQL syntax, instead of the Hive SerDe. |
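These properties can also be changed for the current session, for example directly in Spark SQL with SET. A minimal sketch (the values shown are simply the defaults):

SET spark.sql.orc.impl=native;
SET spark.sql.orc.enableVectorizedReader=true;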