上篇文章我们探索了Apache ORC的发展史、当前Hadoop适配情况以及支持的数据类型。今天我们来看一下如何使用Apache ORC。
三、在Hive中使用
Hive可以说是ORC格式及程度最好的软件了。下面让我们看看如何在Hive里面使用ORC和一些相关的配置。
Hive中的语法
在Hive里面,如果您是新建表,那么只需要在表后增加“STORED AS ORC”语句即可。例如下面的表:istari。
CREATE TABLE istari (
name STRING,
color STRING
) STORED AS ORC;
如果要对现有的表或者表分区修改格式为ORC,可以直接使用ALTER语法,格式如下:
ALTER TABLE istari SET FILEFORMAT ORC;
另外,从Hive 0.14开始,用户可以通过使用CONCATENATE命令语法来手动合并ORC小文件。执行该命令后,不必重新序列化文件就能在类型级别进行合并。以下是语法参考:
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
如果想要获取ORC文件信息,可以使用Hive的orcfiledump命令。如下:
% hive --orcfiledump <path_to_file>
从Hive 1.1开始,命令增加了-d参数。如下:
% hive --orcfiledump -d <path_to_file>
Hive中的相关配置
在Hive中,有一些表属性或者库级别属性可以影响到ORC的行为,下面以列表的形式进行介绍。
表属性
以ORC文件形式存储的表使用一些ORC表属性来控制存储行为。通过定义或修改这些表述行,用户可以确保所有客户端采用相同的设置进行数据存储。
属性 | 默认值 | 备注 |
orc.compress | zlib | high level compression = {NONE, ZLIB, SNAPPY} |
orc.compress.size | 262,144 | compression chunk size |
orc.stripe.size | 67,108,864 | memory buffer in bytes for writing |
orc.row.index.stride | 10,000 | number of rows between index entries |
orc.create.index | true | create indexes? |
orc.bloom.filter.columns | "" | comma separated list of column names |
orc.bloom.filter.fpp | 0.05 | bloom filter false positive rate |
下面这个例子展示了创建不带high level压缩的ORC表:
CREATE TABLE istari (
name STRING,
color STRING
) STORED AS ORC TBLPROPERTIES ("orc.compress"="NONE");
配置属性
在Hive的配置中,也有很多和ORC文件格式相关的配置项,下面我们通过表格的方式来统一查看一下:
属性 | 默认值 | 备注 |
hive.default.fileformat | TextFile | This is the default file format for new tables. If it is set to ORC, new tables will default to ORC. |
hive.stats.gather.num.threads | 10 | Number of threads used by partialscan/noscan analyze command for partitioned tables. This is applicable only for file formats that implement the StatsProvidingRecordReader interface (like ORC). |
hive.exec.orc.memory.pool | 0.5 | Maximum fraction of heap that can be used by ORC file writers. |
hive.exec.orc.write.format | NULL | Define the version of the file to write. Possible values are 0.11 and 0.12. If this parameter is not defined, ORC will use the latest version. |
hive.exec.orc.default.stripe.size | 67,108,864 | Define the default size of ORC writer buffers in bytes. |
hive.exec.orc.default.block.size | 268,435,456 | Define the default file system block size for ORC files. |
hive.exec.orc.dictionary.key.size.threshold | 0.8 | If the number of keys in a dictionary is greater than this fraction of the total number of non-null rows, turn off dictionary encoding. Use 1.0 to always use dictionary encoding. |
hive.exec.orc.default.row.index.stride | 10,000 | Define the default number of rows between row index entries. |
hive.exec.orc.default.buffer.size | 262,144 | Define the default ORC buffer size, in bytes. |
hive.exec.orc.default.block.padding | true | true Should ORC file writers pad stripes to minimize stripes that cross HDFS block boundaries. |
hive.exec.orc.block.padding.tolerance | 0.05 | Define the tolerance for block padding as a decimal fraction of stripe size (for example, the default value 0.05 is 5% of the stripe size). For the defaults of 64Mb ORC stripe and 256Mb HDFS blocks, a maximum of 3.2Mb will be reserved for padding within the 256Mb block with the default hive.exec.orc.block.padding.tolerance. In that case, if the available size within the block is more than 3.2Mb, a new smaller stripe will be inserted to fit within that space. This will make sure that no stripe written will cross block boundaries and cause remote reads within a node local task. |
hive.exec.orc.default.compress | ZLIB | Define the default compression codec for ORC file. |
hive.exec.orc.encoding.strategy | SPEED | Define the encoding strategy to use while writing data. Changing this will only affect the light weight encoding for integers. This flag will not change the compression level of higher level compression codec (like ZLIB). Possible options are SPEED and COMPRESSION. |
hive.orc.splits.include.file.footer | false | If turned on, splits generated by ORC will include metadata about the stripes in the file. This data is read remotely (from the client or HiveServer2 machine) and sent to all the tasks. |
hive.orc.cache.stripe.details.size | 10,000 | Cache size for keeping meta information about ORC splits cached in the client. |
hive.orc.compute.splits.num.threads | 10 | How many threads ORC should use to create splits in parallel. |
hive.exec.orc.skip.corrupt.data | false | If ORC reader encounters corrupt data, this value will be used to determine whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception. |
hive.exec.orc.zerocopy | false | Use zerocopy reads with ORC. (This requires Hadoop 2.3 or later.) |
hive.merge.orcfile.stripe.level | true | When hive.merge.mapfiles, hive.merge.mapredfiles or hive.merge.tezfiles is enabled while writing a table with ORC file format, enabling this configuration property will do stripe-level fast merge for small ORC files. Note that enabling this configuration property will not honor the padding tolerance configuration (hive.exec.orc.block.padding.tolerance). |
hive.merge.orcfile.stripe.level | true | When hive.merge.mapfiles, hive.merge.mapredfiles or hive.merge.tezfiles is enabled while writing a table with ORC file format, enabling this configuration property will do stripe-level fast merge for small ORC files. Note that enabling this configuration property will not honor the padding tolerance configuration (hive.exec.orc.block.padding.tolerance). |
hive.orc.row.index.stride.dictionary.check | true | If enabled dictionary check will happen after first row index stride (default 10000 rows) else dictionary check will happen before writing first stripe. In both cases, the decision to use dictionary or not will be retained thereafter. |
hive.exec.orc.compression.strategy | SPEED | Define the compression strategy to use while writing data. This changes the compression level of higher level compression codec. Value can be SPEED or COMPRESSION. |
orc.write.variable.length.blocks | false | Should the ORC writer use HDFS variable length blocks, if they are available? If the new stripe would straddle a block, Hadoop is ≥ 2.7, and this is enabled, it will end the block before the new stripe. |
四、在Python中使用
在Python开发中,如果要使用ORC,可以使用Apache Arrow项目的PyArrow包,或者Dask包,以下是如何安装这两个包并展示一个使用例子:
PyArrow包安装和使用
安装语法:
pip3 install pyarrow==7.0.0
pip3 install pandas
读写ORC文件的例子:
In [1]: import pandas as pd
In [2]: import pyarrow as pa
In [3]: from pyarrow import orc
In [4]: orc.write_table(pa.table({"col1": [1, 2, 3]}), "test.orc")
In [5]: orc.read_table("test.orc").to_pandas()
Out[5]:
col1
0 1
1 2
2 3
Dask包安装和使用
安装语法:
pip3 install "dask[dataframe]==2022.2.0"
pip3 install pandas
读写ORC文件:
In [1]: import pandas as pd
In [2]: import dask.dataframe as dd
In [3]: pf = pd.DataFrame(data={"col1": [1, 2, 3]})
In [4]: dd.to_orc(dd.from_pandas(pf, npartitions=2), path="/tmp/orc")
Out[4]: (None,)
In [5]: dd.read_orc(path="/tmp/orc").compute()
Out[5]:
col1
0 1
1 2
2 3
五、在Spark中的使用
Apache Spark对ORC的集成也很好,下面就让我们看看在Spark里面如何使用ORC和一些相关的配置吧。
Spark中的语法
在Spark的建表语句中,你可以少写几个字母,只需要在语句最后增加USING ORC
即可:
CREATE TABLE istari (
name STRING,
color STRING
) USING ORC;
如果想获取ORC文件的信息,可以使用orc-tools
命令,如下:
% orc-tools meta <path_to_file>
如果要现实ORC文件的数据,使用如下命令:
% orc-tools data <path_to_file>
Spark中的相关配置
在Spark中与Hive中一样,也有一些表属性或者库级别属性可以影响到ORC的行为,下面以列表的形式进行介绍。
表属性
以ORC文件形式存储的表使用一些ORC表属性来控制存储行为。通过定义或修改这些表述行,用户可以确保所有客户端采用相同的设置进行数据存储。
属性 | 默认值 | 备注 |
orc.compress | ZLIB | high level compression = {NONE, ZLIB, SNAPPY, ZSTD} |
orc.compress.size | 262,144 | compression chunk size |
orc.stripe.size | 67,108,864 | memory buffer in bytes for writing |
orc.row.index.stride | 10,000 | number of rows between index entries |
orc.create.index | true | create indexes? |
orc.bloom.filter.columns | ”” | comma separated list of column names |
orc.bloom.filter.fpp | 0.05 | bloom filter false positive rate |
orc.key.provider | “hadoop” | key provider |
orc.encrypt | ”” | list of keys and columns to encrypt with |
orc.mask | ”” | masks to apply to the encrypted columns |
下面是一个带配置参数的ORC表的使用示例:
CREATE TABLE encrypted (
ssn STRING,
email STRING,
name STRING
)
USING ORC
OPTIONS (
hadoop.security.key.provider.path "kms://http@localhost:9600/kms",
orc.key.provider "hadoop",
orc.encrypt "pii:ssn,email",
orc.mask "nullify:ssn;sha256:email"
);
配置属性
在Spark的配置中,也有很多和ORC文件格式相关的配置项,下面我们通过表格的方式来统一查看一下:
属性 | 默认值 | 备注 |
spark.sql.orc.impl | native | The name of ORC implementation. It can be one of native or hive. native means the native ORC support. hive means the ORC library in Hive. |
spark.sql.orc.enableVectorizedReader | true | Enables vectorized orc decoding in native implementation. |
spark.sql.orc.mergeSchema | false | When true, the ORC data source merges schemas collected from all data files, otherwise the schema is picked from a random data file. |
spark.sql.hive.convertMetastoreOrc | true | Spark SQL will use the Hive SerDe for ORC tables instead of the built-in support. |