Spark SQL and Hive: Is Spark SQL Compatible with Hive?

Reposted

mob6454cc6ba5a5 2024-01-11 19:48:46


Table of Contents

Compatibility with Apache Hive

Deploying in Existing Hive Warehouses

Supported Hive Features
Unsupported Hive Functionality
Incompatible Hive UDF

Compatibility with Apache Hive

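Spark SQL is designed to be compatible with the Hive Metastore, SerDes, and UDFs. Currently, the Hive SerDes and UDFs are based on Hive 1.2.1, and Spark SQL can connect to different versions of the Hive Metastore (from 0.12.0 to 2.1.1; see also Interacting with Different Versions of Hive Metastore).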
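
As a rough sketch of how a session might be pointed at a particular metastore version: `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` are Spark SQL configuration keys, while the helper names `metastore_conf` and `build_session`, the application name, and the chosen values are illustrative assumptions. PySpark is only needed when `build_session` is actually called.

```python
def metastore_conf(version="1.2.1", jars="builtin"):
    """Return Spark SQL settings that select the Hive Metastore client version.

    "builtin" only makes sense when `version` matches the Hive version bundled
    with Spark; "maven" asks Spark to download matching client jars instead.
    """
    return {
        "spark.sql.hive.metastore.version": version,
        "spark.sql.hive.metastore.jars": jars,
    }


def build_session(app_name, conf):
    """Create a Hive-enabled SparkSession with the given settings applied."""
    from pyspark.sql import SparkSession  # imported lazily; requires PySpark

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical usage against a 2.1.1 metastore:
#   spark = build_session("hive-compat-demo", metastore_conf("2.1.1", "maven"))
#   spark.sql("SHOW DATABASES").show()
```

Keeping the settings in a plain dictionary makes it easy to inspect or log which metastore version a job expects before any cluster resources are requested.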

Deploying in Existing Hive Warehouses

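The Spark SQL Thrift JDBC server is designed to be "out of the box" compatible with existing Hive installations. You do not need to modify your existing Hive Metastore or change the data placement or partitioning of your tables.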
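
For illustration, a client could reach the Thrift JDBC server through `beeline` much as it would reach HiveServer2. The sketch below assumes the server was started with `./sbin/start-thriftserver.sh` and listens on `localhost:10000`; the helper names, the `default` database, and the sample query are hypothetical.

```python
import subprocess


def hive2_jdbc_url(host="localhost", port=10000, database="default"):
    """Build a HiveServer2-style JDBC URL for the Thrift JDBC server."""
    return f"jdbc:hive2://{host}:{port}/{database}"


def beeline_command(url, query, beeline="beeline"):
    """Return the argument list for running one query through beeline."""
    return [beeline, "-u", url, "-e", query]


def run_query(url, query):
    """Run the query via beeline and return its standard output.

    Assumes a `beeline` executable is on the PATH and the server is running.
    """
    result = subprocess.run(
        beeline_command(url, query),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical usage:
#   print(run_query(hive2_jdbc_url(), "SHOW TABLES"))
```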

Supported Hive Features

Spark SQL supports the vast majority of Hive features, such as:

Hive query statements, including:

SELECT
GROUP BY
ORDER BY
CLUSTER BY
SORT BY

All Hive operators, including:

Relational operators (=, <=>, ==, <>, <, >, >=, <=, etc)
Arithmetic operators (+, -, *, /, %, etc)
Logical operators (AND, &&, OR, ||, etc)
Complex type constructors
Mathematical functions (sign, ln, cos, etc)
String functions (instr, length, printf, etc)

User defined functions (UDF)
User defined aggregation functions (UDAF)
User defined serialization formats (SerDes)
Window functions
Joins

JOIN
{LEFT|RIGHT|FULL} OUTER JOIN
LEFT SEMI JOIN
CROSS JOIN

Unions
Sub-queries

SELECT col FROM ( SELECT a + b AS col FROM t1) t2

Sampling
Explain
Partitioned tables including dynamic partition insertion
View
All Hive DDL Functions, including:

CREATE TABLE
CREATE TABLE AS SELECT
ALTER TABLE

Most Hive Data types, including:

TINYINT
SMALLINT
INT
BIGINT
BOOLEAN
FLOAT
DOUBLE
STRING
BINARY
TIMESTAMP
DATE
ARRAY<>
MAP<>
STRUCT<>
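
To show how several of the supported features listed above might look in practice, here is a minimal sketch that runs Hive-style statements through a Hive-enabled `SparkSession`. The table names (`sales_raw`, `sales_by_day`, `featured_items`), the column names, and the helper `run_all` are hypothetical; the statements cover DDL, dynamic partition insertion, a window function, and a LEFT SEMI JOIN.

```python
STATEMENTS = [
    # DDL: a partitioned table.
    "CREATE TABLE IF NOT EXISTS sales_by_day (item STRING, amount DOUBLE) "
    "PARTITIONED BY (day STRING)",
    # Dynamic partition insertion: the partition value comes from the data.
    "SET hive.exec.dynamic.partition.mode=nonstrict",
    "INSERT INTO TABLE sales_by_day PARTITION (day) "
    "SELECT item, amount, day FROM sales_raw",
    # Window function: rank items by amount within each day.
    "SELECT item, amount, "
    "rank() OVER (PARTITION BY day ORDER BY amount DESC) AS rnk "
    "FROM sales_by_day",
    # LEFT SEMI JOIN: keep only items that appear in featured_items.
    "SELECT s.item FROM sales_by_day s "
    "LEFT SEMI JOIN featured_items f ON s.item = f.item",
]


def run_all(spark, statements):
    """Run each statement with spark.sql and return the results in order."""
    return [spark.sql(statement) for statement in statements]


# Hypothetical usage with an existing Hive-enabled session `spark`:
#   results = run_all(spark, STATEMENTS)
#   results[-1].show()
```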

Unsupported Hive Functionality

Below is a list of Hive features that are not supported yet. Most of these features are rarely used in Hive deployments.

Major Hive Features

Tables with buckets: a bucket is the hash partitioning within a Hive table partition. Spark SQL does not support buckets yet.
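
To make "hash partitioning within a partition" concrete, the following sketch shows one common description of how classic Hive bucketing assigns an integer key to a bucket: the hash is masked to a non-negative 32-bit value and taken modulo the bucket count. It assumes the hash of an integer column is the value itself, as in older Hive releases; other column types and newer Hive versions may hash differently, and the function names are hypothetical.

```python
INT_MAX = 0x7FFFFFFF  # mask that clears the sign bit of a 32-bit hash


def bucket_for_int(value, num_buckets):
    """Return the bucket index for an integer key (assumed hash == value)."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return (value & INT_MAX) % num_buckets


def assign_buckets(keys, num_buckets):
    """Group keys by bucket index, mimicking one bucket file per index."""
    buckets = {index: [] for index in range(num_buckets)}
    for key in keys:
        buckets[bucket_for_int(key, num_buckets)].append(key)
    return buckets


# Example: four buckets within a single table partition.
layout = assign_buckets([10, 11, 12, 13, 14], 4)
```

Rows with the same key always land in the same bucket, which is what lets Hive exploit bucketing for sampling and joins.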

Esoteric Hive Features

UNION type
Unique join
Column statistics collecting: Spark SQL does not currently piggyback on scans to collect column statistics; it only supports populating the sizeInBytes field of the Hive Metastore.

Hive Input/Output Formats

File format for CLI: for results shown back to the CLI, Spark SQL supports only TextOutputFormat.
Hadoop archive

Hive Optimizations

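A handful of Hive optimizations are not yet included in Spark. Some of them (such as indexes) are less important because of Spark SQL's in-memory computational model. Others are slated for future releases of Spark SQL.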

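Block-level bitmap indexes and virtual columns (used to build indexes)
Automatically determine the number of reducers for joins and groupbys: currently, in Spark SQL, you need to control the degree of post-shuffle parallelism using "SET spark.sql.shuffle.partitions=[num_tasks];".
Metadata-only query: for queries that can be answered using only metadata, Spark SQL still launches tasks to compute the result.
Skew data flag: Spark SQL does not follow the skew data flags in Hive.
STREAMTABLE hint in join: Spark SQL does not follow the STREAMTABLE hint.
Merge multiple small files for query results: if the result output contains multiple small files, Hive can optionally merge them into fewer large files to avoid overflowing the HDFS metadata. Spark SQL does not support this.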
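
Two of the items above have simple manual workarounds, sketched here under stated assumptions: post-shuffle parallelism is set explicitly through `spark.sql.shuffle.partitions`, and the number of output files is reduced by coalescing the result before writing. The helper names, the query, and the target table are hypothetical, and `coalesce` is a substitute technique, not Hive's small-file merge.

```python
def shuffle_partitions_statement(num_tasks):
    """Build the SET statement that fixes post-shuffle parallelism."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    return f"SET spark.sql.shuffle.partitions={num_tasks}"


def write_with_fewer_files(spark, query, target_table, num_tasks, num_files):
    """Run a query with explicit parallelism and write few output files.

    `spark` is assumed to be an existing Hive-enabled SparkSession.
    """
    spark.sql(shuffle_partitions_statement(num_tasks))
    result = spark.sql(query)
    # coalesce() narrows the result to num_files partitions without a full
    # shuffle, so the write produces at most that many files.
    result.coalesce(num_files).write.mode("overwrite").saveAsTable(target_table)


# Hypothetical usage:
#   write_with_fewer_files(
#       spark,
#       "SELECT day, count(*) AS n FROM sales_by_day GROUP BY day",
#       "daily_counts",
#       num_tasks=200,
#       num_files=4,
#   )
```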

Hive UDF/UDTF/UDAF

Spark SQL does not support all of the Hive UDF/UDTF/UDAF APIs. The unsupported APIs are:

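getRequiredJars and getRequiredFiles (UDF and GenericUDF) are functions that automatically include additional resources required by this UDF.
initialize(StructObjectInspector) in GenericUDTF is not supported yet. Spark SQL currently uses only the deprecated interface initialize(ObjectInspector[]).
configure (GenericUDF, GenericUDTF, and GenericUDAFEvaluator) is a function that initializes functions with MapredContext, which does not apply to Spark.
close (GenericUDF and GenericUDAFEvaluator) is a function that releases associated resources. Spark SQL does not call this function when tasks finish.
reset (GenericUDAFEvaluator) is a function that re-initializes an aggregation so that the same aggregation can be reused. Spark SQL currently does not support aggregation reuse.
getWindowingEvaluator (GenericUDAFEvaluator) is a function that optimizes aggregation by evaluating an aggregate over a fixed window.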