1. Setup
Hudi supports Spark 2.x. You can install Spark via the link below, then launch it with pyspark:
# pyspark
export PYSPARK_PYTHON=$(which python3)
spark-2.4.4-bin-hadoop2.7/bin/pyspark \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
- The spark-avro module must be specified explicitly via --packages.
- The spark-avro version must match the Spark version.
- In this example, since we depend on spark-avro_2.11, the hudi-spark-bundle built with Scala 2.11 is used; if you use spark-avro_2.12, use hudi-spark-bundle_2.12 instead.
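If you prefer to create the SparkSession from a standalone script rather than the pyspark shell, the same packages and serializer can also be set programmatically. This is only a sketch, assuming the Scala 2.11 artifacts above and that no SparkSession has been created yet:

from pyspark.sql import SparkSession

# Sketch: equivalent to the --packages / --conf flags above. Note that
# spark.jars.packages only takes effect if it is set before the underlying
# SparkContext (JVM) is started.
spark = SparkSession.builder \
    .appName("hudi-quickstart") \
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,"
            "org.apache.spark:spark-avro_2.11:2.4.4") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()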
Initialize a few variables up front:
# pyspark
tableName = "hudi_trips_cow"
basePath = "file:///tmp/hudi_trips_cow"
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
DataGenerator can be used to generate sample insert and update data based on the trips schema.
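To get a feel for what the generated data looks like, you can peek at a single generated record. This is just an optional sanity check; the fields mentioned in the comment are the ones queried later in this post:

# pyspark
# Optional: inspect one generated record, a JSON string with fields such as
# uuid, ts, rider, driver, fare, begin_lat/begin_lon and partitionpath.
sample = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(1))
print(sample[0])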
2. Insert data
Generate some new trip data, load it into a DataFrame, and write the DataFrame to the Hudi table:
# pyspark
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(10))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))

hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

df.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(basePath)
mode(Overwrite) overwrites and recreates the table if it already exists. The example provides a record key (uuid in the schema), a partition field (region/country/city), and a precombine field (ts in the schema) to ensure trip records are unique within each partition.
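It can help to confirm that the fields referenced by hudi_options are ordinary columns of the source DataFrame; a quick, optional check:

# pyspark
# The record key (uuid), partition path (partitionpath) and precombine field (ts)
# configured above are all plain columns of the generated trips data.
df.printSchema()
df.select("uuid", "partitionpath", "ts").show(3, truncate=False)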
3. Query data
Load the data into a DataFrame:
# pyspark
tripsSnapshotDF = spark. \
    read. \
    format("hudi"). \
    load(basePath + "/*/*/*/*")

tripsSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")

spark.sql("select fare, begin_lon, begin_lat, ts from hudi_trips_snapshot where fare > 20.0").show()
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_trips_snapshot").show()
This query provides a snapshot (read-optimized) view of the data. Since our partition path is of the form region/country/city, we load the data starting from the base path (basePath) with load(basePath + "/*/*/*/*").
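Because partitionpath is stored as an ordinary column in the snapshot view, you can also restrict a query to a subset of partitions with a plain filter. A small sketch (the americas/... prefix assumes the default partition values generated by DataGenerator; the query simply returns nothing if no records landed there):

# pyspark
# Filter the snapshot view down to partitions whose path starts with 'americas'.
spark.sql("select uuid, partitionpath, fare from hudi_trips_snapshot where partitionpath like 'americas/%'").show(truncate=False)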
4. Update data
Similar to inserting new data, generate updates with DataGenerator, load them into a DataFrame, and write the DataFrame to the Hudi table:
# pyspark
updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(10))
df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(basePath)
Note that the save mode is now append. In general, always use append mode unless you are creating the table for the first time. Each write operation produces a new commit, identified by a timestamp.
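To see the commits accumulating, you can look at the table's timeline on disk. A small sketch, assuming the local basePath used in this example; for a copy-on-write table, completed commits appear as *.commit files under the .hoodie directory:

# pyspark
import os
# After the insert and the update above there should be two completed commits.
print(sorted(f for f in os.listdir("/tmp/hudi_trips_cow/.hoodie") if f.endswith(".commit")))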
5. Incremental query
Hudi provides incremental pull, i.e. the ability to pull only the changes made after a given commit time; if no end time is specified, changes up to the latest commit are pulled.
# pyspark
# reload data
spark. \
    read. \
    format("hudi"). \
    load(basePath + "/*/*/*/*"). \
    createOrReplaceTempView("hudi_trips_snapshot")

commits = list(map(lambda row: row[0], spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_trips_snapshot order by commitTime").limit(50).collect()))
beginTime = commits[len(commits) - 2]  # commit time we are interested in

# incrementally query data
incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': beginTime,
}

tripsIncrementalDF = spark.read.format("hudi"). \
    options(**incremental_read_options). \
    load(basePath)
tripsIncrementalDF.createOrReplaceTempView("hudi_trips_incremental")

spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_incremental where fare > 20.0").show()
This queries all changes committed after the begin time; this incremental pull capability can be used to build streaming pipelines on top of batch data.
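As a sketch of how such a pipeline might use this, the begin instant can be treated as a checkpoint that is advanced after each pull. pull_changes_since below is a hypothetical helper, not a Hudi API, and persisting the checkpoint (file, metastore, etc.) is left to the caller:

# pyspark
def pull_changes_since(last_commit_time):
    # Hypothetical helper: incremental read of everything committed after last_commit_time.
    return spark.read.format("hudi") \
        .option('hoodie.datasource.query.type', 'incremental') \
        .option('hoodie.datasource.read.begin.instanttime', last_commit_time) \
        .load(basePath)

changes = pull_changes_since(beginTime)
changes.select("_hoodie_commit_time", "uuid", "fare").show()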
6. Point-in-time query
To query the data as of a specific point in time, point the end time at a specific commit time and set the begin time to "000" (the earliest commit time).
# pyspark
beginTime = "000"  # Represents all commits > this time.
endTime = commits[len(commits) - 2]

# query point in time data
point_in_time_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.end.instanttime': endTime,
    'hoodie.datasource.read.begin.instanttime': beginTime
}

tripsPointInTimeDF = spark.read.format("hudi"). \
    options(**point_in_time_read_options). \
    load(basePath)

tripsPointInTimeDF.createOrReplaceTempView("hudi_trips_point_in_time")
spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_trips_point_in_time where fare > 20.0").show()
7. Delete data
Delete the records for the set of HoodieKeys passed in. Note: the delete operation only supports append mode.
# pyspark
# fetch total records count
spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
# fetch two records to be deleted
ds = spark.sql("select uuid, partitionPath from hudi_trips_snapshot").limit(2)

# issue deletes
hudi_delete_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'delete',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2
}

from pyspark.sql.functions import lit
deletes = list(map(lambda row: (row[0], row[1]), ds.collect()))
# Column names must follow the (uuid, partitionPath) order selected above;
# otherwise the record keys and partition paths end up swapped and nothing is deleted.
df = spark.sparkContext.parallelize(deletes).toDF(['uuid', 'partitionpath']).withColumn('ts', lit(0.0))
df.write.format("hudi"). \
    options(**hudi_delete_options). \
    mode("append"). \
    save(basePath)

# run the same read query as above.
roAfterDeleteViewDF = spark. \
    read. \
    format("hudi"). \
    load(basePath + "/*/*/*/*")
roAfterDeleteViewDF.registerTempTable("hudi_trips_snapshot")
# fetch should return (total - 2) records
spark.sql("select uuid, partitionPath from hudi_trips_snapshot").count()
8. Summary
This post showed how to use pyspark to insert, update, and delete data in a Hudi table. If you work with pyspark and Hudi, give it a try!