Spark SQL dynamic partitioning: partitioned reads from MySQL in Spark

Reposted

风轻云淡的开发 2023-05-29 13:57:59

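As data grows and we cannot keep adding hardware indefinitely, we need to make use of RDD partitions. The job of fetching one large table is split into many tasks, each of which fetches only a small slice of the data. Because the slices are pulled over several connections at the same time, the read as a whole finishes faster.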

My current cluster: 1 master (8 GB) and 3 slaves (8 GB each).

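import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// `spark` is an existing SparkSession; JDBCUtil and datasourceModel are this project's own helpers, not part of Spark.
Dataset<Row> dataset = spark.read().format("jdbc")
        .option("url", JDBCUtil.getJdbcUrl(datasourceModel))
        .option("dbtable", tableName)
        .option("user", datasourceModel.getUserName())
        .option("password", datasourceModel.getPassword())
        .option("partitionColumn", "ID")   // numeric column used to split the read into ranges
        .option("lowerBound", 10000)       // with upperBound, sets the stride only; rows are not filtered
        .option("upperBound", 100000000)
        .option("numPartitions", 10000)    // number of range queries (tasks); those running at once are capped by executor cores
        .load();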

What the parameters mean (from the Spark documentation):

`partitionColumn, lowerBound, upperBound`	These options must all be specified if any of them is specified. In addition, `numPartitions` must be specified. They describe how to partition the table when reading in parallel from multiple workers. `partitionColumn` must be a numeric column from the table in question. Notice that `lowerBound` and `upperBound` are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.
`numPartitions`	The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling `coalesce(numPartitions)` before writing.
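
To make the "stride, not filter" point concrete, here is a minimal sketch of how range predicates can be derived from `lowerBound`, `upperBound` and `numPartitions`. It is a simplified approximation of the behavior described above, not Spark's actual code (recent Spark versions compute the stride with more precision). The class name `PartitionStrideSketch` and the method `whereClauses` are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Spark: approximates how a partitioned
// JDBC read turns the three options into one WHERE clause per partition.
class PartitionStrideSketch {

    static List<String> whereClauses(String column, long lowerBound,
                                     long upperBound, int numPartitions) {
        if (numPartitions < 1 || lowerBound >= upperBound) {
            throw new IllegalArgumentException("need numPartitions >= 1 and lowerBound < upperBound");
        }
        // Never create more partitions than there are integer values in the range.
        int parts = (int) Math.min(numPartitions, upperBound - lowerBound);
        // Simplified stride: the width of each partition's value range.
        long stride = upperBound / parts - lowerBound / parts;

        List<String> clauses = new ArrayList<>();
        long current = lowerBound;
        for (int i = 0; i < parts; i++) {
            // The first partition has no lower limit and the last has no upper
            // limit, so rows outside [lowerBound, upperBound) are still read.
            String lower = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upper = (i == parts - 1) ? null : column + " < " + current;

            if (lower == null && upper == null) {
                clauses.add("1 = 1"); // single partition: whole table, no filter
            } else if (lower == null) {
                clauses.add(upper + " OR " + column + " IS NULL"); // NULLs go to the first partition
            } else if (upper == null) {
                clauses.add(lower);
            } else {
                clauses.add(lower + " AND " + upper);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // Small example: the range 0..100 split into 4 partitions.
        whereClauses("ID", 0L, 100L, 4).forEach(System.out::println);
    }
}
```

With the values used in this post (`lowerBound` 10000, `upperBound` 100000000, `numPartitions` 10000), this sketch yields a stride of 9999, so each task reads roughly ten thousand IDs. If the real IDs fall mostly outside the bounds, the open-ended first or last partition ends up with most of the rows, which is why the bounds should roughly match the actual minimum and maximum of the column.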