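Checking Spark partitions: how Spark determines the number of partitions

Reposted

jowvid 2024-04-17 10:39:05

<1> When the RDD data comes from memory

I. Default number of in-memory partitions when running Spark in local mode

II. So what is the value of totalCores?

<2> When the RDD data comes from memory and a partition count is specified

<3> When the RDD data comes from files rather than memory

<4> Summary

<1> When the RDD data comes from memory

Start with the code from IDEA. Here makeRDD builds the RDD from an in-memory collection.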

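import org.apache.spark.{SparkConf, SparkContext}

object Spark01_Partition {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("Partition")

        // Build the Spark context
        val sc = new SparkContext(conf)
        // No numSlices argument: the partition count falls back to defaultParallelism
        val arrayRDD = sc.makeRDD(Array(1,2,3,4))
        println(arrayRDD.partitions.length)

        // Release resources
        sc.stop()
    }
}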

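Printed number of partitions of arrayRDD:

(1) with setMaster("local"): 1

(2) with setMaster("local[4]"): 4

(3) with setMaster("local[*]"): 8 (this machine has 4 cores and 8 threads, so the OS reports 8 logical processors)

Next, we trace the source code to see what determines the number of RDD partitions when Spark runs in local mode.

I. Default number of in-memory partitions when running Spark in local mode

1. Look at the makeRDD source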

def makeRDD[T: ClassTag](
      seq: Seq[T],
      numSlices: Int = defaultParallelism): RDD[T] = withScope {
    parallelize(seq, numSlices)
  }

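defaultParallelism: the default degree of parallelism.

Each partition is processed by one task, and tasks can run in parallel. When numSlices is not passed, defaultParallelism is used as the number of slices, so it determines the number of partitions (and therefore tasks), not the number of Executors.

2. Keep tracing into defaultParallelism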

def defaultParallelism: Int = {
    assertNotStopped()
    taskScheduler.defaultParallelism
  }

3. Trace into taskScheduler

private[spark] def taskScheduler: TaskScheduler = _taskScheduler
  private[spark] def taskScheduler_=(ts: TaskScheduler): Unit = {
    _taskScheduler = ts
  }

4. Press Ctrl+H to find the implementation class of TaskScheduler (TaskSchedulerImpl), then Ctrl+F for defaultParallelism:

override def defaultParallelism(): Int = backend.defaultParallelism()

5. backend is the scheduler backend (a SchedulerBackend); look up the source of its defaultParallelism method

private[spark] trait SchedulerBackend {
  private val appId = "spark-application-" + System.currentTimeMillis

6. SchedulerBackend is also a trait. Looking up its implementations turns up three:

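They are: CoarseGrainedSchedulerBackend (coarse-grained, used on clusters such as YARN), StandaloneSchedulerBackend (standalone mode) and LocalSchedulerBackend (local mode).

7. The current mode is local, so open the LocalSchedulerBackend implementation and search for defaultParallelism again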

override def defaultParallelism(): Int =
    scheduler.conf.getInt("spark.default.parallelism", totalCores)

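If spark.default.parallelism is set in the Spark configuration, that value is used; otherwise it falls back to totalCores. This program does not set it, so defaultParallelism equals totalCores, which means totalCores determines the default number of partitions of an in-memory RDD.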
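The fallback rule can be tried out without starting Spark. The following is a minimal sketch, assuming a plain Map stands in for SparkConf; the object name DefaultParallelismSketch and the settings parameter are made up for illustration and are not part of Spark.

```scala
// Simplified stand-in for the lookup in LocalSchedulerBackend.defaultParallelism.
// `settings` is a hypothetical plain Map used in place of SparkConf.
object DefaultParallelismSketch {
  def defaultParallelism(settings: Map[String, String], totalCores: Int): Int =
    settings.get("spark.default.parallelism").map(_.toInt).getOrElse(totalCores)

  def main(args: Array[String]): Unit = {
    // Nothing configured: the value falls back to totalCores.
    println(defaultParallelism(Map.empty, totalCores = 8))
    // An explicit setting takes precedence over totalCores.
    println(defaultParallelism(Map("spark.default.parallelism" -> "3"), totalCores = 8))
  }
}
```

In a real application the setting would be supplied through SparkConf, for example new SparkConf().set("spark.default.parallelism", "3").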

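II. So what is the value of totalCores?

8. Search the SparkContext source for createTaskScheduler (line 2496 in the version used here), which creates the task scheduler; one task is created per partition. The master argument is the String passed to setMaster.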

private def createTaskScheduler(
      sc: SparkContext,
      master: String,
      deployMode: String): (SchedulerBackend, TaskScheduler) = {
    import SparkMasterRegex._

    // When running locally, don't try to re-execute tasks on failure.
    val MAX_LOCAL_TASK_FAILURES = 1

    master match {
      case "local" =>
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_N_REGEX(threads) =>
        def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
        // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
        val threadCount = if (threads == "*") localCpuCount else threads.toInt
        if (threadCount <= 0) {
          throw new SparkException(s"Asked to run locally with $threadCount threads")
        }
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
        scheduler.initialize(backend)
        (backend, scheduler)

There are several cases:

(1) When master is local: val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1). Now look at the LocalSchedulerBackend source

private[spark] class LocalSchedulerBackend(
    conf: SparkConf,
    scheduler: TaskSchedulerImpl,
    val totalCores: Int)

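Here totalCores is 1.

So with master local, an in-memory RDD gets 1 partition by default.

(2) When master is local[K]: LOCAL_N_REGEX(threads) extracts threads = K, so threadCount = threads.toInt and totalCores = K.

So with master local[K], an in-memory RDD gets K partitions by default.

(3) When master is local[*]: threadCount = localCpuCount, i.e. Runtime.getRuntime.availableProcessors(), and totalCores in LocalSchedulerBackend = localCpuCount.

So with master local[*], an in-memory RDD gets as many partitions by default as the machine has logical processors (hardware threads, not physical cores).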
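The three local cases can be condensed into one small function. This is a sketch only: the regular expression is an assumption about the shape of LOCAL_N_REGEX, and LocalMasterSketch and totalCores are illustrative names, not Spark APIs.

```scala
object LocalMasterSketch {
  // Assumed to have the same shape as the pattern Spark uses for local[N] and local[*].
  private val LocalN = """local\[([0-9]+|\*)\]""".r

  // Returns the totalCores value for a local master string, or None for anything else.
  def totalCores(master: String, availableProcessors: Int): Option[Int] = master match {
    case "local" => Some(1)
    case LocalN(threads) =>
      val n = if (threads == "*") availableProcessors else threads.toInt
      require(n > 0, s"Asked to run locally with $n threads")
      Some(n)
    case _ => None // non-local masters are outside the scope of this sketch
  }

  def main(args: Array[String]): Unit = {
    val cpus = Runtime.getRuntime.availableProcessors()
    Seq("local", "local[4]", "local[*]", "yarn").foreach { m =>
      println(s"$m -> ${totalCores(m, cpus)}")
    }
  }
}
```

The value printed for local[*] depends on the machine; on the author's 4-core, 8-thread machine it would be 8.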

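(4) If master is not a local one, the remaining branches of createTaskScheduler handle it; they are not covered here.

<2> When the RDD data comes from memory and a partition count is specified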

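import org.apache.spark.{SparkConf, SparkContext}

object Spark01_Partition {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("Partition")

        // Build the Spark context
        val sc = new SparkContext(conf)
        // The second argument (numSlices) sets the partition count explicitly
        val arrayRDD = sc.makeRDD(Array(1,2,3,4),2)
        println(arrayRDD.partitions.length)

        // Release resources
        sc.stop()
    }
}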

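Here makeRDD is given 2 as the partition count. Whatever is passed to setMaster (local, local[K] or local[*]), the printed number of partitions is 2.

Conclusion: if makeRDD is given a partition count, that count is used.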
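To make the effect of numSlices concrete, here is a hedged sketch of how a sequence could be cut into numSlices contiguous ranges. It assumes the even start/end arithmetic that Spark applies to parallelized collections (the real logic lives in ParallelCollectionRDD); SliceSketch and slice are illustrative names.

```scala
object SliceSketch {
  // Cuts seq into numSlices contiguous pieces using integer start/end positions.
  // Assumption: slice i covers [i * length / numSlices, (i + 1) * length / numSlices).
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "numSlices must be positive")
    (0 until numSlices).map { i =>
      val start = ((i.toLong * seq.length) / numSlices).toInt
      val end = (((i + 1).toLong * seq.length) / numSlices).toInt
      seq.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // Four elements in two slices: (1, 2) and (3, 4).
    println(slice(Seq(1, 2, 3, 4), 2))
    // Five elements in three slices: sizes 1, 2 and 2.
    println(slice(Seq(1, 2, 3, 4, 5), 3))
  }
}
```

The number of slices returned always equals numSlices, which matches the observation that the master setting no longer matters once the count is passed explicitly.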

<3> When the RDD data comes from files rather than memory

Create two txt files in the E:\input directory

1.txt contains 12, which is 2 bytes;

2.txt contains 1\r\n2, which is 4 bytes (the Windows line break takes 2 bytes)

import org.apache.spark.{SparkConf, SparkContext}

object Spark01_Partition {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("Partition")

        // Build the Spark context
        val sc = new SparkContext(conf)
        // Read every file under the directory; minPartitions is left at its default
        val fileRDD = sc.textFile("E:\\input")

        println(fileRDD.partitions.length)

        // Release resources
        sc.stop()
    }
}

The output is 3: 1.txt and 2.txt are split into 3 partitions.

1. Look at the textFile source

def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

2. minPartitions, the minimum number of partitions, defaults to defaultMinPartitions; click through to its source

def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

3. defaultMinPartitions is the smaller of defaultParallelism and 2. Trace into defaultParallelism

def defaultParallelism: Int = {
    assertNotStopped()
    taskScheduler.defaultParallelism
  }

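4. defaultParallelism is the default parallelism. The master is local[*], so by the conclusion in <1> defaultParallelism is 8 on this machine.

So defaultMinPartitions (the minimum number of partitions) is 2. That is only the minimum; what is the actual number of partitions? Read on.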
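Because of the min with 2, the default minimum only differs from 2 when the default parallelism is 1. A tiny sketch of the rule, with MinPartitionsSketch as an illustrative name and the parallelism passed in as a plain Int:

```scala
object MinPartitionsSketch {
  // Same expression as SparkContext.defaultMinPartitions, with the parallelism as a parameter.
  def defaultMinPartitions(defaultParallelism: Int): Int =
    math.min(defaultParallelism, 2)

  def main(args: Array[String]): Unit = {
    // 1 corresponds to master "local", 4 to "local[4]", 8 to "local[*]" on an 8-thread machine.
    Seq(1, 4, 8).foreach { p =>
      println(s"defaultParallelism=$p -> defaultMinPartitions=${defaultMinPartitions(p)}")
    }
  }
}
```

So with master local the default minPartitions would be 1, while local[4] and local[*] both give 2.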

hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)

5. The partitioning follows Hadoop's input split rules. Open TextInputFormat

public class TextInputFormat extends FileInputFormat<LongWritable, Text>

6. Click through to the parent class FileInputFormat and find this line in getSplits

long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);

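totalSize is the total number of bytes under the input directory: 6

numSplits is the requested number of splits, which is minPartitions = 2, so goalSize = 6 / 2 = 3

7. Next, compute the size of each split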

long splitSize = computeSplitSize(goalSize, minSize, blockSize);

goalSize = 3, minSize = 1, blockSize = 32 MB (the file's block size: 32 MB for local files, 128 MB on HDFS). To see how large each split actually is, click through to computeSplitSize

protected long computeSplitSize(long goalSize, long minSize,
                                       long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

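8. The return value is max(1, min(3, 32 MB)) = 3, so the target split size is 3 bytes.

9. 1.txt is 2 bytes and 2.txt is 4 bytes. Splits never span files, so 1.txt yields one split and 2.txt yields two (3 bytes plus the remaining 1 byte), 3 partitions in total. How is the data distributed across them?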
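The arithmetic of steps 6 to 9 can be replayed in plain Scala. This is a simplified sketch, not Hadoop's code: it assumes the 1.1 slop factor that the old-API FileInputFormat applies before cutting another split, ignores zero-length files, and uses made-up names (SplitSketch, splitsForFile, plan).

```scala
import scala.collection.mutable.ListBuffer

object SplitSketch {
  // Assumption: a file tail is only cut off as a new split while
  // remaining / splitSize exceeds this factor (10% slop).
  val SplitSlop = 1.1

  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  // Lengths of the splits produced for a single file of `length` bytes.
  def splitsForFile(length: Long, splitSize: Long): List[Long] = {
    val out = ListBuffer.empty[Long]
    var remaining = length
    while (remaining.toDouble / splitSize > SplitSlop) {
      out += splitSize
      remaining -= splitSize
    }
    if (remaining != 0) out += remaining
    out.toList
  }

  // Split lengths per file; splits never span files.
  def plan(fileSizes: Seq[Long], minPartitions: Int,
           minSize: Long = 1L, blockSize: Long = 32L * 1024 * 1024): Seq[List[Long]] = {
    val totalSize = fileSizes.sum
    val goalSize = totalSize / (if (minPartitions == 0) 1 else minPartitions)
    val splitSize = computeSplitSize(goalSize, minSize, blockSize)
    fileSizes.map(splitsForFile(_, splitSize))
  }

  def main(args: Array[String]): Unit = {
    // 1.txt is 2 bytes, 2.txt is 4 bytes, minPartitions is 2.
    val splits = plan(Seq(2L, 4L), minPartitions = 2)
    println(splits)
    println(s"partitions: ${splits.map(_.size).sum}")
  }
}
```

Under these assumptions the 2-byte file stays whole and the 4-byte file is cut into 3 + 1 bytes, which agrees with the 3 partitions printed earlier. Calling plan with minPartitions = 1 gives a split size of 6 and one split per file, so 2 partitions.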

import org.apache.spark.{SparkConf, SparkContext}

object Spark01_Partition {
    def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[*]").setAppName("Partition")

        // Build the Spark context
        val sc = new SparkContext(conf)
        val fileRDD = sc.textFile("E:\\input")

        // Write each partition to its own file; the output directory must not already exist
        fileRDD.saveAsTextFile("E:\\output")

        println(fileRDD.partitions.length)

        // Release resources
        sc.stop()
    }
}

Run the program to write each partition out as a file under the output directory on drive E.
