Table of Contents
- 1. Problem Background
- 2. Test Code
- 3. The Generated DAG
- 1. job0
- 2. job1
- 4. Source Code Analysis of When job0 Is Created
- 1. Calling DataFrameReader.load and DataFrameReader.loadV1Source
- 2. Calling DataSource.resolveRelation
- 3. Calling DataSource.getOrInferFileFormatSchema()
- 4. InMemoryFileIndex Initialization
- 5. Calling InMemoryFileIndex.bulkListLeafFiles
- 1. paths.size Decides Whether a Job Is Generated
- 2. The list-files job0
- 1. Setting the Job Description
- 2. Creating and Running the Job
- 5. Call Chain Summary
1. Problem Background
While testing a Spark job, I noticed that reading multiple files under a directory produces one more job in the Spark DAG than reading a single file directly. Below is a source-level analysis of why. Since the analysis is fairly long, it is split into two articles: this first one walks through where job0 comes from, and the second one covers job1.
2. Test Code
import org.apache.spark.SparkConf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.elasticsearch.hadoop.cfg.ConfigurationOptions;

public class UserProfileTest {
    static String filePath = "hdfs:///user/daily/20200828/*.parquet";

    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf()
                .setMaster("local")
                .setAppName("user_profile_test")
                .set(ConfigurationOptions.ES_NODES, "")
                .set(ConfigurationOptions.ES_PORT, "")
                .set(ConfigurationOptions.ES_MAPPING_ID, "uid");
        // The question: why does this read produce extra jobs?
        SparkSession sparkSession = SparkSession.builder().config(sparkConf).getOrCreate();
        Dataset<Row> userProfileSource = sparkSession.read().parquet(filePath);
        userProfileSource.count();
        userProfileSource.write().parquet("hdfs:///user/daily/result2020082808/");
    }
}
3. The Generated DAG
Here we can see that the single line
Dataset<Row> userProfileSource = sparkSession.read().parquet(filePath);
produces two jobs, and these two jobs are all we focus on here.
Below is the relevant part of the DAG, enlarged.
1. job0
job0's Description is
Listing leaf files and directories for 100 paths:
hdfs://hadoop-01:9000/user/daily/20200828/part-00000-0e0dc5b5-5061-41ca-9fa6-9fb7b3e09e98-c000.snappy.parquet, ...
parquet at UserProfileTest.java:26
job0 has 100 partitions.
2. job1
job1's Description is
parquet at UserProfileTest.java:26
We want to know when these two jobs are created and why there is this difference.
4. Source Code Analysis of When job0 Is Created
1. Calling DataFrameReader.load and DataFrameReader.loadV1Source
sparkSession.read().parquet(filePath)
eventually reaches the DataFrameReader.load method. When the conditional checks are evaluated, execution falls into the final else branch and calls loadV1Source:
/**
* Loads input in as a `DataFrame`, for data sources that support multiple paths.
* Only works if the source is a HadoopFsRelationProvider.
*
* @since 1.6.0
*/
@scala.annotation.varargs
def load(paths: String*): DataFrame = {
if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
throw new AnalysisException("Hive data source can only be used with tables, you can not " +
"read files of Hive data source directly.")
}
val cls = DataSource.lookupDataSource(source, sparkSession.sessionState.conf)
if (classOf[DataSourceV2].isAssignableFrom(cls)) {
val ds = cls.newInstance()
val options = new DataSourceOptions((extraOptions ++
DataSourceV2Utils.extractSessionConfigs(
ds = ds.asInstanceOf[DataSourceV2],
conf = sparkSession.sessionState.conf)).asJava)
// Streaming also uses the data source V2 API. So it may be that the data source implements
// v2, but has no v2 implementation for batch reads. In that case, we fall back to loading
// the dataframe as a v1 source.
val reader = (ds, userSpecifiedSchema) match {
case (ds: ReadSupportWithSchema, Some(schema)) =>
ds.createReader(schema, options)
case (ds: ReadSupport, None) =>
ds.createReader(options)
case (ds: ReadSupportWithSchema, None) =>
throw new AnalysisException(s"A schema needs to be specified when using $ds.")
case (ds: ReadSupport, Some(schema)) =>
val reader = ds.createReader(options)
if (reader.readSchema() != schema) {
throw new AnalysisException(s"$ds does not allow user-specified schemas.")
}
reader
case _ => null // fall back to v1
}
if (reader == null) {
loadV1Source(paths: _*)
} else {
Dataset.ofRows(sparkSession, DataSourceV2Relation(reader))
}
} else {
// execution ends up here
loadV1Source(paths: _*)
}
}
load then calls this method:
private def loadV1Source(paths: String*) = {
// Code path for data source v1.
sparkSession.baseRelationToDataFrame(
DataSource.apply(
sparkSession,
paths = paths,
userSpecifiedSchema = userSpecifiedSchema,
className = source,
options = extraOptions.toMap).resolveRelation())
}
Inside loadV1Source a DataSource object is constructed. The DataSource.apply(...) call works because DataSource is a case class, so the compiler generates a companion object that defines apply and unapply.
The resolveRelation() method is then called on the resulting DataSource object.
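As a quick illustration of this Scala mechanism (a minimal standalone sketch, not Spark code; the Box class and ApplyDemo object are made up for the example), calling a case class name like a function is just sugar for the compiler-generated companion apply:
// Hypothetical case class, only to illustrate the compiler-generated companion apply
case class Box(value: Int)

object ApplyDemo {
  def main(args: Array[String]): Unit = {
    val a = Box(42)        // sugar for Box.apply(42); no `new` needed
    val b = Box.apply(42)  // the explicit form of the same call
    println(a == b)        // prints true; case classes also get structural equality
  }
}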
2. Calling DataSource.resolveRelation
/**
* Create a resolved [[BaseRelation]] that can be used to read data from or write data into this
* [[DataSource]]
*
* @param checkFilesExist Whether to confirm that the files exist when generating the
* non-streaming file based datasource. StructuredStreaming jobs already
* list file existence, and when generating incremental jobs, the batch
* is considered as a non-streaming file based data source. Since we know
* that files already exist, we don't need to check them again.
*/
def resolveRelation(checkFilesExist: Boolean = true): BaseRelation = {
val relation = (providingClass.newInstance(), userSpecifiedSchema) match {
// TODO: Throw when too much is given.
case (dataSource: SchemaRelationProvider, Some(schema)) =>
dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
case (dataSource: RelationProvider, None) =>
dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
case (_: SchemaRelationProvider, None) =>
throw new AnalysisException(s"A schema needs to be specified when using $className.")
case (dataSource: RelationProvider, Some(schema)) =>
val baseRelation =
dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
if (baseRelation.schema != schema) {
throw new AnalysisException(s"$className does not allow user-specified schemas.")
}
baseRelation
// We are reading from the results of a streaming query. Load files from the metadata log
// instead of listing them using HDFS APIs.
case (format: FileFormat, _)
if FileStreamSink.hasMetadata(
caseInsensitiveOptions.get("path").toSeq ++ paths,
sparkSession.sessionState.newHadoopConf()) =>
val basePath = new Path((caseInsensitiveOptions.get("path").toSeq ++ paths).head)
val tempFileCatalog = new MetadataLogFileIndex(sparkSession, basePath, None)
val fileCatalog = if (userSpecifiedSchema.nonEmpty) {
val partitionSchema = combineInferredAndUserSpecifiedPartitionSchema(tempFileCatalog)
new MetadataLogFileIndex(sparkSession, basePath, Option(partitionSchema))
} else {
tempFileCatalog
}
val dataSchema = userSpecifiedSchema.orElse {
format.inferSchema(
sparkSession,
caseInsensitiveOptions,
fileCatalog.allFiles())
}.getOrElse {
throw new AnalysisException(
s"Unable to infer schema for $format at ${fileCatalog.allFiles().mkString(",")}. " +
"It must be specified manually")
}
HadoopFsRelation(
fileCatalog,
partitionSchema = fileCatalog.partitionSchema,
dataSchema = dataSchema,
bucketSpec = None,
format,
caseInsensitiveOptions)(sparkSession)
// This is a non-streaming file based datasource.
// this is the case that eventually matches
case (format: FileFormat, _) =>
val allPaths = caseInsensitiveOptions.get("path") ++ paths
val hadoopConf = sparkSession.sessionState.newHadoopConf()
val globbedPaths = allPaths.flatMap(
DataSource.checkAndGlobPathIfNecessary(hadoopConf, _, checkFilesExist)).toArray
val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
// this is where getOrInferFileFormatSchema gets called
val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format, fileStatusCache)
val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions &&
catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) {
val defaultTableSize = sparkSession.sessionState.conf.defaultSizeInBytes
new CatalogFileIndex(
sparkSession,
catalogTable.get,
catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
} else {
new InMemoryFileIndex(
sparkSession, globbedPaths, options, Some(partitionSchema), fileStatusCache)
}
HadoopFsRelation(
fileCatalog,
partitionSchema = partitionSchema,
dataSchema = dataSchema.asNullable,
bucketSpec = bucketSpec,
format,
caseInsensitiveOptions)(sparkSession)
case _ =>
throw new AnalysisException(
s"$className is not a valid Spark SQL Data Source.")
}
relation match {
case hs: HadoopFsRelation =>
SchemaUtils.checkColumnNameDuplication(
hs.dataSchema.map(_.name),
"in the data schema",
equality)
SchemaUtils.checkColumnNameDuplication(
hs.partitionSchema.map(_.name),
"in the partition schema",
equality)
case _ =>
SchemaUtils.checkColumnNameDuplication(
relation.schema.map(_.name),
"in the data schema",
equality)
}
relation
}
The line
val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format, fileStatusCache)
in the method above calls getOrInferFileFormatSchema:
3. Calling DataSource.getOrInferFileFormatSchema()
private def getOrInferFileFormatSchema(
format: FileFormat,
fileStatusCache: FileStatusCache = NoopCache): (StructType, StructType) = {
// the operations below are expensive therefore try not to do them if we don't need to, e.g.,
// in streaming mode, we have already inferred and registered partition columns, we will
// never have to materialize the lazy val below
// this is a lazy val, so it is only initialized when it is actually used
lazy val tempFileIndex = {
val allPaths = caseInsensitiveOptions.get("path") ++ paths
val hadoopConf = sparkSession.sessionState.newHadoopConf()
val globbedPaths = allPaths.toSeq.flatMap { path =>
val hdfsPath = new Path(path)
val fs = hdfsPath.getFileSystem(hadoopConf)
val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
SparkHadoopUtil.get.globPathIfNecessary(fs, qualified)
}.toArray
// the InMemoryFileIndex is constructed here, i.e. this is where the first job comes from
new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
}
val partitionSchema = if (partitionColumns.isEmpty) {
// Try to infer partitioning, because no DataSource in the read path provides the partitioning
// columns properly unless it is a Hive DataSource
// this is the first real use of the lazy tempFileIndex, which triggers the InMemoryFileIndex initialization
combineInferredAndUserSpecifiedPartitionSchema(tempFileIndex)
} else {
// maintain old behavior before SPARK-18510. If userSpecifiedSchema is empty used inferred
// partitioning
if (userSpecifiedSchema.isEmpty) {
val inferredPartitions = tempFileIndex.partitionSchema
inferredPartitions
} else {
val partitionFields = partitionColumns.map { partitionColumn =>
userSpecifiedSchema.flatMap(_.find(c => equality(c.name, partitionColumn))).orElse {
val inferredPartitions = tempFileIndex.partitionSchema
val inferredOpt = inferredPartitions.find(p => equality(p.name, partitionColumn))
if (inferredOpt.isDefined) {
logDebug(
s"""Type of partition column: $partitionColumn not found in specified schema
|for $format.
|User Specified Schema
|=====================
|${userSpecifiedSchema.orNull}
|
|Falling back to inferred dataType if it exists.
""".stripMargin)
}
inferredOpt
}.getOrElse {
throw new AnalysisException(s"Failed to resolve the schema for $format for " +
s"the partition column: $partitionColumn. It must be specified manually.")
}
}
StructType(partitionFields)
}
}
val dataSchema = userSpecifiedSchema.map { schema =>
StructType(schema.filterNot(f => partitionSchema.exists(p => equality(p.name, f.name))))
}.orElse {
format.inferSchema(
sparkSession,
caseInsensitiveOptions,
tempFileIndex.allFiles())
}.getOrElse {
throw new AnalysisException(
s"Unable to infer schema for $format. It must be specified manually.")
}
// We just print a waring message if the data schema and partition schema have the duplicate
// columns. This is because we allow users to do so in the previous Spark releases and
// we have the existing tests for the cases (e.g., `ParquetHadoopFsRelationSuite`).
// See SPARK-18108 and SPARK-21144 for related discussions.
try {
SchemaUtils.checkColumnNameDuplication(
(dataSchema ++ partitionSchema).map(_.name),
"in the data schema and the partition schema",
equality)
} catch {
case e: AnalysisException => logWarning(e.getMessage)
}
(dataSchema, partitionSchema)
}
This is where
new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
gets evaluated.
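The timing matters because tempFileIndex is a lazy val: the InMemoryFileIndex, and therefore the listing job, is only created the first time the value is referenced (here, by combineInferredAndUserSpecifiedPartitionSchema). A minimal sketch of that Scala behavior, illustrative only and not Spark code:
object LazyDemo {
  lazy val index = {
    println("building index...")   // in Spark this is where the InMemoryFileIndex would be constructed
    Seq("part-00000.parquet", "part-00001.parquet")
  }

  def main(args: Array[String]): Unit = {
    println("before first use")    // nothing has been built yet
    println(index.size)            // first reference: "building index..." is printed here
    println(index.size)            // second reference: the cached value is reused
  }
}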
4. InMemoryFileIndex Initialization
Next, let's look at the InMemoryFileIndex class.
class InMemoryFileIndex(
sparkSession: SparkSession,
rootPathsSpecified: Seq[Path],
parameters: Map[String, String],
partitionSchema: Option[StructType],
fileStatusCache: FileStatusCache = NoopCache)
extends PartitioningAwareFileIndex(
sparkSession, parameters, partitionSchema, fileStatusCache) {
// Filter out streaming metadata dirs or files such as "/.../_spark_metadata" (the metadata dir)
// or "/.../_spark_metadata/0" (a file in the metadata dir). `rootPathsSpecified` might contain
// such streaming metadata dir or files, e.g. when after globbing "basePath/*" where "basePath"
// is the output of a streaming query.
override val rootPaths =
rootPathsSpecified.filterNot(FileStreamSink.ancestorIsMetadataDirectory(_, hadoopConf))
@volatile private var cachedLeafFiles: mutable.LinkedHashMap[Path, FileStatus] = _
@volatile private var cachedLeafDirToChildrenFiles: Map[Path, Array[FileStatus]] = _
@volatile private var cachedPartitionSpec: PartitionSpec = _
// refresh0() is executed when this class is initialized
refresh0()
....
....
....
When the class is initialized it executes the refresh0 method:
private def refresh0(): Unit = {
// listLeafFiles is called here
val files = listLeafFiles(rootPaths)
cachedLeafFiles =
new mutable.LinkedHashMap[Path, FileStatus]() ++= files.map(f => f.getPath -> f)
cachedLeafDirToChildrenFiles = files.toArray.groupBy(_.getPath.getParent)
cachedPartitionSpec = null
}
refresh0 in turn calls the listLeafFiles(rootPaths) method:
/**
* List leaf files of given paths. This method will submit a Spark job to do parallel
* listing whenever there is a path having more files than the parallel partition discovery
* discovery threshold.
*
* This is publicly visible for testing.
*/
def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
val output = mutable.LinkedHashSet[FileStatus]()
val pathsToFetch = mutable.ArrayBuffer[Path]()
for (path <- paths) {
fileStatusCache.getLeafFiles(path) match {
case Some(files) =>
HiveCatalogMetrics.incrementFileCacheHits(files.length)
output ++= files
case None =>
pathsToFetch += path
}
Unit // for some reasons scalac 2.12 needs this; return type doesn't matter
}
val filter = FileInputFormat.getInputPathFilter(new JobConf(hadoopConf, this.getClass))
// this is where bulkListLeafFiles gets called
val discovered = InMemoryFileIndex.bulkListLeafFiles(
pathsToFetch, hadoopConf, filter, sparkSession)
discovered.foreach { case (path, leafFiles) =>
HiveCatalogMetrics.incrementFilesDiscovered(leafFiles.size)
fileStatusCache.putLeafFiles(path, leafFiles.toArray)
output ++= leafFiles
}
output
}
}
This in turn calls the InMemoryFileIndex.bulkListLeafFiles method.
5. Calling InMemoryFileIndex.bulkListLeafFiles
/**
* Lists a collection of paths recursively. Picks the listing strategy adaptively depending
* on the number of paths to list.
*
* This may only be called on the driver.
*
* @return for each input path, the set of discovered files for the path
*/
private def bulkListLeafFiles(
paths: Seq[Path],
hadoopConf: Configuration,
filter: PathFilter,
sparkSession: SparkSession): Seq[(Path, Seq[FileStatus])] = {
// if there are at most 32 paths (the default of parallelPartitionDiscoveryThreshold), the listing is done right here and returned;
// otherwise a dedicated job is launched to discover the files, since listing that many paths on the driver could take too long
// Short-circuits parallel listing when serial listing is likely to be faster.
if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
return paths.map { path =>
(path, listLeafFiles(path, hadoopConf, filter, Some(sparkSession)))
}
}
logInfo(s"Listing leaf files and directories in parallel under: ${paths.mkString(", ")}")
HiveCatalogMetrics.incrementParallelListingJobCount(1)
val sparkContext = sparkSession.sparkContext
val serializableConfiguration = new SerializableConfiguration(hadoopConf)
val serializedPaths = paths.map(_.toString)
val parallelPartitionDiscoveryParallelism =
sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
// Set the number of parallelism to prevent following file listing from generating many tasks
// in case of large #defaultParallelism.
val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
val previousJobDescription = sparkContext.getLocalProperty(SparkContext.SPARK_JOB_DESCRIPTION)
val statusMap = try {
// the job description is computed here; in our case it is "Listing leaf files and directories for 100 paths: ..."
val description = paths.size match {
case 0 =>
s"Listing leaf files and directories 0 paths"
case 1 =>
s"Listing leaf files and directories for 1 path:<br/>${paths(0)}"
case s =>
s"Listing leaf files and directories for $s paths:<br/>${paths(0)}, ..."
}
// set the job description
sparkContext.setJobDescription(description)
sparkContext
.parallelize(serializedPaths, numParallelism)
.mapPartitions { pathStrings =>
val hadoopConf = serializableConfiguration.value
pathStrings.map(new Path(_)).toSeq.map { path =>
(path, listLeafFiles(path, hadoopConf, filter, None))
}.iterator
}.map { case (path, statuses) =>
val serializableStatuses = statuses.map { status =>
// Turn FileStatus into SerializableFileStatus so we can send it back to the driver
val blockLocations = status match {
case f: LocatedFileStatus =>
f.getBlockLocations.map { loc =>
SerializableBlockLocation(
loc.getNames,
loc.getHosts,
loc.getOffset,
loc.getLength)
}
case _ =>
Array.empty[SerializableBlockLocation]
}
SerializableFileStatus(
status.getPath.toString,
status.getLen,
status.isDirectory,
status.getReplication,
status.getBlockSize,
status.getModificationTime,
status.getAccessTime,
blockLocations)
}
(path.toString, serializableStatuses)
// collect() is an action, so it triggers a job
}.collect()
} finally {
sparkContext.setJobDescription(previousJobDescription)
}
// turn SerializableFileStatus back to Status
statusMap.map { case (path, serializableStatuses) =>
val statuses = serializableStatuses.map { f =>
val blockLocations = f.blockLocations.map { loc =>
new BlockLocation(loc.names, loc.hosts, loc.offset, loc.length)
}
new LocatedFileStatus(
new FileStatus(
f.length, f.isDir, f.blockReplication, f.blockSize, f.modificationTime,
new Path(f.path)),
blockLocations)
}
(new Path(path), statuses)
}
}
The code below consists of excerpts from the InMemoryFileIndex.bulkListLeafFiles method above, analyzed piece by piece.
1. paths.size Decides Whether a Job Is Generated
// Short-circuits parallel listing when serial listing is likely to be faster.
if (paths.size <= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
return paths.map { path =>
(path, listLeafFiles(path, hadoopConf, filter, Some(sparkSession)))
}
}
This snippet checks how many top-level paths were passed in; in our case that is the number of paths matching hdfs:///user/daily/20200828/*.parquet. At this point Spark does not treat a matched path as a file; it treats each one as a directory, because Spark supports discovering nested directories. If there are many of them, listing them all on the driver could take a long time, so once the number of paths exceeds 32 Spark generates a job and ships the listing to the YARN cluster, where multiple executors search the paths in parallel.
The value of sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold comes from the following code:
def parallelPartitionDiscoveryThreshold: Int =
getConf(SQLConf.PARALLEL_PARTITION_DISCOVERY_THRESHOLD)
val PARALLEL_PARTITION_DISCOVERY_THRESHOLD =
buildConf("spark.sql.sources.parallelPartitionDiscovery.threshold")
.doc("The maximum number of paths allowed for listing files at driver side. If the number " +
"of detected paths exceeds this value during partition discovery, it tries to list the " +
"files with another Spark distributed job. This applies to Parquet, ORC, CSV, JSON and " +
"LibSVM data sources.")
.intConf
.checkValue(parallel => parallel >= 0, "The maximum number of paths allowed for listing " +
"files at driver side must not be negative")
.createWithDefault(32)
Because hdfs:///user/daily/20200828/*.parquet matches 100 files, the if above is not satisfied, so execution continues downward and job0 is generated to discover the files.
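As a side note, if the extra listing job is undesirable, the threshold can be raised so that the listing stays on the driver. A minimal sketch, assuming the same 100-file directory (the value 200 is just an example; whether this is a good idea depends on how long a serial listing of all paths takes on the driver):
import org.apache.spark.sql.SparkSession

object ListingThresholdDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("listing_threshold_demo")
      // raise the threshold above the number of matched paths (100 here),
      // so bulkListLeafFiles takes the short-circuit branch and no listing job is launched
      .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "200")
      .getOrCreate()

    val df = spark.read.parquet("hdfs:///user/daily/20200828/*.parquet")
    println(df.count())
  }
}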
Note that when the path count is at most 32, the method called is listLeafFiles(path, hadoopConf, filter, Some(sparkSession)), not the listLeafFiles(paths: Seq[Path]) shown earlier:
/**
* Lists a single filesystem path recursively. If a SparkSession object is specified, this
* function may launch Spark jobs to parallelize listing.
*
* If sessionOpt is None, this may be called on executors.
*
* @return all children of path that match the specified filter.
*/
private def listLeafFiles(
path: Path,
hadoopConf: Configuration,
filter: PathFilter,
sessionOpt: Option[SparkSession]): Seq[FileStatus] = {
....
}
The method body is omitted here. As its doc comment says, it "Lists a single filesystem path recursively", i.e. it recursively searches for files under one path. In other words, when the number of top-level paths is at most 32, each path is searched recursively on the driver. Note that this method also belongs to InMemoryFileIndex, but it is not the same method as the
def listLeafFiles(paths: Seq[Path]): mutable.LinkedHashSet[FileStatus] = {
...
}
shown earlier.
2. The list-files job0
Since the if above is not taken, the code that follows is what generates job0. There is a fair amount of it, so we will break it apart below; the inline comments also explain the details.
val sparkContext = sparkSession.sparkContext
val parallelPartitionDiscoveryParallelism =
sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
// Set the number of parallelism to prevent following file listing from generating many tasks
// in case of large #defaultParallelism.
val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
val previousJobDescription = sparkContext.getLocalProperty(SparkContext.SPARK_JOB_DESCRIPTION)
val statusMap = try {
// the job description is computed here; in our case it is "Listing leaf files and directories for 100 paths: ..."
val description = paths.size match {
case 0 =>
s"Listing leaf files and directories 0 paths"
case 1 =>
s"Listing leaf files and directories for 1 path:<br/>${paths(0)}"
case s =>
s"Listing leaf files and directories for $s paths:<br/>${paths(0)}, ..."
}
// set the job description
sparkContext.setJobDescription(description)
sparkContext
.parallelize(serializedPaths, numParallelism)
.mapPartitions { pathStrings =>
val hadoopConf = serializableConfiguration.value
pathStrings.map(new Path(_)).toSeq.map { path =>
(path, listLeafFiles(path, hadoopConf, filter, None))
}.iterator
}.map { case (path, statuses) =>
val serializableStatuses = statuses.map { status =>
// Turn FileStatus into SerializableFileStatus so we can send it back to the driver
val blockLocations = status match {
case f: LocatedFileStatus =>
f.getBlockLocations.map { loc =>
SerializableBlockLocation(
loc.getNames,
loc.getHosts,
loc.getOffset,
loc.getLength)
}
case _ =>
Array.empty[SerializableBlockLocation]
}
SerializableFileStatus(
status.getPath.toString,
status.getLen,
status.isDirectory,
status.getReplication,
status.getBlockSize,
status.getModificationTime,
status.getAccessTime,
blockLocations)
}
(path.toString, serializableStatuses)
// collect() is an action, so it triggers a job
}.collect()
1. Setting the Job Description
In bulkListLeafFiles() the job description is set to:
val description = paths.size match {
case 0 =>
s"Listing leaf files and directories 0 paths"
case 1 =>
s"Listing leaf files and directories for 1 path:<br/>${paths(0)}"
case s =>
s"Listing leaf files and directories for $s paths:<br/>${paths(0)}, ..."
}
sparkContext.setJobDescription(description)
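setJobDescription is a public SparkContext API, and the string set here is exactly what shows up in the Description column of the Jobs page in the Spark UI, which is why job0 reads "Listing leaf files and directories for 100 paths: ...". A small sketch of using it from user code (the description text and object name are just for illustration):
import org.apache.spark.sql.SparkSession

object JobDescriptionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("job_description_demo").getOrCreate()
    val sc = spark.sparkContext

    // this string appears as the job's Description on the Jobs page of the Spark UI
    sc.setJobDescription("counting demo numbers")
    println(sc.parallelize(1 to 100, 4).count())

    // clear it again so later jobs fall back to their default call-site description
    sc.setJobDescription(null)
  }
}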
2. Creating and Running the Job
val parallelPartitionDiscoveryParallelism = sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism
val numParallelism = Math.min(paths.size, parallelPartitionDiscoveryParallelism)
sparkContext.setJobDescription(description)
sparkContext
.parallelize(serializedPaths, numParallelism)
.mapPartitions { pathStrings =>
val hadoopConf = serializableConfiguration.value
pathStrings.map(new Path(_)).toSeq.map { path =>
(path, listLeafFiles(path, hadoopConf, filter, None))
}.iterator
}.map { case (path, statuses) =>
val serializableStatuses = statuses.map { status =>
// Turn FileStatus into SerializableFileStatus so we can send it back to the driver
val blockLocations = status match {
case f: LocatedFileStatus =>
f.getBlockLocations.map { loc =>
SerializableBlockLocation(
loc.getNames,
loc.getHosts,
loc.getOffset,
loc.getLength)
}
case _ =>
Array.empty[SerializableBlockLocation]
}
SerializableFileStatus(
status.getPath.toString,
status.getLen,
status.isDirectory,
status.getReplication,
status.getBlockSize,
status.getModificationTime,
status.getAccessTime,
blockLocations)
}
(path.toString, serializableStatuses)
// collect() is an action, so it triggers a job
}.collect()
Here we can see that the parallelism is set to Math.min(paths.size, parallelPartitionDiscoveryParallelism).
Debugging shows that sparkSession.sessionState.conf.parallelPartitionDiscoveryParallelism defaults to 10000,
so numParallelism = paths.size = 100 (there are 100 parquet files in the directory), which is why job0 has 100 partitions.
Moreover, what this parallel job ultimately does is recursively collect the block information of every file, as can be seen from this snippet:
mapPartitions { pathStrings =>
val hadoopConf = serializableConfiguration.value
pathStrings.map(new Path(_)).toSeq.map { path =>
(path, listLeafFiles(path, hadoopConf, filter, None))
}.iterator
}
The listLeafFiles(path, hadoopConf, filter, None) it invokes is defined to recursively find all files under a single path.
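To make "recursively list leaf files" concrete, here is a simplified sketch of the idea using the plain Hadoop FileSystem API (illustrative only; the real listLeafFiles additionally applies the PathFilter, filters out _spark_metadata entries, fetches block locations, and can itself fan out to parallel listing):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object LeafFileListingSketch {
  // Recursively collect all leaf (non-directory) FileStatus entries under `path`
  def listLeafFiles(path: Path, hadoopConf: Configuration): Seq[FileStatus] = {
    val fs = path.getFileSystem(hadoopConf)
    fs.listStatus(path).toSeq.flatMap { status =>
      if (status.isDirectory) listLeafFiles(status.getPath, hadoopConf)
      else Seq(status)
    }
  }

  def main(args: Array[String]): Unit = {
    // pass a directory such as hdfs://hadoop-01:9000/user/daily/20200828 as the first argument
    val files = listLeafFiles(new Path(args(0)), new Configuration())
    files.foreach(f => println(s"${f.getPath} (${f.getLen} bytes)"))
  }
}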
5. Call Chain Summary
DataFrameReader.load()
DataFrameReader.loadV1Source()
DataSource.resolveRelation()
DataSource.getOrInferFileFormatSchema()
new InMemoryFileIndex(sparkSession, globbedPaths, options, None, fileStatusCache)
InMemoryFileIndex.refresh0()
InMemoryFileIndex.listLeafFiles()
InMemoryFileIndex.bulkListLeafFiles()