1. How Spark reads files
Reading a txt file from the local file system:
// The path can point to a file or a directory, and may also contain wildcards
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/"
val rdd1 = sc.textFile(path, 2)
Reading a file from HDFS: sc.textFile("hdfs://s1:8020/user/hdfs/input")
textFile()
Read a text file from HDFS, a local file system (available on all nodes),
or any Hadoop-supported file system URI, and return it as an RDD of Strings.
So when reading files with textFile(), what exactly determines the partitioning? How many partitions are there, and how big is each one?
The key and value types of the RDD returned by textFile() are decided by the InputFormat. From the arguments passed to hadoopFile() we can see that the value is of type Text, i.e. the lines of the text file. What, then, does the key type LongWritable mean?
It is the byte offset of the current record (within each split) in the file, i.e. the starting offset of each partition.
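To see these (key, value) pairs directly, one can call hadoopFile() instead of textFile(). A minimal sketch, assuming a running SparkContext sc and a hypothetical local file /tmp/sample.txt:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val pairs = sc.hadoopFile[LongWritable, Text](
  "file:///tmp/sample.txt", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])
// Writables are not serializable, so convert before collecting;
// the key is the byte offset of each line within the file
pairs.map { case (offset, line) => (offset.get, line.toString) }
  .take(3)
  .foreach { case (off, l) => println(s"$off -> $l") }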
The main logic is as follows (org.apache.spark.SparkContext.scala):
textFile() takes the path of a text/HDFS file plus a suggested minimum number of partitions. The default is defaultMinPartitions = math.min(defaultParallelism, 2); the actual partition count is often larger, e.g. when there are very many or very large files, and can occasionally be smaller.
It returns an 'RDD of lines of the text file': the (key, value) pairs produced by hadoopFile() are map()-ed to value.toString, so only the text lines are kept.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
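As a quick check of the minPartitions hint (a sketch, assuming a SparkContext sc and a hypothetical large, splittable file):
// minPartitions is a lower-bound suggestion passed to getSplits(), not an exact count
val rdd = sc.textFile("file:///tmp/big.txt", 4)
println(rdd.getNumPartitions) // typically >= 4 for a large, splittable file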
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()

  // Hadoop's FileSystem, from org.apache.hadoop.fs.FileSystem
  // This is a hack to enforce loading hdfs-site.xml.
  FileSystem.getLocal(hadoopConfiguration)

  // A Hadoop configuration can be around 10 KB, which is fairly large,
  // so it is distributed as a broadcast variable.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
HadoopRDD extends RDD and uses the old MapReduce API (org.apache.hadoop.mapred). It provides the core functionality for reading data stored in Hadoop, overriding the getPartitions/compute/getPreferredLocations/persist methods.
See org.apache.spark.rdd.HadoopRDD.scala.
HadoopRDD's getPartitions() method
The input data is split into N pieces, each of which maps to one Partition of the RDD. The number of partitions determines the number of tasks, and thus the parallelism of the job.
getPartitions() shows how Spark partitions are decided. Comments added:
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  try {
    // Decide the number of partitions for the input (allInputSplits) from the
    // minPartitions passed into HadoopRDD and the jobConf.
    // getInputFormat(): for textFile() this is the TextInputFormat passed in
    // earlier via hadoopFile(path, classOf[TextInputFormat], ...); it extends
    // Hadoop's FileInputFormat.
    // getSplits() first computes a suitable splitSize from the JobConf and the
    // blockSize of the file system referenced by it:
    //   goalSize  = totalSize / numSplits (the minPartitions hint),
    //   splitSize = Math.max(minSize, Math.min(goalSize, blockSize)),
    // so by default splitSize is capped at the file system's default blockSize
    // (128 MB). The input files are then cut into pieces of splitSize, and the
    // partition count is the number of resulting splits.
    // See Hadoop's FileInputFormat.getSplits() for reference.
    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    val inputSplits = if (ignoreEmptySplits) {
      allInputSplits.filter(_.getLength > 0)
    } else {
      allInputSplits
    }
    // Create an Array[Partition] with one entry per split, each holding a
    // HadoopPartition: an inner class of HadoopRDD that wraps around a Hadoop
    // InputSplit. Finally, return this array.
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      // A HadoopPartition stores (rddId, index, InputSplit)
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  } catch {
    case e: InvalidInputException if ignoreMissingFiles =>
      logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist and no" +
        s" partitions returned from this path.", e)
      Array.empty[Partition]
  }
}
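A back-of-the-envelope check of the split math above, in plain Scala with assumed defaults (minSize = 1 byte, blockSize = 128 MB) and a hypothetical single 1 GB input file:
val totalSize = 1024L * 1024 * 1024                               // one 1 GB input file
val numSplitsHint = 2                                             // the minPartitions hint
val minSize = 1L
val blockSize = 128L * 1024 * 1024

val goalSize = totalSize / numSplitsHint                          // 512 MB
val splitSize = math.max(minSize, math.min(goalSize, blockSize))  // capped at 128 MB
val numSplits = math.ceil(totalSize.toDouble / splitSize).toLong  // 8 partitions
println(s"splitSize=$splitSize bytes, numSplits=$numSplits")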
HadoopRDD's getPreferredLocations() method
From the InputSplit held by the HadoopPartition, it extracts the URIs of the nodes where the partition's data resides.
override def getPreferredLocations(split: Partition): Seq[String] = {
  val hsplit = split.asInstanceOf[HadoopPartition].inputSplit.value
  val locs = hsplit match {
    case lsplit: InputSplitWithLocationInfo =>
      HadoopRDD.convertSplitLocationInfo(lsplit.getLocationInfo)
    case _ => None
  }
  locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
}
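From the driver side one can peek at what this computes per split through the public RDD.preferredLocations() (a sketch, reusing the HDFS path from earlier and assuming a SparkContext sc):
val rdd = sc.textFile("hdfs://s1:8020/user/hdfs/input")
rdd.partitions.foreach { p =>
  // surfaces the node URIs computed by getPreferredLocations() for each split
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p)}")
}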
HadoopRDD's compute() method
This is a classic iterator pattern: the iterator encapsulates the details of traversal.
It reminds me of leveldb's two-level iterator design, which I studied back in 2013; beautifully written. In leveldb's TwoLevelIterator, the level-1 iterator points at a container, and only the level-2 iterator points at the actual elements. It can iterate not just over the stored sstable objects; it also accepts a function, BlockFunction, which lets it traverse the Block data stored inside. This is similar to deque in the C++ STL: each element of the deque's map is a pointer to a larger contiguous buffer, and cur/first/last pointers over those segments make the whole structure look contiguous.
Since HadoopRDD already has its inputSplit, it apparently has no need for a two-level iterator.
override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
  val iter = new NextIterator[(K, V)] {
    ...
  }
  new InterruptibleIterator[(K, V)](context, iter)
}
1. Get the file information from the HadoopPartition and update the global InputFileBlockHolder;
2. Define a few internal functions (updateBytesRead and friends);
3. Using the jobConf and inputSplit, call getRecordReader() on getInputFormat(jobConf) (i.e. TextInputFormat) to obtain a RecordReader[K, V]:
reader =
  try {
    inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
  } catch {
    ...
  }
4. With the RecordReader in hand, register an on-task-completion callback that closes the input stream;
5. Override getNext() to delegate to the RecordReader's next(key, value), and override close();
6. Finally, wrap the iterator together with the task context and return it (see the sketch below).
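The getNext()/close() pair is the heart of it. A minimal sketch of this NextIterator-style loop, written as a plain Iterator (simplified: the real compute() also updates metrics and registers the completion callback; reader: RecordReader[K, V] is assumed in scope):
val key = reader.createKey()
val value = reader.createValue()
val iter = new Iterator[(K, V)] {
  private var finished = false
  private var gotNext = false

  private def fetchNext(): Unit = {
    finished = !reader.next(key, value) // next() returns false at end of split
    if (finished) reader.close()        // close the input stream exactly once
    gotNext = true
  }

  override def hasNext: Boolean = {
    if (!gotNext) fetchNext()
    !finished
  }

  override def next(): (K, V) = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    gotNext = false
    (key, value) // note: Hadoop reuses these objects between calls
  }
}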
2. How an RDD writes files
saveAsTextFile() is defined in RDD.scala and implemented via saveAsHadoopFile(). It uses the mapPartitions operator to convert each partition's records to Text, rddToPairRDDFunctions turns the result into an RDD of (NullWritable, Text) pairs, and then PairRDDFunctions.saveAsHadoopFile() writes the files:
RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
For the source, see PairRDDFunctions.scala under org.apache.spark.rdd.
/**
* Output the RDD to any Hadoop-supported file system, using a Hadoop `OutputFormat` class
* supporting the key and value types K and V in this RDD.
*
* @note We should make sure our tasks are idempotent when speculation is enabled, i.e.
* do not use output committer that writes data directly.
* There is an example in https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
* result of using direct output committer with speculation enabled.
*/
def saveAsHadoopFile(
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: OutputFormat[_, _]],
    conf: JobConf = new JobConf(self.context.hadoopConfiguration),
    codec: Option[Class[_ <: CompressionCodec]] = None): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  hadoopConf.setOutputKeyClass(keyClass)
  hadoopConf.setOutputValueClass(valueClass)
  conf.setOutputFormat(outputFormatClass)
  for (c <- codec) {
    hadoopConf.setCompressMapOutput(true)
    hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true")
    hadoopConf.setMapOutputCompressorClass(c)
    hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", c.getCanonicalName)
    hadoopConf.set("mapreduce.output.fileoutputformat.compress.type",
      CompressionType.BLOCK.toString)
  }

  // Use configured output committer if already set
  if (conf.getOutputCommitter == null) {
    hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
  }

  // When speculation is on and output committer class name contains "Direct", we should warn
  // users that they may lose data if they are using a direct output committer.
  val speculationEnabled = self.conf.getBoolean("spark.speculation", false)
  val outputCommitterClass = hadoopConf.get("mapred.output.committer.class", "")
  if (speculationEnabled && outputCommitterClass.contains("Direct")) {
    val warningMessage =
      s"$outputCommitterClass may be an output committer that writes data directly to " +
        "the final location. Because speculation is enabled, this output committer may " +
        "cause data loss (see the case in SPARK-10063). If possible, please use an output " +
        "committer that does not have this behavior (e.g. FileOutputCommitter)."
    logWarning(warningMessage)
  }

  FileOutputFormat.setOutputPath(hadoopConf,
    SparkHadoopWriterUtils.createPathFromString(path, hadoopConf))
  saveAsHadoopDataset(hadoopConf)
}
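Putting it together, a minimal usage sketch (assuming a SparkContext sc; the output paths are hypothetical):
val rdd = sc.parallelize(Seq("a", "b", "c"), 2)
// writes one part-NNNNN file per partition through the saveAsHadoopFile() above
rdd.saveAsTextFile("hdfs://s1:8020/user/hdfs/output")

// compressed output goes through the codec branch shown above:
// import org.apache.hadoop.io.compress.GzipCodec
// rdd.saveAsTextFile("hdfs://s1:8020/user/hdfs/output-gz", classOf[GzipCodec])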