1. How Spark reads files
Reading a txt file from the local file system:
// The path can point to a file or a directory, and may also contain wildcards
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/"
val rdd1 = sc.textFile(path, 2)
Reading a file from HDFS: sc.textFile("hdfs://s1:8020/user/hdfs/input")
textFile()
Read a text file from HDFS, a local file system (available on all nodes),
or any Hadoop-supported file system URI, and return it as an RDD of Strings.
So when reading files with textFile(), what exactly determines the partitioning? How many partitions are there, and how big is each one?
The key and value types of the RDD returned by textFile() are decided by the InputFormat. From the arguments passed to hadoopFile() we can see that the value is of type Text, i.e. the lines of the text file. What, then, does the key type LongWritable mean?
It is the byte offset of the current record (within each split) in the file, i.e. the starting offset of each partition.
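To see these (key, value) pairs directly, one can call hadoopFile() instead of textFile(). A minimal sketch, assuming a running SparkContext sc and a hypothetical local file /tmp/sample.txt:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val pairs = sc.hadoopFile[LongWritable, Text](
  "file:///tmp/sample.txt", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])
// Writables are not serializable, so convert before collecting;
// the key is the byte offset of each line within the file
pairs.map { case (offset, line) => (offset.get, line.toString) }
  .take(3)
  .foreach { case (off, l) => println(s"$off -> $l") }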
The main logic is as follows (org.apache.spark.SparkContext.scala):
textFile() takes the path of a text/HDFS file plus a suggested minimum number of partitions. The default is defaultMinPartitions = math.min(defaultParallelism, 2); the actual partition count is often larger, e.g. when there are very many or very large files, and can occasionally be smaller.
It returns an 'RDD of lines of the text file': the (key, value) pairs produced by hadoopFile() are map()-ed to value.toString, so only the text lines are kept.
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
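As a quick check of the minPartitions hint (a sketch, assuming a SparkContext sc and a hypothetical large, splittable file):
// minPartitions is a lower-bound suggestion passed to getSplits(), not an exact count
val rdd = sc.textFile("file:///tmp/big.txt", 4)
println(rdd.getNumPartitions) // typically >= 4 for a large, splittable file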
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()

  // Hadoop's FileSystem, from org.apache.hadoop.fs.FileSystem
  // This is a hack to enforce loading hdfs-site.xml.
  FileSystem.getLocal(hadoopConfiguration)

  // A Hadoop configuration can be around 10 KB, which is fairly large,
  // so it is distributed as a broadcast variable.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}
HadoopRDD extends RDD and uses the old MapReduce API (org.apache.hadoop.mapred). It provides the core functionality for reading data stored in Hadoop, overriding the getPartitions/compute/getPreferredLocations/persist methods.
See org.apache.spark.rdd.HadoopRDD.scala.
HadoopRDD's getPartitions() method
The input data is split into N pieces, each of which maps to one Partition of the RDD. The number of partitions determines the number of tasks, and thus the parallelism of the job.
getPartitions() shows how Spark partitions are decided. Comments added:
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  try {
    // Decide the number of partitions for the input (allInputSplits) from the
    // minPartitions passed into HadoopRDD and the jobConf.
    // getInputFormat(): for textFile() this is the TextInputFormat passed in
    // earlier via hadoopFile(path, classOf[TextInputFormat], ...); it extends
    // Hadoop's FileInputFormat.
    // getSplits() first computes a suitable splitSize from the JobConf and the
    // blockSize of the file system referenced by it:
    //   goalSize  = totalSize / numSplits (the minPartitions hint),
    //   splitSize = Math.max(minSize, Math.min(goalSize, blockSize)),
    // so by default splitSize is capped at the file system's default blockSize
    // (128 MB). The input files are then cut into pieces of splitSize, and the
    // partition count is the number of resulting splits.
    // See Hadoop's FileInputFormat.getSplits() for reference.
    val allInputSplits = getInputFormat(jobConf).getSplits(jobConf, minPartitions)
    val inputSplits = if (ignoreEmptySplits) {
      allInputSplits.filter(_.getLength > 0)
    } else {
      allInputSplits
    }
    // Create an Array[Partition] with one entry per split, each holding a
    // HadoopPartition: an inner class of HadoopRDD that wraps around a Hadoop
    // InputSplit. Finally, return this array.
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      // A HadoopPartition stores (rddId, index, InputSplit)
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  } catch {
    case e: InvalidInputException if ignoreMissingFiles =>
      logWarning(s"${jobConf.get(FileInputFormat.INPUT_DIR)} doesn't exist and no" +
        s" partitions returned from this path.", e)
      Array.empty[Partition]
  }
}
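A back-of-the-envelope check of the split math above, in plain Scala with assumed defaults (minSize = 1 byte, blockSize = 128 MB) and a hypothetical single 1 GB input file:
val totalSize = 1024L * 1024 * 1024                               // one 1 GB input file
val numSplitsHint = 2                                             // the minPartitions hint
val minSize = 1L
val blockSize = 128L * 1024 * 1024

val goalSize = totalSize / numSplitsHint                          // 512 MB
val splitSize = math.max(minSize, math.min(goalSize, blockSize))  // capped at 128 MB
val numSplits = math.ceil(totalSize.toDouble / splitSize).toLong  // 8 partitions
println(s"splitSize=$splitSize bytes, numSplits=$numSplits")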
HadoopRDD's getPreferredLocations() method
From the InputSplit held by the HadoopPartition, it extracts the URIs of the nodes where the partition's data resides.
override def getPreferredLocations(split: Partition): Seq[String] = {
  val hsplit = split.asInstanceOf[HadoopPartition].inputSplit.value
  val locs = hsplit match {
    case lsplit: InputSplitWithLocationInfo =>
      HadoopRDD.convertSplitLocationInfo(lsplit.getLocationInfo)
    case _ => None
  }
  locs.getOrElse(hsplit.getLocations.filter(_ != "localhost"))
}
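From the driver side one can peek at what this computes per split through the public RDD.preferredLocations() (a sketch, reusing the HDFS path from earlier and assuming a SparkContext sc):
val rdd = sc.textFile("hdfs://s1:8020/user/hdfs/input")
rdd.partitions.foreach { p =>
  // surfaces the node URIs computed by getPreferredLocations() for each split
  println(s"partition ${p.index} -> ${rdd.preferredLocations(p)}")
}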
HadoopRDD's compute() method
This is a classic iterator pattern: the iterator encapsulates the details of traversal.
It reminds me of leveldb's two-level iterator design, which I studied back in 2013; beautifully written. In leveldb's TwoLevelIterator, the level-1 iterator points at a container, and only the level-2 iterator points at the actual elements. It can iterate not just over the stored sstable objects; it also accepts a function, BlockFunction, which lets it traverse the Block data stored inside. This is similar to deque in the C++ STL: each element of the deque's map is a pointer to a larger contiguous buffer, and cur/first/last pointers over those segments make the whole structure look contiguous.
Since HadoopRDD already has its inputSplit, it apparently has no need for a two-level iterator.
override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
  val iter = new NextIterator[(K, V)] {
    ...
  }
  new InterruptibleIterator[(K, V)](context, iter)
}
1. Get the file information from the HadoopPartition and update the global InputFileBlockHolder;
2. Define a few internal functions (updateBytesRead and friends);
3. Using the jobConf and inputSplit, call getRecordReader() on getInputFormat(jobConf) (i.e. TextInputFormat) to obtain a RecordReader[K, V]:
reader =
  try {
    inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
  } catch {
    ...
  }
4. With the RecordReader in hand, register an on-task-completion callback that closes the input stream;
5. Override getNext() to delegate to the RecordReader's next(key, value), and override close();
6. Finally, wrap the iterator together with the task context and return it (see the sketch below).
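The getNext()/close() pair is the heart of it. A minimal sketch of this NextIterator-style loop, written as a plain Iterator (simplified: the real compute() also updates metrics and registers the completion callback; reader: RecordReader[K, V] is assumed in scope):
val key = reader.createKey()
val value = reader.createValue()
val iter = new Iterator[(K, V)] {
  private var finished = false
  private var gotNext = false

  private def fetchNext(): Unit = {
    finished = !reader.next(key, value) // next() returns false at end of split
    if (finished) reader.close()        // close the input stream exactly once
    gotNext = true
  }

  override def hasNext: Boolean = {
    if (!gotNext) fetchNext()
    !finished
  }

  override def next(): (K, V) = {
    if (!hasNext) throw new NoSuchElementException("End of stream")
    gotNext = false
    (key, value) // note: Hadoop reuses these objects between calls
  }
}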
2. How an RDD writes files
saveAsTextFile() is defined in RDD.scala and implemented via saveAsHadoopFile(). It uses the mapPartitions operator to convert each partition's records to Text, rddToPairRDDFunctions turns the result into an RDD of (NullWritable, Text) pairs, and then PairRDDFunctions.saveAsHadoopFile() writes the files:
RDD.rddToPairRDDFunctions(r)(nullWritableClassTag, textClassTag, null)
  .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]](path)
For the source, see PairRDDFunctions.scala under org.apache.spark.rdd.
/**
* Output the RDD to any Hadoop-supported file system, using a Hadoop `OutputFormat` class
* supporting the key and value types K and V in this RDD.
*
* @note We should make sure our tasks are idempotent when speculation is enabled, i.e.
* do not use output committer that writes data directly.
* There is an example in https://issues.apache.org/jira/browse/SPARK-10063 to show the bad
* result of using direct output committer with speculation enabled.
*/
def saveAsHadoopFile(
    path: String,
    keyClass: Class[_],
    valueClass: Class[_],
    outputFormatClass: Class[_ <: OutputFormat[_, _]],
    conf: JobConf = new JobConf(self.context.hadoopConfiguration),
    codec: Option[Class[_ <: CompressionCodec]] = None): Unit = self.withScope {
  // Rename this as hadoopConf internally to avoid shadowing (see SPARK-2038).
  val hadoopConf = conf
  hadoopConf.setOutputKeyClass(keyClass)
  hadoopConf.setOutputValueClass(valueClass)
  conf.setOutputFormat(outputFormatClass)
  for (c <- codec) {
    hadoopConf.setCompressMapOutput(true)
    hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true")
    hadoopConf.setMapOutputCompressorClass(c)
    hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec", c.getCanonicalName)
    hadoopConf.set("mapreduce.output.fileoutputformat.compress.type",
      CompressionType.BLOCK.toString)
  }

  // Use configured output committer if already set
  if (conf.getOutputCommitter == null) {
    hadoopConf.setOutputCommitter(classOf[FileOutputCommitter])
  }

  // When speculation is on and output committer class name contains "Direct", we should warn
  // users that they may lose data if they are using a direct output committer.
  val speculationEnabled = self.conf.getBoolean("spark.speculation", false)
  val outputCommitterClass = hadoopConf.get("mapred.output.committer.class", "")
  if (speculationEnabled && outputCommitterClass.contains("Direct")) {
    val warningMessage =
      s"$outputCommitterClass may be an output committer that writes data directly to " +
        "the final location. Because speculation is enabled, this output committer may " +
        "cause data loss (see the case in SPARK-10063). If possible, please use an output " +
        "committer that does not have this behavior (e.g. FileOutputCommitter)."
    logWarning(warningMessage)
  }

  FileOutputFormat.setOutputPath(hadoopConf,
    SparkHadoopWriterUtils.createPathFromString(path, hadoopConf))
  saveAsHadoopDataset(hadoopConf)
}
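Putting it together, a minimal usage sketch (assuming a SparkContext sc; the output paths are hypothetical):
val rdd = sc.parallelize(Seq("a", "b", "c"), 2)
// writes one part-NNNNN file per partition through the saveAsHadoopFile() above
rdd.saveAsTextFile("hdfs://s1:8020/user/hdfs/output")

// compressed output goes through the codec branch shown above:
// import org.apache.hadoop.io.compress.GzipCodec
// rdd.saveAsTextFile("hdfs://s1:8020/user/hdfs/output-gz", classOf[GzipCodec])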