Spark's default storage level and Spark cache levels

Reposted

人类新新 2024-01-26 22:19:37


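One of Spark's biggest advantages over Hadoop is that it can cache data in memory so that later computations can reuse it. This article walks through the code that implements this.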

We can cache the data of an RDD with rdd.persist() or rdd.cache(); cache() is implemented simply by calling persist(). persist() supports the following storage levels:

    // Constructor arguments: (useDisk, useMemory, useOffHeap, deserialized, replication = 1)
    val NONE = new StorageLevel(false, false, false, false)
    val DISK_ONLY = new StorageLevel(true, false, false, false)
    val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
    val MEMORY_ONLY = new StorageLevel(false, true, false, true)
    val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
    val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
    val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
    val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
    val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
    val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
    val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
    val OFF_HEAP = new StorageLevel(false, false, true, false)

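Calling cache(), or persist() with no arguments, is equivalent to persist(StorageLevel.MEMORY_ONLY), which is the default cache level. You can choose a different level to suit your needs; the meaning of each level is not covered here, see the official documentation.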
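As a quick illustration of the calls described above, here is a minimal sketch of caching an RDD with the default level and with an explicit one. The object name, the helper `cachedSquares`, the application name, and the `local[2]` master are assumptions made for a standalone demo; only `cache()`, `persist()`, `getStorageLevel`, and `unpersist()` come from the RDD API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistExample {
  // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
  def cachedSquares(sc: SparkContext): RDD[Long] =
    sc.parallelize(1 to 1000).map(x => x.toLong * x).cache()

  def main(args: Array[String]): Unit = {
    // App name and local master are assumptions for this demo.
    val conf = new SparkConf().setAppName("persist-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val squares = cachedSquares(sc)
      val isDefault = squares.getStorageLevel == StorageLevel.MEMORY_ONLY
      println(s"cache() uses MEMORY_ONLY: $isDefault")

      // Marking an RDD only records the level; the first action
      // computes the partitions and stores them.
      val total = squares.reduce(_ + _)
      // Later actions can read the cached partitions instead of recomputing.
      val n = squares.count()
      println(s"sum = $total, count = $n")

      // An explicit level: serialized in memory, spilling to disk if needed.
      val evens = squares.filter(_ % 2 == 0).persist(StorageLevel.MEMORY_AND_DISK_SER)
      val level = evens.getStorageLevel
      println(s"useMemory = ${level.useMemory}, useDisk = ${level.useDisk}")
      evens.count()

      // Release the cached blocks when they are no longer needed.
      evens.unpersist()
      squares.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Running this requires Spark on the classpath. The explicit level is set on a separate RDD (`evens`) because an RDD's level cannot be changed once assigned.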

Calling rdd.persist() to cache the data of an RDD always ends up in the following code:

    private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
      // TODO: Handle changes of StorageLevel
      if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
        throw new UnsupportedOperationException(
          "Cannot change storage level of an RDD after it was already assigned a level")
      }
      // If this is the first time this RDD is marked for persisting, register it
      // with the SparkContext for cleanups and accounting. Do this only once.
      if (storageLevel == StorageLevel.NONE) {
        sc.cleaner.foreach(_.registerRDDForCleanup(this))
        sc.persistRDD(this)
      }
      storageLevel = newLevel
      this
    }
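The guard in this method can be tried out in isolation with a small plain-Scala model. This is only a sketch: `Level`, `ToyRdd`, and the `registrations` counter are hypothetical stand-ins, not Spark classes, and the model keeps nothing but the storage-level bookkeeping.

```scala
object StorageLevelGuard {
  // Hypothetical stand-in for Spark's StorageLevel constants.
  sealed trait Level
  case object NoLevel extends Level
  case object MemoryOnly extends Level
  case object MemoryAndDisk extends Level

  // Hypothetical stand-in for an RDD that tracks only its storage level.
  final class ToyRdd {
    private var level: Level = NoLevel
    // Counts how often the "first time marked for persisting" branch ran.
    var registrations: Int = 0

    def storageLevel: Level = level

    def persist(newLevel: Level, allowOverride: Boolean = false): this.type = {
      // A level that is already assigned may only be replaced with an explicit override.
      if (level != NoLevel && newLevel != level && !allowOverride) {
        throw new UnsupportedOperationException(
          "Cannot change storage level of an RDD after it was already assigned a level")
      }
      // Registration happens once, on the first transition away from NoLevel.
      if (level == NoLevel) registrations += 1
      level = newLevel
      this
    }
  }
}
```

In this model, persisting twice with the same level is a no-op, while asking for a different level without `allowOverride = true` throws `UnsupportedOperationException`.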

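persist() itself only records the storage level; the data is cached when a task actually runs. Task execution goes through org.apache.spark.scheduler.Task#run, which ends up calling the runTask method of ResultTask or ShuffleMapTask. runTask in turn calls org.apache.spark.rdd.RDD#iterator, whose code is as follows: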

    final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
      if (storageLevel != StorageLevel.NONE) {
        SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
      } else {
        computeOrReadCheckpoint(split, context)
      }
    }

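If the current RDD has a storage level set (that is, one set through rdd.persist() as shown above), the cacheManager checks whether cached data exists for the partition. If it does, the data is returned directly; if not, it is computed. The code of getOrCompute is as follows: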

    def getOrCompute[T](
        rdd: RDD[T],
        partition: Partition,
        context: TaskContext,
        storageLevel: StorageLevel): Iterator[T] = {

      val key = RDDBlockId(rdd.id, partition.index)
      logDebug(s"Looking for partition $key")
      blockManager.get(key) match {
        case Some(blockResult) =>
          // Partition is already materialized, so just return its values
          val existingMetrics = context.taskMetrics
            .getInputMetricsForReadMethod(blockResult.readMethod)
          existingMetrics.incBytesRead(blockResult.bytes)

          val iter = blockResult.data.asInstanceOf[Iterator[T]]
          new InterruptibleIterator[T](context, iter) {
            override def next(): T = {
              existingMetrics.incRecordsRead(1)
              delegate.next()
            }
          }
        case None =>
          // Acquire a lock for loading this partition
          // If another thread already holds the lock, wait for it to finish and return its results
          val storedValues = acquireLockForPartition[T](key)
          if (storedValues.isDefined) {
            return new InterruptibleIterator[T](context, storedValues.get)
          }

          // Otherwise, we have to load the partition ourselves
          try {
            logInfo(s"Partition $key not found, computing it")
            val computedValues = rdd.computeOrReadCheckpoint(partition, context)

            // If the task is running locally, do not persist the result
            if (context.isRunningLocally) {
              return computedValues
            }

            // Otherwise, cache the values and keep track of any updates in block statuses
            val updatedBlocks = new ArrayBuffer[(BlockId, BlockStatus)]
            val cachedValues = putInBlockManager(key, computedValues, storageLevel, updatedBlocks)
            val metrics = context.taskMetrics
            val lastUpdatedBlocks = metrics.updatedBlocks.getOrElse(Seq[(BlockId, BlockStatus)]())
            metrics.updatedBlocks = Some(lastUpdatedBlocks ++ updatedBlocks.toSeq)
            new InterruptibleIterator(context, cachedValues)
          } finally {
            loading.synchronized {
              loading.remove(key)
              loading.notifyAll()
            }
          }
      }
    }

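If the partition is not in the cache and the lock is acquired, rdd.computeOrReadCheckpoint(partition, context) computes the data of the current partition, and the computed data is then put into the BlockManager. If other threads are waiting for this partition to be computed, they must be notified once the computation has finished (loading.notifyAll()).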
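Stripped of metrics and locking, the get-or-compute flow described here can be sketched in plain Scala. `BlockKey`, `ToyCache`, and `computeCalls` are hypothetical names for illustration; the mutable map merely stands in for the BlockManager, and the sketch is single-threaded.

```scala
import scala.collection.mutable

object GetOrComputeSketch {
  // Hypothetical key, mirroring RDDBlockId(rdd.id, partition.index).
  final case class BlockKey(rddId: Int, partition: Int)

  // Hypothetical in-memory stand-in for the BlockManager (not thread-safe).
  final class ToyCache {
    private val blocks = mutable.Map.empty[BlockKey, Vector[Any]]
    // Counts how many times a partition had to be computed.
    var computeCalls: Int = 0

    def getOrCompute[T](key: BlockKey)(compute: => Iterator[T]): Iterator[T] =
      blocks.get(key) match {
        case Some(values) =>
          // Partition is already materialized: serve it from the store.
          values.iterator.asInstanceOf[Iterator[T]]
        case None =>
          // Cache miss: compute, store the materialized values, then return them.
          computeCalls += 1
          val values = compute.toVector
          blocks(key) = values
          values.iterator
      }
  }
}
```

Calling `getOrCompute` twice with the same key evaluates the `compute` argument only once; the second call iterates over the stored values.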

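If acquiring the lock fails, another thread is already computing the data of this partition, so we have to wait (loading.wait()). The code that acquires the lock is as follows: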

    private def acquireLockForPartition[T](id: RDDBlockId): Option[Iterator[T]] = {
      loading.synchronized {
        if (!loading.contains(id)) {
          // If the partition is free, acquire its lock to compute its value
          loading.add(id)
          None
        } else {
          // Otherwise, wait for another thread to finish and return its result
          logInfo(s"Another thread is loading $id, waiting for it to finish...")
          while (loading.contains(id)) {
            try {
              loading.wait()
            } catch {
              case e: Exception =>
                logWarning(s"Exception while waiting for another thread to load $id", e)
            }
          }
          logInfo(s"Finished waiting for $id")
          val values = blockManager.get(id)
          if (!values.isDefined) {
            /* The block is not guaranteed to exist even after the other thread has finished.
             * For instance, the block could be evicted after it was put, but before our get.
             * In this case, we still need to load the partition ourselves. */
            logInfo(s"Whoever was loading $id failed; we'll try it ourselves")
            loading.add(id)
          }
          values.map(_.data.asInstanceOf[Iterator[T]])
        }
      }
    }
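The wait/notify pattern used here can be reproduced in a small plain-Scala sketch. `PartitionLoader`, `computeCount`, and the internal map are hypothetical; the map stands in for the BlockManager. As a simplification, this version always looks in the store while holding the lock, whereas the Spark code above checks blockManager.get before taking the lock.

```scala
import scala.collection.mutable

object LoadingLockSketch {
  // Hypothetical loader: at most one thread computes a given key at a time.
  final class PartitionLoader[K, V] {
    // Keys currently being computed; also serves as the monitor object.
    private val loading = mutable.HashSet.empty[K]
    // Stand-in for the block store; only accessed while holding the monitor.
    private val store = mutable.HashMap.empty[K, V]
    private var computations = 0

    def computeCount: Int = loading.synchronized(computations)

    // Returns the stored value if present; otherwise claims the key for this thread.
    private def acquireLock(key: K): Option[V] = loading.synchronized {
      // Wait until no other thread is loading this key.
      while (loading.contains(key)) loading.wait()
      val existing = store.get(key)
      // Nothing stored (first loader, or the previous loader failed): claim the key.
      if (existing.isEmpty) loading.add(key)
      existing
    }

    def getOrCompute(key: K)(compute: => V): V =
      acquireLock(key) match {
        case Some(value) => value
        case None =>
          try {
            val value = compute
            loading.synchronized {
              store.put(key, value)
              computations += 1
            }
            value
          } finally {
            // Always release the key and wake up waiting threads, even on failure.
            loading.synchronized {
              loading.remove(key)
              loading.notifyAll()
            }
          }
      }
  }
}
```

When several threads request the same key concurrently, one of them runs `compute` while the others block in `wait()` and pick up the stored value after `notifyAll()`.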