unpersist
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#unpersist
Dematerializes the RDD (i.e., erases all of its data items from disk and memory). However, the RDD object itself remains; if it is referenced in a later computation, Spark regenerates it automatically from the stored dependency graph (lineage).


Listing Variants

def unpersist(blocking: Boolean = true): RDD[T]

Example

val y = sc.parallelize(1 to 10, 10)
val z = (y++y)
z.collect
z.unpersist(true)
14/04/19 03:04:57 INFO UnionRDD: Removing RDD 22 from persistence list
14/04/19 03:04:57 INFO BlockManager: Removing RDD 22
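
A quick way to see the effect of unpersist is to watch the RDD's storage level flip when it is cached and then released. A minimal sketch, assuming a live SparkContext sc (the exact StorageLevel strings printed may vary between Spark versions):

    // Cache the RDD, inspect its storage level, then release it.
    val y = sc.parallelize(1 to 10, 10)
    y.cache()                       // marks the RDD for in-memory storage
    y.count()                       // materializes the cache
    println(y.getStorageLevel)      // memory-deserialized storage level
    y.unpersist(blocking = true)    // drops the blocks; blocks until done
    println(y.getStorageLevel)      // StorageLevel.NONE: no longer cached
    y.count()                       // still works: recomputed from lineage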

Spark's cache-cleanup mechanisms:
The MetadataCleaner object runs a timer that periodically cleans the following metadata:
MAP_OUTPUT_TRACKER: output metadata of map tasks
SPARK_CONTEXT: RDDs held in persistentRdds
HTTP_BROADCAST: metadata for HTTP broadcasts
BLOCK_MANAGER: data stored in the BlockManager
SHUFFLE_BLOCK_MANAGER: shuffle output data
BROADCAST_VARS: metadata for Torrent-mode broadcasts
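
In the Spark releases that shipped MetadataCleaner (it was removed, along with its configuration, in Spark 2.0), the timer's expiry period was driven by the spark.cleaner.ttl setting. A minimal sketch of enabling it on such a release; the app name and TTL value here are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // spark.cleaner.ttl (seconds) told MetadataCleaner how old metadata had
    // to be before it was dropped; it applies to pre-2.0 Spark only.
    val conf = new SparkConf()
      .setAppName("cleaner-demo")          // illustrative app name
      .set("spark.cleaner.ttl", "3600")    // expire metadata older than 1 hour
    val sc = new SparkContext(conf)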

ContextCleaner cleans the actual data:
ContextCleaner holds a weak reference for each RDD, shuffle, broadcast, accumulator, and checkpoint. When the referenced object becomes unreachable, its reference is enqueued into a referenceQueue, and a dedicated thread processes the objects on that queue.
RDD: ultimately removes the RDD's data from the memoryStore and diskStore of each node's BlockManager
shuffle: removes the mapStatuses entry for that shuffleId on the driver, and deletes the data files and index files of every partition for that shuffleId on all nodes
broadcast: ultimately removes the broadcast data from the memoryStore and diskStore of each node's BlockManager
Checkpoint: deletes the files for that rddId under the checkpointDir directory
Taking RDDs as an example, what does this design buy us?
By default an RDD is not cached: once computed, it must be recomputed the next time it is used. To avoid that recomputation cost you cache the RDD; that much is obvious. But when should a cached RDD be released? This is where the weak references above come in. When we call persist to cache an RDD, registerRDDForCleanup(this) is invoked, which registers the RDD itself behind a weak reference. Once the RDD becomes unreachable, the JVM automatically enqueues its reference into the referenceQueue, and after the next GC the doCleanupRDD branch runs. Since the RDD's data may live in memory or on disk, this guarantees that when an unreachable RDD is collected, its actual data in the BlockManager is freed as well.
One further question: when does an RDD become unreachable? To hand memory back for other uses, besides manual unpersist calls there must be a mechanism that periodically evicts cached RDD data; that is the job of MetadataCleaner's SPARK_CONTEXT task. It periodically removes expired entries from persistentRdds, which has the same effect as unpersist. Once an entry is removed, the cached RDD no longer has a strong reference.
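
The weak-reference-plus-queue pattern that ContextCleaner relies on is plain JVM machinery, so it can be demonstrated standalone. A sketch in Scala mirroring the ContextCleaner shape; GC timing is best-effort, hence the retry loop, and the class and task names here are made up for illustration:

    import java.lang.ref.{Reference, ReferenceQueue, WeakReference}

    // Hold a payload only weakly, attach a cleanup task to the reference,
    // and observe the reference being enqueued once the payload is unreachable.
    case class CleanTask(id: Int)

    class TaskWeakRef(val task: CleanTask, referent: AnyRef,
                      queue: ReferenceQueue[AnyRef])
      extends WeakReference[AnyRef](referent, queue)

    val queue = new ReferenceQueue[AnyRef]
    var payload: AnyRef = new Array[Byte](1024)      // stands in for an RDD
    val ref = new TaskWeakRef(CleanTask(42), payload, queue)

    payload = null                                   // drop the only strong reference
    var polled: Reference[_ <: AnyRef] = null
    var tries = 0
    while (polled == null && tries < 50) {           // GC is best-effort
      System.gc()
      polled = queue.remove(100)                     // like keepCleaning's poll
      tries += 1
    }
    polled match {
      case r: TaskWeakRef => println("cleanup task: " + r.task)  // CleanTask(42)
      case _              => println("referent not collected yet")
    }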
Spark Core 2.0 ContextCleaner

CleanupTaskWeakReference is the core of the whole cleanup mechanism. It is an ordinary class with three members that extends WeakReference. The referent is held weakly: when the referenced object is garbage collected, the CleanupTaskWeakReference is automatically placed on the queue. For details, see: java WeakReference ReferenceQueue test.

    /**
     * A WeakReference associated with a CleanupTask.
     *
     * When the referent object becomes only weakly reachable, the corresponding
     * CleanupTaskWeakReference is automatically added to the given reference queue.
     */
    private class CleanupTaskWeakReference(
        val task: CleanupTask,
        referent: AnyRef,
        referenceQueue: ReferenceQueue[AnyRef])
      extends WeakReference(referent, referenceQueue)

Taking RDD as an example: when a cleanup is registered for an RDD, a CleanRDD(rdd.id) task object is created, recording which rdd is to be cleaned up.

    /** Register an RDD for cleanup when it is garbage collected. */
    def registerRDDForCleanup(rdd: RDD[_]): Unit = {
      registerForCleanup(rdd, CleanRDD(rdd.id))
    }
In registerForCleanup, the CleanupTaskWeakReference is added to referenceBuffer so that the reference object itself is not garbage collected.
    /** Register an object for cleanup. */
    private def registerForCleanup(objectForCleanup: AnyRef, task: CleanupTask): Unit = {
      referenceBuffer.add(new CleanupTaskWeakReference(task, objectForCleanup, referenceQueue))
    }
keepCleaning repeatedly takes collected objects off the referenceQueue and dispatches the matching handler for each one; reference.get returns the CleanupTaskWeakReference.
    private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
      while (!stopped) {
        try {
          val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
            .map(_.asInstanceOf[CleanupTaskWeakReference])
          // Synchronize here to avoid being interrupted on stop()
          synchronized {
            reference.map(_.task).foreach { task =>
              logDebug("Got cleaning task " + task)
              referenceBuffer.remove(reference.get)
              task match {
                case CleanRDD(rddId) =>
                  doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
                case CleanShuffle(shuffleId) =>
                  doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
                case CleanBroadcast(broadcastId) =>
                  doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
                case CleanAccum(accId) =>
                  doCleanupAccum(accId, blocking = blockOnCleanupTasks)
                case CleanCheckpoint(rddId) =>
                  doCleanCheckpoint(rddId)
              }
            }
          }
        } catch {
          case ie: InterruptedException if stopped => // ignore
          case e: Exception => logError("Error in cleaning thread", e)
        }
      }
    }

The doCleanupRDD method performs the actual cleanup.
    /** Perform RDD cleanup. */
    def doCleanupRDD(rddId: Int, blocking: Boolean): Unit = {
      try {
        logDebug("Cleaning RDD " + rddId)
        sc.unpersistRDD(rddId, blocking)
        listeners.asScala.foreach(_.rddCleaned(rddId))
        logInfo("Cleaned RDD " + rddId)
      } catch {
        case e: Exception => logError("Error cleaning RDD " + rddId, e)
      }
    }