Spark DAG process and Spark graph computation: operators


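GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (for example, subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.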



import org.apache.spark._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD

If you are not using the Spark shell, you will also need a SparkContext.
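Outside the shell, a context has to be created by hand. The following is a minimal sketch, assuming a local master and an arbitrary application name (both are illustrative choices, not values from the original text):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed settings: run locally on all cores under a made-up application name.
val conf = new SparkConf().setAppName("GraphXExamples").setMaster("local[*]")
// getOrCreate reuses a context that is already running (for example the one
// provided by the Spark shell) instead of failing on a second instantiation.
val sc: SparkContext = SparkContext.getOrCreate(conf)
```

In a deployed application the master is usually supplied by spark-submit rather than hard-coded.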


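The GraphX property graph is a directed multigraph with user-defined objects attached to each vertex and edge. A directed multigraph is a directed graph that may contain multiple parallel edges sharing the same source and destination vertex. Support for parallel edges simplifies modeling scenarios where several relationships (for example, co-worker and friend) can exist between the same vertices. Each vertex is keyed by a unique 64-bit long identifier (VertexId). GraphX does not impose any ordering constraints on vertex identifiers. Similarly, each edge has corresponding source and destination vertex identifiers.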
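To make the idea of parallel edges concrete, here is a small sketch. The vertex names, edge labels, and the local SparkContext settings are all assumptions made for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumed local context; in the Spark shell the existing one is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("ParallelEdges").setMaster("local[*]"))

// Two hypothetical people who are both colleagues and friends.
val people: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val relations: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "colleague"), Edge(1L, 2L, "friend")))

// Graph(...) keeps both edges, since the property graph is a multigraph.
val multigraph: Graph[String, String] = Graph(people, relations)
val parallelEdgeCount: Long =
  multigraph.edges.filter(e => e.srcId == 1L && e.dstId == 2L).count()
```

Both edges from vertex 1 to vertex 2 survive construction, so `parallelEdgeCount` is 2.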




class VertexProperty()
case class UserProperty(val name : String) extends VertexProperty
case class ProductProperty(val name : String, val price : Double) extends VertexProperty
// The graph might then have the type:
var graph : Graph[VertexProperty, String] = null



class Graph[VD, ED] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
}

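The classes VertexRDD[VD] and EdgeRDD[ED] extend, and are optimized versions of, RDD[(VertexId, VD)] and RDD[Edge[ED]] respectively. Both VertexRDD[VD] and EdgeRDD[ED] provide additional functionality built around graph computation and take advantage of internal optimizations.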





val userGraph: Graph[(String, String), String]

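There are several ways to construct a property graph from raw files, RDDs, or even synthetic generators. Probably the most general method is to use the Graph object. For example, the following code constructs a graph from a collection of RDDs: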

// Assume the SparkContext has already been constructed
val sc: SparkContext
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Seq((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default user in case there are relationship with missing user
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)


We can deconstruct a graph into its vertex and edge views by using the graph.vertices and graph.edges members, respectively.

val graph: Graph[(String, String), String] // Constructed from above
// Count all users which are postdocs
graph.vertices.filter {case (id, (name, pos)) => pos == "postdoc"}.count
// Count all the edges where src > dst
graph.edges.filter(e => e.srcId > e.dstId).count

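Note that graph.vertices returns a VertexRDD[(String, String)], which extends RDD[(VertexId, (String, String))], so we use a Scala case expression to deconstruct the tuple. On the other hand, graph.edges returns an EdgeRDD containing Edge[String] objects. We could also have used the case class type constructor, as in the following: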

graph.edges.filter {case Edge(src, dst, prop) => src > dst}.count

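In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view. The triplet view logically joins the vertex and edge properties, yielding an RDD[EdgeTriplet[VD, ED]] containing instances of the EdgeTriplet class. This join can be expressed in the following SQL expression: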

SELECT src.id, dst.id, src.attr, e.attr, dst.attr
FROM edges AS e LEFT JOIN vertices AS src, vertices AS dst
ON e.srcId = src.Id AND e.dstId = dst.Id




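The EdgeTriplet class extends the Edge class by adding the srcAttr and dstAttr members, which contain the source and destination properties respectively. We can use the triplet view of a graph to render a collection of strings describing the relationships between users.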

val graph: Graph[(String, String), String] // Constructed from above
// Use the triplets view to create an RDD of facts.
val facts: RDD[String] =
  graph.triplets.map(triplet =>
    triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)



val graph: Graph[(String, String), String]
// Use the implicit GraphOps.inDegrees operator
val inDegrees: VertexRDD[Int] = graph.inDegrees


/** Summary of the functionality in the property graph */
class Graph[VD, ED] {
  // Information about the Graph ===================================================================
  val numEdges: Long
  val numVertices: Long
  val inDegrees: VertexRDD[Int]
  val outDegrees: VertexRDD[Int]
  val degrees: VertexRDD[Int]
  // Views of the graph as collections =============================================================
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  val triplets: RDD[EdgeTriplet[VD, ED]]
  // Functions for caching graphs ==================================================================
  def persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
  def cache(): Graph[VD, ED]
  def unpersistVertices(blocking: Boolean = false): Graph[VD, ED]
  // Change the partitioning heuristic  ============================================================
  def partitionBy(partitionStrategy: PartitionStrategy): Graph[VD, ED]
  // Transform vertex and edge attributes ==========================================================
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapEdges[ED2](map: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2]): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: (PartitionID, Iterator[EdgeTriplet[VD, ED]]) => Iterator[ED2])
    : Graph[VD, ED2]
  // Modify the graph structure ====================================================================
  def reverse: Graph[VD, ED]
  def subgraph(
      epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
      vpred: (VertexId, VD) => Boolean = ((v, d) => true))
    : Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]
  // Join RDDs with the graph ======================================================================
  def joinVertices[U](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD): Graph[VD, ED]
  def outerJoinVertices[U, VD2](other: RDD[(VertexId, U)])
      (mapFunc: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
  // Aggregate information about adjacent triplets =================================================
  def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexId]]
  def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]]
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[Msg]
  // Iterative graph-parallel computation ==========================================================
  def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED]
  // Basic graph algorithms ========================================================================
  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
  def connectedComponents(): Graph[VertexId, ED]
  def triangleCount(): Graph[Int, ED]
  def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]
}



class Graph[VD, ED] {
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}



val newVertices = graph.vertices.map {case (id, attr) => (id, mapUdf(id, attr))}
val newGraph = Graph(newVertices, graph.edges)


val newGraph = graph.mapVertices((id, attr) => mapUdf(id, attr))


// Given a graph where the vertex property is the out degree
val inputGraph: Graph[Int, String] =
  graph.outerJoinVertices(graph.outDegrees)((vid, _, degOpt) => degOpt.getOrElse(0))
// Construct a graph where each edge contains the weight
// and each vertex is the initial PageRank
val outputGraph: Graph[Double, Double] =
  inputGraph.mapTriplets(triplet => 1.0 / triplet.srcAttr).mapVertices((id, _) => 1.0)



class Graph[VD, ED] {
  def reverse: Graph[VD, ED]
  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
               vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED] // intersection
  def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
}


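The subgraph operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate and the edges that both satisfy the edge predicate and connect vertices that satisfy the vertex predicate. The subgraph operator can be used in a number of situations to restrict the graph to the vertices and edges of interest or to eliminate broken links. For example, in the following code we remove broken links: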

// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Seq((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
                       (4L, ("peter", "student"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
                       Edge(4L, 0L, "student"),   Edge(5L, 0L, "colleague")))
// Define a default user in case there are relationship with missing user
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
// Notice that there is a user 0 (for which we have no information) connected to users
// 4 (peter) and 5 (franklin).
graph.triplets.map(
  triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
).collect.foreach(println(_))
// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
// The valid subgraph will disconnect users 4 and 5 by removing user 0
validGraph.vertices.collect.foreach(println(_))
validGraph.triplets.map(
  triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
).collect.foreach(println(_))


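The mask operator constructs a subgraph by returning a graph that contains only the vertices and edges that are also found in the input graph. It can be used in conjunction with the subgraph operator to restrict a graph based on the properties in another related graph. For example, we might run connected components using the graph with missing vertices and then restrict the answer to the valid subgraph.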

// Run Connected Components
val ccGraph = graph.connectedComponents() // No longer contains missing field
// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
// Restrict the answer to the valid subgraph
val validCCGraph = ccGraph.mask(validGraph)

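The groupEdges operator merges parallel edges (that is, duplicate edges between pairs of vertices) in the multigraph. In many numerical applications, parallel edges can be added (their weights combined) into a single edge, thereby reducing the size of the graph.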
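A minimal sketch of merging parallel edges by summing their weights follows. The vertices, weights, and local SparkContext settings are assumptions for illustration. Note that groupEdges only merges edges that live in the same partition, so the graph is repartitioned with partitionBy first:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

// Assumed local context; in the Spark shell the existing one is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("GroupEdges").setMaster("local[*]"))

// Hypothetical weighted multigraph with two parallel edges from 1 to 2.
val nodes: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val weightedEdges: RDD[Edge[Double]] =
  sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(1L, 2L, 2.5), Edge(2L, 3L, 4.0)))
val weighted: Graph[String, Double] = Graph(nodes, weightedEdges)

// RandomVertexCut colocates all same-direction edges between two vertices,
// which is what groupEdges needs in order to see every duplicate.
val merged: Graph[String, Double] =
  weighted.partitionBy(PartitionStrategy.RandomVertexCut).groupEdges(_ + _)

// Collect the merged edges as ((src, dst) -> weight) pairs.
val mergedEdges: Map[(VertexId, VertexId), Double] =
  merged.edges.map(e => ((e.srcId, e.dstId), e.attr)).collect().toMap
```

Here the two edges from 1 to 2 collapse into a single edge of weight 3.5, while the edge from 2 to 3 is unchanged.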




class Graph[VD, ED] {
  def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD)
    : Graph[VD, ED]
  def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])(map: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
}



val nonUniqueCosts: RDD[(VertexId, Double)]
val uniqueCosts: VertexRDD[Double] =
  graph.vertices.aggregateUsingIndex(nonUniqueCosts, (a,b) => a + b)
val joinedGraph = graph.joinVertices(uniqueCosts)(
  (id, oldCost, extraCost) => oldCost + extraCost)


val outDegrees: VertexRDD[Int] = graph.outDegrees
val degreeGraph = graph.outerJoinVertices(outDegrees){(id, oldAttr, outDegOpt) =>
  outDegOpt match {
    case Some(outDeg) => outDeg
    case None => 0 // No outDegree means zero outDegree
  }
}


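// Hypothetical uncurried form, shown for comparison only: with a single parameter list,
// the function's argument types cannot be inferred from the first argument and would
// have to be annotated explicitly. The actual GraphX API uses the curried form above.
val joinedGraph = graph.joinVertices(uniqueCosts,
  (id: VertexId, oldCost: Double, extraCost: Double) => oldCost + extraCost)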



To improve performance, the primary aggregation operator changed from graph.mapReduceTriplets to graph.aggregateMessages.



class Graph[VD, ED] {
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[Msg]
}

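The user-defined sendMsg function takes an EdgeContext, which exposes the source and destination attributes along with the edge attribute and functions (sendToSrc and sendToDst) to send messages to the source and destination vertices. Think of sendMsg as the map function in map-reduce. The user-defined mergeMsg function takes two messages destined for the same vertex and yields a single message. Think of mergeMsg as the reduce function in map-reduce. The aggregateMessages operator returns a VertexRDD[Msg] containing the aggregate message (of type Msg) destined for each vertex. Vertices that did not receive a message are not included in the returned VertexRDD.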
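As a small illustration of the map and reduce roles, the sketch below counts incoming edges per vertex. The three-vertex graph and the local SparkContext settings are assumptions for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

// Assumed local context; in the Spark shell the existing one is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("AggregateInDegree").setMaster("local[*]"))

// Hypothetical graph: vertices 1 and 2 both point at vertex 3.
val nodes: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val links: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 3L, 0), Edge(2L, 3L, 0)))
val g: Graph[String, Int] = Graph(nodes, links)

// sendMsg (the "map" step) sends 1 along every edge to its destination;
// mergeMsg (the "reduce" step) adds up the messages arriving at a vertex.
val inDeg: VertexRDD[Int] =
  g.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)

// Only vertex 3 received messages, so it is the only key in the result.
val inDegMap: Map[VertexId, Int] = inDeg.collect().toMap // Map(3 -> 2)
```

Vertices 1 and 2 have no incoming edges, receive no messages, and are therefore absent from the result rather than present with a count of 0.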

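In addition, aggregateMessages takes an optional tripletFields argument, which indicates what data is accessed in the EdgeContext (for example, the source vertex attribute but not the destination vertex attribute). The possible options for tripletFields are defined in TripletFields, and the default value is TripletFields.All, which indicates that the user-defined sendMsg function may access any of the fields in the EdgeContext. The tripletFields argument can be used to notify GraphX that only part of the EdgeContext will be needed, allowing GraphX to select an optimized join strategy. For example, if we are computing the average age of the followers of each user, we would only require the source field, so we would use TripletFields.Src to indicate that only the source field is needed.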
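The average-follower-age case can be sketched as follows. The ages, the follower edges, and the local SparkContext settings are assumptions for illustration; only the source attribute is read, so TripletFields.Src is passed:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

// Assumed local context; in the Spark shell the existing one is reused.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("FollowerAges").setMaster("local[*]"))

// Hypothetical users keyed by id, with their age as the vertex attribute.
val ages: RDD[(VertexId, Double)] =
  sc.parallelize(Seq((1L, 30.0), (2L, 40.0), (3L, 20.0)))
// An edge src -> dst means "src follows dst": users 1 and 2 follow user 3.
val follows: RDD[Edge[Int]] =
  sc.parallelize(Seq(Edge(1L, 3L, 0), Edge(2L, 3L, 0)))
val social: Graph[Double, Int] = Graph(ages, follows)

// sendMsg reads only ctx.srcAttr, so TripletFields.Src tells GraphX that the
// destination attribute does not need to be shipped to the edges.
val countAndTotal: VertexRDD[(Int, Double)] =
  social.aggregateMessages[(Int, Double)](
    ctx => ctx.sendToDst((1, ctx.srcAttr)),
    (a, b) => (a._1 + b._1, a._2 + b._2),
    TripletFields.Src)

// Divide the total age by the follower count for each followed user.
val avgFollowerAge: Map[VertexId, Double] =
  countAndTotal.mapValues((id, value) => value._2 / value._1).collect().toMap
```

User 3 is followed by users aged 30 and 40, so its average follower age is 35.0. Users 1 and 2 have no followers and do not appear in the result.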

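In earlier versions of GraphX we used bytecode inspection to infer the TripletFields. However, we found bytecode inspection to be slightly unreliable and instead opted for more explicit user control.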


package spark2.graphx

import org.apache.log4j.{Level, Logger}
import org.apache.spark.graphx.{Graph, VertexRDD}
import org.apache.spark.graphx.util.GraphGenerators
import org.apache.spark.sql.SparkSession

object AggregateMessagesExample {
  def main(args: Array[String]): Unit = {
    // Creates a SparkSession. The master URL is assumed to be supplied externally,
    // for example by spark-submit.
    val spark = SparkSession
      .builder
      .appName("AggregateMessagesExample")
      .getOrCreate()
    val sc = spark.sparkContext

    // Generate a random graph whose vertex attribute (the "age") is the vertex id
    val graph: Graph[Double, Int] =
      GraphGenerators.logNormalGraph(sc, numVertices = 5).mapVertices((id, _) => id.toDouble)
    // Compute the number of older followers and their total age
    val olderFollowers: VertexRDD[(Int, Double)] = graph.aggregateMessages[(Int, Double)](
      triplet => { // Map Function
        if (triplet.srcAttr > triplet.dstAttr) {
          // Send message to destination vertex containing counter and age
          triplet.sendToDst((1, triplet.srcAttr))
        }
      },
      // Add counter and age
      (a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
    )
    // Divide total age by number of older followers to get average age of older followers
    val avgAgeOfOlderFollowers: VertexRDD[Double] =
      olderFollowers.mapValues( (id, value) =>
        value match { case (count, totalAge) => totalAge / count } )
    // Display the results
    avgAgeOfOlderFollowers.collect.foreach(println(_))

    spark.stop()
  }
}


