Spark-SQL-core

@(spark)[sql|execution]
The whole point of spark-sql is to translate SQL statements into calls against the Spark API. The overall flow is covered in the analysis in the SQLContext section below.
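
For orientation, here is a minimal usage sketch (Spark 1.3-era API, assuming a temporary table named people has already been registered); it only shows where the translation gets triggered:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sql-demo"))
val sqlContext = new SQLContext(sc)

// sql() parses the statement into a LogicalPlan wrapped in a DataFrame;
// analysis, optimization and planning (the QueryExecution below) run lazily.
val df = sqlContext.sql("SELECT name, age FROM people WHERE age > 21")

// Spark jobs are only submitted when an action is invoked.
df.collect()
```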

SQLContext

/**                                                                                                                                                                     
 * The entry point for working with structured data (rows and columns) in Spark.  Allows the                                                                            
 * creation of [[DataFrame]] objects as well as the execution of SQL queries.                                                                                           
 *                                                                                                                                                                                                                                                              
 */                                                                                                                                                                     
class SQLContext(@transient val sparkContext: SparkContext)                                                                                                             
  extends org.apache.spark.Logging                                                                                                                                      
  with Serializable {

SQLContext does not inherit from SparkContext; it is a standalone class.
It covers the following areas:
* @groupname basic Basic Operations
* @groupname ddl_ops Persistent Catalog DDL
* @groupname cachemgmt Cached Table Management
* @groupname genericdata Generic Data Sources
* @groupname specificdata Specific Data Sources
* @groupname config Configuration
* @groupname dataframes Custom DataFrame Creation
* @groupname Ungrouped Support functions for language integrated queries.

SQLContext also has a lot of members, such as catalog, ddlParser, sqlParser, optimizer and so on; these are the components that do the actual work of executing SQL.

The result of parsing is placed into a LogicalPlan.
Once the LogicalPlan is available, the core logic lives here:

/**                                                                                                                                                                   
   * :: DeveloperApi ::                                                                                                                                                 
   * The primary workflow for executing relational queries using Spark.  Designed to allow easy                                                                         
   * access to the intermediate phases of query execution for developers.                                                                                               
   */                                                                                                                                                                   
  @DeveloperApi                                                                                                                                                         
  protected[sql] class QueryExecution(val logical: LogicalPlan) {                                                                                                       
    def assertAnalyzed(): Unit = analyzer.checkAnalysis(analyzed)                                                                                                       

    // resolution plus simple, mechanical rewrites
    lazy val analyzed: LogicalPlan = analyzer(logical) 
    // substitute in cached data where possible
    lazy val withCachedData: LogicalPlan = {                                                                                                                            
      assertAnalyzed()                                                                                                                                                  
      cacheManager.useCachedData(analyzed)                                                                                                                              
    }                    
    // query optimization
    lazy val optimizedPlan: LogicalPlan = optimizer(withCachedData)                                                                                                     

    // TODO: Don't just pick the first one...  
    // generate the Spark plan, i.e. the LogicalPlan -> physical SparkPlan conversion.
    // In a traditional RDBMS this is where cost-based optimization would come in; it does not appear to exist here yet.
    lazy val sparkPlan: SparkPlan = {                                                                                                                                   
      SparkPlan.currentContext.set(self)                                                                                                                                
      planner(optimizedPlan).next()                                                                                                                                     
    }                                                                                                                                                                   
    // executedPlan should not be used to initialize any SparkPlan. It should be                                                                                        
    // only used for execution.   
    // insert the shuffle (Exchange) steps
    lazy val executedPlan: SparkPlan = prepareForExecution(sparkPlan)

A few points in more detail:
- parser: the ddlParser is tried first; if it fails, the sqlParser does the parsing

protected[sql] def parseSql(sql: String): LogicalPlan = {                                                                                                             
    ddlParser(sql, false).getOrElse(sqlParser(sql))                                                                                                                     
}
- analyzer
protected[sql] lazy val analyzer: Analyzer =                                                                                                                          
    new Analyzer(catalog, functionRegistry, caseSensitive = true) {                                                                                                     
      override val extendedResolutionRules =                                                                                                                            
        ExtractPythonUdfs ::                                                                                                                                            
        sources.PreInsertCastAndRename ::                                                                                                                               
        Nil                                                                                                                                                             

      override val extendedCheckRules = Seq(                                                                                                                            
        sources.PreWriteCheck(catalog)                                                                                                                                  
      )                                                                                                                                                                 
    }
- Optimizer = DefaultOptimizer
object DefaultOptimizer extends Optimizer {                                                                                                                             
  val batches =
    // SubQueries are only needed for analysis and can be removed before execution.
    Batch("Remove SubQueries", FixedPoint(100),                                                                                                                         
      EliminateSubQueries) ::                                                                                                                                           
    Batch("Combine Limits", FixedPoint(100),                                                                                                                            
      CombineLimits) ::                                                                                                                                                 
    Batch("ConstantFolding", FixedPoint(100),                                                                                                                           
      NullPropagation,                                                                                                                                                  
      ConstantFolding,                                                                                                                                                  
      LikeSimplification,                                                                                                                                               
      BooleanSimplification,                                                                                                                                            
      SimplifyFilters,                                                                                                                                                  
      SimplifyCasts,                                                                                                                                                    
      SimplifyCaseConversionExpressions,                                                                                                                                
      OptimizeIn) ::                                                                                                                                                    
    Batch("Decimal Optimizations", FixedPoint(100),                                                                                                                     
      DecimalAggregates) ::                                                                                                                                             
    Batch("Filter Pushdown", FixedPoint(100),                                                                                                                           
      UnionPushdown,                                                                                                                                                    
      CombineFilters,                                                                                                                                                   
      PushPredicateThroughProject,                                                                                                                                      
      PushPredicateThroughJoin,                                                                                                                                         
      PushPredicateThroughGenerate,                                                                                                                                     
      ColumnPruning) ::                                                                                                                                                 
    Batch("LocalRelation", FixedPoint(100),                                                                                                                             
      ConvertToLocalRelation) :: Nil                                                                                                                                    
    }
- planner

```
protected[sql] class SparkPlanner extends SparkStrategies {
    val sparkContext: SparkContext = self.sparkContext                                                                                                                  

    def strategies: Seq[Strategy] =                                                                                                                                     
      experimental.extraStrategies ++ (                                                                                                                                 
      DataSourceStrategy ::                                                                                                                                             
      DDLStrategy ::                                                                                                                                                    
      TakeOrdered ::                                                                                                                                                    
      HashAggregation ::                                                                                                                                                
      LeftSemiJoin ::                                                                                                                                                   
      HashJoin ::                                                                                                                                                       
      InMemoryScans ::                                                                                                                                                  
      ParquetOperations ::                                                                                                                                              
      BasicOperators ::                                                                                                                                                 
      CartesianProduct ::                                                                                                                                               
      BroadcastNestedLoopJoin :: Nil)
```
- prepareForExecution, where the AddExchange rule inserts whatever shuffles are needed

```
  /**
   * Prepares a planned SparkPlan for execution by inserting shuffle operations as needed.
   */
  @transient
  protected[sql] val prepareForExecution = new RuleExecutor[SparkPlan] {
    val batches =
      Batch("Add exchange", Once, AddExchange(self)) :: Nil
  }
```

The optimizer, planner and friends above are all rule engines. Applying a rule engine means traversing the entire LogicalPlan tree in some order (pre-order or post-order) and applying concrete rules to individual LogicalPlan nodes.
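
As a toy illustration of the pattern (not actual Spark source, but in the spirit of SimplifyFilters): a rule is just a function from LogicalPlan to LogicalPlan that uses transform to rewrite matching nodes while walking the tree.

```
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.{Filter, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.BooleanType

object RemoveTrivialFilters extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // a WHERE TRUE filter contributes nothing, so splice the child in directly
    case Filter(Literal(true, BooleanType), child) => child
  }
}
```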

Integration with Spark Core

The end result is a SparkPlan:
1. The base is abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable; its key method is def execute(): RDD[Row].
1. The traits extending it are LeafNode, UnaryNode and BinaryNode, and every physical plan generated above extends one of these. In other words, a physical plan amounts to logic applied to a series of RDD[Row]s.
1. Invoking execute on the root node of the physical plan tree invokes execute across the whole tree.

The resulting physical plan tree can be understood as a handful of pieces of Spark core code. Loosely speaking, it is as if code like the following were generated:

val input = sc.textFile(...)
// Filter
var rows = input.filter(...)
// Project
rows = rows.map(...)
// Join, Exchange
rows = rows.repartition(...)
// Project
rows = rows.map(...)
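
To make the "logic over RDD[Row]" point concrete, here is a self-contained mini version of the same design (made-up names, not Spark source): each node's execute() wraps its child's RDD[Row] into its own.

```
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

sealed trait MiniPlan { def execute(): RDD[Row] }

case class MiniScan(input: RDD[Row]) extends MiniPlan {
  def execute(): RDD[Row] = input
}

case class MiniFilter(predicate: Row => Boolean, child: MiniPlan) extends MiniPlan {
  // same shape as the real Filter operator: wrap the child's RDD partition by partition
  def execute(): RDD[Row] = child.execute().mapPartitions(_.filter(predicate))
}

case class MiniProject(projection: Row => Row, child: MiniPlan) extends MiniPlan {
  def execute(): RDD[Row] = child.execute().mapPartitions(_.map(projection))
}

// Calling execute() on the root recursively executes the whole tree:
//   MiniProject(p, MiniFilter(f, MiniScan(rows))).execute()
```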

Summary

- The whole point of spark-sql is to translate SQL statements into calls against the Spark API.
- A DataFrame packages data together with a schema, and is conceptually much like a logical table.

Source

A set of APIs for adding data sources to Spark SQL.

interfaces

A pile of interfaces; the important ones are described below.

/**                                                                                                                                                                     
 * ::DeveloperApi::                                                                                                                                                     
 * Implemented by objects that produce relations for a specific kind of data source.  When                                                                              
 * Spark SQL is given a DDL operation with a USING clause specified (to specify the implemented                                                                         
 * RelationProvider), this interface is used to pass in the parameters specified by a user.                                                                             
 *                                                                                                                                                                      
 * Users may specify the fully qualified class name of a given data source.  When that class is                                                                         
 * not found Spark SQL will append the class name `DefaultSource` to the path, allowing for                                                                             
 * less verbose invocation.  For example, 'org.apache.spark.sql.json' would resolve to the                                                                              
 * data source 'org.apache.spark.sql.json.DefaultSource'                                                                                                                
 *                                                                                                                                                                      
 * A new instance of this class with be instantiated each time a DDL call is made.                                                                                      
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
trait RelationProvider {                                                                                                                                                
  /**                                                                                                                                                                   
   * Returns a new base relation with the given parameters.                                                                                                             
   * Note: the parameters' keywords are case insensitive and this insensitivity is enforced                                                                             
   * by the Map that is passed to the function.                                                                                                                         
   */                                                                                                                                                                   
  def createRelation(sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation                                                                             
}

This is how a relation is obtained.

/**                                                                                                                                                                     
 * ::DeveloperApi::                                                                                                                                                     
 * Represents a collection of tuples with a known schema. Classes that extend BaseRelation must                                                                         
 * be able to produce the schema of their data in the form of a [[StructType]]. Concrete                                                                                
 * implementation should inherit from one of the descendant `Scan` classes, which define various                                                                        
 * abstract methods for execution.                                                                                                                                      
 *                                                                                                                                                                      
 * BaseRelations must also define a equality function that only returns true when the two                                                                               
 * instances will return the same data. This equality function is used when determining when                                                                            
 * it is safe to substitute cached results for a given relation.                                                                                                        
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
abstract class BaseRelation {                                                                                                                                           
  def sqlContext: SQLContext                                                                                                                                            
  def schema: StructType                                                                                                                                                

  /**                                                                                                                                                                   
   * Returns an estimated size of this relation in bytes. This information is used by the planner                                                                       
   * to decided when it is safe to broadcast a relation and can be overridden by sources that                                                                           
   * know the size ahead of time. By default, the system will assume that tables are too                                                                                
   * large to broadcast. This method will be called multiple times during query planning                                                                                
   * and thus should not perform expensive operations for each invocation.                                                                                              
   *                                                                                                                                                                    
   * Note that it is always better to overestimate size than underestimate, because underestimation                                                                     
   * could lead to execution plans that are suboptimal (i.e. broadcasting a very large table).                                                                          
   */                                                                                                                                                                   
  def sizeInBytes: Long = sqlContext.conf.defaultSizeInBytes                                                                                                            
}

BaseRelation is the abstraction of a relation.

There are also a number of *Scan traits that define how the rows are actually read.
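
A minimal end-to-end sketch of these interfaces, assuming the TableScan variant; the package name, NumberRelation, the n column and the to option are all made up for illustration.

```
package com.example.numbers

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Named DefaultSource so that USING com.example.numbers resolves to this class.
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext, parameters: Map[String, String]): BaseRelation =
    NumberRelation(parameters("to").toInt)(sqlContext)
}

case class NumberRelation(to: Int)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("n", IntegerType, nullable = false) :: Nil)

  // TableScan: produce the whole relation as an RDD[Row]
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to to).map(Row(_))
}
```

With something like this on the classpath, a CREATE TEMPORARY TABLE ... USING com.example.numbers OPTIONS (...) statement (or sqlContext.load) ends up in createRelation above.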

DataSourceStrategy

As I understand it, this is a translation step: it ties the logical plan to the BaseRelation.

Filter

A filter predicate for data sources.
Includes the common comparison operations such as >, =, < and so on.
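
For example, a source that mixes in PrunedFilteredScan receives the pushed-down predicates as values like these (column names made up; only the simple comparison filters are shown):

```
import org.apache.spark.sql.sources.{Filter, GreaterThan, In, IsNotNull}

val pushed: Array[Filter] = Array(
  GreaterThan("age", 21),                 // age > 21
  In("country", Array[Any]("CN", "US")),  // country IN ('CN', 'US')
  IsNotNull("name")
)
```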

ddl

This also contains a parser. Unlike the SQL parser it mainly handles DDL statements, of which there are only a few at the moment:
Parser[LogicalPlan] = createTable | describeTable | refreshTable

rules

There are two main rules:

/**                                                                                                                                                                     
 * A rule to do pre-insert data type casting and field renaming. Before we insert into                                                                                  
 * an [[InsertableRelation]], we will use this rule to make sure that                                                                                                   
 * the columns to be inserted have the correct data type and fields have the correct names.                                                                             
 */                                                                                                                                                                     
private[sql] object PreInsertCastAndRename extends Rule[LogicalPlan] {      

/**                                                                                                                                                                     
 * A rule to do various checks before inserting into or writing to a data source table.                                                                                 
 */                                                                                                                                                                     
private[sql] case class PreWriteCheck(catalog: Catalog) extends (LogicalPlan => Unit) {

JDBC

The JDBC-related pieces; nothing particularly special here.
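
Typical 1.3-era entry points, for context (the URL and table names are placeholders; the source name resolves via the DefaultSource convention described above):

```
// Read: expose a JDBC table as a DataFrame backed by JDBCRelation/JDBCRDD
val people = sqlContext.load("org.apache.spark.sql.jdbc", Map(
  "url"     -> "jdbc:postgresql://host/db",
  "dbtable" -> "public.people"))

// Write: ends up in savePartition below, one transaction per partition
people.insertIntoJDBC("jdbc:postgresql://host/otherdb", "people_copy", overwrite = false)
```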

JDBCRDD

This is where the data actually gets queried.

JDBCRelation

jdbc

    /**
     * Saves a partition of a DataFrame to the JDBC database.  This is done in
     * a single database transaction in order to avoid repeatedly inserting                                                                                             
     * data as much as possible.                                                                                                                                        
     *                                                                                                                                                                  
     * It is still theoretically possible for rows in a DataFrame to be                                                                                                 
     * inserted into the database more than once if a stage somehow fails after                                                                                         
     * the commit occurs but before the stage can return successfully.                                                                                                  
     *                                                                                                                                                                  
     * This is not a closure inside saveTable() because apparently cosmetic                                                                                             
     * implementation changes elsewhere might easily render such a closure                                                                                              
     * non-Serializable.  Instead, we explicitly close over all variables that                                                                                          
     * are used.                                                                                                                                                        
     */                                                                                                                                                                 
    def savePartition(url: String, table: String, iterator: Iterator[Row],                                                                                              
        rddSchema: StructType, nullTypes: Array[Int]): Iterator[Byte] = {

DriverQuirks

A "quirk" is an idiosyncrasy; this trait encapsulates the places where a specific database deviates from the JDBC standard. A hypothetical implementation is sketched after the excerpt.

/**                                                                                                                                                                     
 * Encapsulates workarounds for the extensions, quirks, and bugs in various                                                                                             
 * databases.  Lots of databases define types that aren't explicitly supported                                                                                          
 * by the JDBC spec.  Some JDBC drivers also report inaccurate                                                                                                          
 * information---for instance, BIT(n>1) being reported as a BIT type is quite                                                                                           
 * common, even though BIT in JDBC is meant for single-bit values.  Also, there                                                                                         
 * does not appear to be a standard name for an unbounded string or binary                                                                                              
 * type; we use BLOB and CLOB by default but override with database-specific                                                                                            
 * alternatives when these are absent or do not behave correctly.                                                                                                       
 *                                                                                                                                                                      
 * Currently, the only thing DriverQuirks does is handle type mapping.                                                                                                  
 * `getCatalystType` is used when reading from a JDBC table and `getJDBCType`                                                                                           
 * is used when writing to a JDBC table.  If `getCatalystType` returns `null`,                                                                                          
 * the default type handling is used for the given JDBC type.  Similarly,                                                                                               
 * if `getJDBCType` returns `(null, None)`, the default type handling is used                                                                                           
 * for the given Catalyst type.                                                                                                                                         
 */                                                                                                                                                                     
private[sql] abstract class DriverQuirks {                                                                                                                              
  def getCatalystType(sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): DataType                                                                         
  def getJDBCType(dt: DataType): (String, Option[Int])                                                                                                                  
}
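
A hypothetical concrete implementation could look like this (the class name and mappings are made up; it has to live in the org.apache.spark.sql.jdbc package because DriverQuirks is private[sql]):

```
package org.apache.spark.sql.jdbc

import java.sql.Types

import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

private[sql] class MyDatabaseQuirks extends DriverQuirks {
  // Reading: map the driver's LONGVARCHAR to StringType;
  // returning null falls back to the default type handling.
  def getCatalystType(sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): DataType =
    if (sqlType == Types.LONGVARCHAR) StringType else null

  // Writing: use TEXT for strings; (null, None) means "use the default".
  def getJDBCType(dt: DataType): (String, Option[Int]) = dt match {
    case StringType => ("TEXT", Some(Types.CLOB))
    case _          => (null, None)
  }
}
```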

JSON

Not much to say here: it handles JSON, eating into MongoDB's lunch. It does not seem to support partitioning yet.
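
Typical entry points (the path and the sample record are placeholders):

```
// The schema is inferred by scanning the JSON documents
val people = sqlContext.jsonFile("hdfs://.../people.json")
people.printSchema()

// The same thing from an RDD[String] of JSON objects
val fromRdd = sqlContext.jsonRDD(sc.parallelize("""{"name":"a","age":1}""" :: Nil))
```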

SparkSQLParser

/**                                                                                                                                                                     
 * The top level Spark SQL parser. This parser recognizes syntaxes that are available for all SQL                                                                       
 * dialects supported by Spark SQL, and delegates all the other syntaxes to the `fallback` parser.                                                                      
 *                                                                                                                                                                      
 * @param fallback A function that parses an input string to a logical plan                                                                                             
 */                                                                                                                                                                     
private[sql] class SparkSQLParser(fallback: String => LogicalPlan) extends AbstractSparkSQLParser {

The statements it handles include:

override protected lazy val start: Parser[LogicalPlan] = cache | uncache | set | show | others
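
Concretely, these are statements that this parser recognizes directly, for example:

```
sqlContext.sql("CACHE TABLE people")                    // cache
sqlContext.sql("UNCACHE TABLE people")                  // uncache
sqlContext.sql("SET spark.sql.shuffle.partitions=10")   // set
sqlContext.sql("SHOW TABLES")                           // show
```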

Column

A column in a [[DataFrame]]

DataFrame

A distributed collection of data organized into named columns.

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

A DataFrame is equivalent to a relational table in Spark SQL. There are multiple ways to create a DataFrame:

// Create a DataFrame from Parquet files
val people = sqlContext.parquetFile(“…”)

// Create a DataFrame from data sources
val df = sqlContext.load("...", "json")

Once created, it can be manipulated using the various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions.

To select a column from the data frame, use apply method in Scala and col in Java.

val ageCol = people("age")  // in Scala
Column ageCol = people.col("age")  // in Java

Its methods fall into two broad groups:
1. Eager ones that trigger real computation, for example override def take(n: Int): Array[Row] = head(n)
2. Lazy ones, for example:

def join(right: DataFrame): DataFrame = {                                                                                                                             
    Join(logicalPlan, right.logicalPlan, joinType = Inner, None)                                                                                                        
}

This does not execute immediately; it only builds the logical plan, which is executed together with everything else once a method from group 1 is called.
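
For example (assuming a registered table people):

```
val df    = sqlContext.table("people")
val young = df.filter(df("age") < 21).select(df("name"))   // group 2: only builds a LogicalPlan
val rows  = young.take(10)                                  // group 1: runs QueryExecution and Spark jobs
```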

DataFrameNaFunctions

/**                                                                                                                                                                     
 * :: Experimental ::                                                                                                                                                   
 * Functionality for working with missing data in [[DataFrame]]s.                                                                                                       
 */                                                                                                                                                                     
@Experimental                                                                                                                                                           
final class DataFrameNaFunctions private[sql](df: DataFrame) {
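
Typical usage goes through df.na (the column name is illustrative):

```
df.na.drop()                   // drop rows that contain any null
df.na.fill(0.0, Seq("age"))    // replace nulls in the "age" column with 0.0
```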

functions

Really just a pile of functions.

parquet

TODO. There is quite a bit of material here; in my humble opinion, Spark and Parquet are in principle a rather good match.

CacheManager

Nothing special: it adds, removes, updates and looks up cache entries. Typical calls that end up here are shown after the excerpt.

/**                                                                                                                                                                     
 * Provides support in a SQLContext for caching query results and automatically using these cached                                                                      
 * results when subsequent queries are executed.  Data is cached using byte buffers stored in an                                                                        
 * InMemoryRelation.  This relation is automatically substituted query plans that return the                                                                            
 * `sameResult` as the originally cached query.                                                                                                                         
 *                                                                                                                                                                      
 * Internal to Spark SQL.                                                                                                                                               
 */                                                                                                                                                                     
private[sql] class CacheManager(sqlContext: SQLContext) extends Logging {
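
The calls that end up here are, for example:

```
sqlContext.cacheTable("people")                           // registers an InMemoryRelation (materialized lazily)
sqlContext.sql("SELECT count(*) FROM people").collect()   // the cached relation is substituted into the plan
sqlContext.uncacheTable("people")
```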

execution

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * An execution engine for relational query plans that runs on top Spark and returns RDDs.                                                                              
 *                                                                                                                                                                      
 * Note that the operators in this package are created automatically by a query planner using a                                                                         
 * [[SQLContext]] and are not intended to be used directly by end users of Spark SQL.  They are                                                                         
 * documented here in order to make it easier for others to understand the performance                                                                                  
 * characteristics of query plans that are generated by Spark SQL.                                                                                                      
 */                                                                                                                                                                     
package object execution

SparkPlan

@DeveloperApi                                                                                                                                                           
abstract class SparkPlan extends QueryPlan[SparkPlan] with Logging with Serializable {

Aggregate

Handles aggregation. Note its execute method, which treats an empty group-by and a non-empty group-by as separate cases.
At heart it is a hash-based grouping approach, as sketched below.
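
Stripped of expressions and code generation, the underlying idea is a per-partition hash map from grouping key to aggregation buffer (a plain Scala illustration, not the operator's code):

```
// e.g. SELECT key, SUM(value) ... GROUP BY key, within a single partition
def hashAggregate(rows: Iterator[(String, Long)]): Iterator[(String, Long)] = {
  val buffers = scala.collection.mutable.HashMap.empty[String, Long]
  for ((key, value) <- rows)
    buffers(key) = buffers.getOrElse(key, 0L) + value
  buffers.iterator
}
```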

GeneratedAggregate

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Alternate version of aggregation that leverages projection and thus code generation.                                                                                 
 * Aggregations are converted into a set of projections from a aggregation buffer tuple back onto                                                                       
 * itself. Currently only used for simple aggregations like SUM, COUNT, or AVERAGE are supported.                                                                       
 *                                                                                                                                                                      
 * @param partial if true then aggregation is done partially on local data without shuffling to                                                                         
 *                ensure all values where `groupingExpressions` are equal are present.                                                                                  
 * @param groupingExpressions expressions that are evaluated to determine grouping.                                                                                     
 * @param aggregateExpressions expressions that are computed for each group.                                                                                            
 * @param child the input data source.                                                                                                                                  
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
case class GeneratedAggregate(                                                                                                                                          
    partial: Boolean,                                                                                                                                                   
    groupingExpressions: Seq[Expression],                                                                                                                               
    aggregateExpressions: Seq[NamedExpression],                                                                                                                         
    child: SparkPlan)                                                                                                                                                   
  extends UnaryNode {

commands

All sorts of commands, such as SET, EXPLAIN, DESCRIBE and the like.

basicOperators

Project, Filter, Sort and so on.
Note that joins are not handled here.

Exchange

This is a very important operator: its job is to change how the data is distributed (partitioned).
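
In plain RDD terms it corresponds roughly to re-partitioning the rows by a key so that downstream operators (a hash join, an aggregation) see the distribution they require; df, the key column index and the partition count are placeholders here:

```
import org.apache.spark.HashPartitioner

val keyed     = df.rdd.map(row => (row(0), row))               // key by the join/grouping column
val exchanged = keyed.partitionBy(new HashPartitioner(200))    // the shuffle that Exchange introduces
```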

ExistingRDD

Obtains a relation from an existing RDD.

Expand

/**                                                                                                                                                                     
 * Apply the all of the GroupExpressions to every input row, hence we will get                                                                                          
 * multiple output rows for a input row.                                                                                                                                
 * @param projections The group of expressions, all of the group expressions should                                                                                     
 *                    output the same schema specified bye the parameter `output`                                                                                       
 * @param output      The output Schema                                                                                                                                 
 * @param child       Child operator                                                                                                                                    
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
case class Expand(

Generate

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Applies a [[catalyst.expressions.Generator Generator]] to a stream of input rows, combining the                                                                      
 * output of each into a new stream of rows.  This operation is similar to a `flatMap` in functional                                                                    
 * programming with one important additional feature, which allows the input rows to be joined with                                                                     
 * their output.                                                                                                                                                        
 * @param join  when true, each output row is implicitly joined with the input tuple that produced                                                                      
 *              it.                                                                                                                                                     
 * @param outer when true, each input row will be output at least once, even if the output of the                                                                       
 *              given `generator` is empty. `outer` has no effect when `join` is false.                                                                                 
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
case class Generate(                                                                                                                                                    
    generator: Generator,                                                                                                                                               
    join: Boolean,                                                                                                                                                      
    outer: Boolean,                                                                                                                                                     
    child: SparkPlan)                                                                                                                                                   
  extends UnaryNode {

LocalTableScan

/**                                                                                                                                                                     
 * Physical plan node for scanning data from a local collection.                                                                                                        
 */                                                                                                                                                                     
case class LocalTableScan(output: Seq[Attribute], rows: Seq[Row]) extends LeafNode {

SparkSqlSerializer

private[sql] class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {

SparkStrategies

This is where the planning logic lives: the strategies listed above that map logical plans onto physical SparkPlans.

joins

Functionally the joins are fairly complete: left joins, semi joins and outer joins are all there, in their various combinations. Roughly, the choice works as follows (the broadcast case is sketched after this list):
1. If both sides already share the required distribution, do the join directly.
2. Otherwise, if one side is small enough, broadcast it.
3. Failing that, fall back to a shuffle.
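
The broadcast case in plain RDD terms (not the operator's code): collect the small side into a map, broadcast it, and stream the large side past it. Here small and large are assumed to be pair RDDs keyed by the join key, and sc is the SparkContext.

```
val smallMap = small.collect().toMap      // assumes the small side fits in the driver's memory
val bcast    = sc.broadcast(smallMap)

val joined = large.mapPartitions { iter =>
  iter.flatMap { case (k, v) => bcast.value.get(k).map(w => (k, (v, w))) }
}
```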

/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Physical execution operators for join operations.                                                                                                                    
 */                                                                                                                                                                     
package object joins {
HashedRelation
/**                                                                                                                                                                     
 * Interface for a hashed relation by some key. Use [[HashedRelation.apply]] to create a concrete                                                                       
 * object.                                                                                                                                                              
 */                                                                                                                                                                     
private[joins] sealed trait HashedRelation {                                                                                                                            
  def get(key: Row): CompactBuffer[Row]                                                                                                                                 
}

The main implementations are GeneralHashedRelation and UniqueKeyHashedRelation.

CartesianProduct
HashOuterJoin

Despite the single name, it covers LeftOuter, RightOuter and FullOuter joins.

Broadcast*

BroadcastHashJoin
BroadcastLeftSemiJoinHash
BroadcastNestedLoopJoin

LeftSemiJoinBNL
/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Using BroadcastNestedLoopJoin to calculate left semi join result when there's no join keys                                                                           
 * for hash join.                                                                                                                                                       
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
case class LeftSemiJoinBNL(                                                                                                                                             
    streamed: SparkPlan, broadcast: SparkPlan, condition: Option[Expression])                                                                                           
  extends BinaryNode {
LeftSemiJoinHash
/**                                                                                                                                                                     
 * :: DeveloperApi ::                                                                                                                                                   
 * Build the right table's join keys into a HashSet, and iteratively go through the left                                                                                
 * table, to find the if join keys are in the Hash set.                                                                                                                 
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
case class LeftSemiJoinHash(                                                                                                                                            
    leftKeys: Seq[Expression],                                                                                                                                          
    rightKeys: Seq[Expression],                                                                                                                                         
    left: SparkPlan,                                                                                                                                                    
    right: SparkPlan) extends BinaryNode with HashJoin {
ShuffledHashJoin
/**
 * :: DeveloperApi ::                                                                                                                                                   
 * Performs an inner hash join of two child relations by first shuffling the data using the join                                                                        
 * keys.                                                                                                                                                                
 */                                                                                                                                                                     
@DeveloperApi                                                                                                                                                           
case class ShuffledHashJoin(                                                                                                                                            
    leftKeys: Seq[Expression],                                                                                                                                          
    rightKeys: Seq[Expression],                                                                                                                                         
    buildSide: BuildSide,                                                                                                                                               
    left: SparkPlan,                                                                                                                                                    
    right: SparkPlan)                                                                                                                                                   
  extends BinaryNode with HashJoin {