Preface

Using the map method of org.apache.spark.sql.Dataset as an example, this article explains how Spark uses Scala implicits, so that you can understand what import spark.implicits._ does in a Spark program.

It provides Java and Scala code with the same functionality, in the hope that the comparison shows how Scala implicits help developers simplify Spark programs.

Introduction to Scala implicits

"Implicit" means implied or understood, though not plainly expressed.

Let us start with an everyday example to understand what it does:

When we speak with each other, we do not explicitly mention everything we talk about, but there are many things that are understood by context.

If, for example, we are going to go out on a motorcycle and I ask you to give me the helmet, you will give me my helmet, even though I have not explicitly said which helmet I mean. You have understood from context that when I asked for the helmet I was referring to mine; it was implicit.

Bringing this to code:

What would happen if we did not have to explicitly pass parameters to a function, because the function understood them from context? Or if we did not have to call a function explicitly and the compiler understood, from the context we are in, that it should? Would we want to use it? That is the concept: everything is in the context, and Scala has implemented the concept of implicits in several different ways.

Types of Scala implicits

The following code shows how each type is used.

implicit parameters

def sendText(body: String)(implicit from: String): String = s"$body, from: $from"

// no implicit value in scope: the parameter is passed explicitly
sendText("Hello")("Robby") // Hello, from: Robby

// with an implicit value in scope, the compiler fills in the parameter
implicit val sender: String = "Tina"
sendText("Hi") // Hi, from: Tina

Implicit conversions (implicit functions)

def createNumber: Int = scala.util.Random.nextInt

/**
 * Without a conversion, the compiler rejects the two assignments below,
 * because a String is expected where an Int is provided. An implicit
 * conversion makes the compiler apply the transformation automatically:
 */
implicit def int2String(number: Int): String = number.toString
val myNumberInString: String = createNumber // expands to int2String(createNumber)
val myText: String = 123                    // expands to int2String(123)

Implicit classes
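
A minimal sketch: an implicit class wraps an existing type and adds extension methods to it; when a method is missing on the original type, the compiler applies the wrapping automatically. RichString below is a made-up example, not a standard class:

// RichString adds a `shout` method to String
implicit class RichString(val s: String) {
  def shout: String = s.toUpperCase + "!"
}

"hello".shout // "HELLO!" -- compiled as new RichString("hello").shout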

Implicit objects
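
A minimal sketch: an implicit object is a singleton implicit value, commonly used as a type-class instance that gets resolved through an implicit parameter. Monoid and IntMonoid below are made-up examples for illustration:

trait Monoid[A] {
  def empty: A
  def combine(x: A, y: A): A
}

implicit object IntMonoid extends Monoid[Int] {
  def empty: Int = 0
  def combine(x: Int, y: Int): Int = x + y
}

// the compiler supplies IntMonoid for the implicit parameter m
def sum[A](xs: List[A])(implicit m: Monoid[A]): A =
  xs.foldLeft(m.empty)(m.combine)

sum(List(1, 2, 3)) // 6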

Implicits in the Spark Scala SDK

We take the map method of org.apache.spark.sql.Dataset as our example. Dataset defines two map methods, as shown below:
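
Abridged from the Spark sources (annotations and docs omitted):

// 1. Scala-friendly variant: requires an implicit Encoder[U] in scope
def map[U : Encoder](func: T => U): Dataset[U]

// 2. Java-friendly variant: the Encoder is passed explicitly
def map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U]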

Understanding spark.implicits._

Below we call the map method through the Java API and then the Scala API, to see both the convenience of Scala implicits and the general-purpose implicits Spark already provides for us (defined in spark.implicits._).

Spark Java API example

ddu.spark.blog.encoders.EncoderJavaSample

package ddu.spark.blog.encoders;

import org.apache.spark.api.java.function.ForeachFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class EncoderJavaSample {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("EncoderJavaSample")
                .master("local[1]")
                .getOrCreate();

        Dataset<Long> longDataset = spark.range(1, 5);

        /*
         * The Java API calls Dataset.map[U](func: MapFunction[T, U], encoder: Encoder[U]):
         * the Encoder is always passed explicitly.
         */
        Dataset<String> stringDataset = longDataset
                .map((MapFunction<Long, String>) num -> String.format("No: %s", num), Encoders.STRING());
        stringDataset.printSchema();
        stringDataset.show();

        Encoder<NumberObject> numberObjectEncoder = Encoders.bean(NumberObject.class);
        Dataset<NumberObject> numberObjectDataset = longDataset
                .map((MapFunction<Long, NumberObject>) NumberObject::new, numberObjectEncoder);
        numberObjectDataset.printSchema();
        numberObjectDataset.show();

        numberObjectDataset.foreach((ForeachFunction<NumberObject>) obj -> System.out.println(obj));

        // the anonymous-class equivalent of the lambda above
        /*
        numberObjectDataset.foreach(new ForeachFunction<NumberObject>() {
            @Override
            public void call(NumberObject numberObject) throws Exception {
                System.out.println(numberObject);
            }
        });
        */

        spark.close();
    }
}
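
Both samples map to a NumberObject bean whose definition is not shown above. A minimal Scala sketch that satisfies Encoders.bean, assuming a single numeric field (the real class may differ), could be:

import scala.beans.BeanProperty

// Hypothetical sketch of the NumberObject bean used by the samples.
// Encoders.bean requires getters/setters and a no-arg constructor.
class NumberObject(@BeanProperty var number: Long) {
  def this() = this(0L)
  override def toString: String = s"NumberObject(number=$number)"
}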

Spark Scala API example

ddu.spark.blog.encoders.EncoderScalaSample

package ddu.spark.blog.encoders

import org.apache.spark.api.java.function.MapFunction
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

object EncoderScalaSample {
  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("EncoderScalaSample")
      .getOrCreate()

    notUseImplicits(spark)
    useSelfDefinedImplicits(spark)
    useSparkImplicits(spark)

    spark.close()
  }

  /**
   * Uses the implicits Spark provides, and calls
   * Dataset.map[U : Encoder](func: T => U): Dataset[U]
   */
  def useSparkImplicits(spark: SparkSession): Unit = {

    println("=== useSparkImplicits ===")

    val longDataset = spark.range(1, 5)

    // brings into scope, among others, this definition from SQLImplicits:
    // implicit def newStringEncoder: Encoder[String] = Encoders.STRING
    import spark.implicits._
    val stringDataset = longDataset.map(num => s"No: $num")
    stringDataset.printSchema()
    stringDataset.show()
  }

  /**
   * Defines its own implicit Encoders (like the Java API, we supply the
   * Encoder ourselves), and calls
   * Dataset.map[U : Encoder](func: T => U): Dataset[U]
   */
  def useSelfDefinedImplicits(spark: SparkSession): Unit = {

    println("=== useSelfDefinedImplicits ===")

    val longDataset = spark.range(1, 5)

    // define the implicit parameter the map call needs: newStringEncoder
    implicit def newStringEncoder: Encoder[String] = Encoders.STRING
    val stringDataset = longDataset.map(num => s"No: $num")
    // equivalent call with the encoder passed explicitly:
    //val stringDataset = longDataset.map(num => s"No: $num")(Encoders.STRING)
    stringDataset.printSchema()
    stringDataset.show()

    // define the implicit parameter: numberObjectEncoder
    implicit def numberObjectEncoder: Encoder[NumberObject] = Encoders.bean(classOf[NumberObject])
    val numberObjectDataset = longDataset.map(num => new NumberObject(num))
    numberObjectDataset.printSchema()
    numberObjectDataset.show()
  }

  /**
   * Does not use Scala implicits at all; in the same style as the Java API, it calls
   * Dataset.map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U]
   */
  def notUseImplicits(spark: SparkSession): Unit = {

    println("=== notUseImplicits ===")

    val longDataset = spark.range(1, 5)

    val stringDataset: Dataset[String] = longDataset.map(
      new MapFunction[java.lang.Long, String] {
        override def call(t: java.lang.Long): String = s"No: $t"
      },
      Encoders.STRING
    )
    stringDataset.printSchema()
    stringDataset.show()

    val numberObjectDataset: Dataset[NumberObject] = longDataset.map(
      new MapFunction[java.lang.Long, NumberObject] {
        override def call(t: java.lang.Long): NumberObject = new NumberObject(t)
      },
      Encoders.bean(classOf[NumberObject])
    )
    numberObjectDataset.printSchema()
    numberObjectDataset.show()
  }

}
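
To close the loop: the encoder that useSparkImplicits picks up comes from org.apache.spark.sql.SQLImplicits, which spark.implicits extends. A simplified excerpt (abridged; see the Spark sources for the full list):

// Simplified excerpt of org.apache.spark.sql.SQLImplicits:
abstract class SQLImplicits {
  implicit def newStringEncoder: Encoder[String] = Encoders.STRING
  implicit def newIntEncoder: Encoder[Int] = Encoders.scalaInt
  implicit def newLongEncoder: Encoder[Long] = Encoders.scalaLong
  // ... plus encoders for other primitives, products (case classes), sequences, etc.
}

This is why, after import spark.implicits._, calling longDataset.map(num => s"No: $num") compiles: the compiler fills in the implicit Encoder[String] parameter with newStringEncoder, exactly as we did by hand in useSelfDefinedImplicits.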