Spark DataFrame sql函数总结

Spark DataFrame内置了200+个函数供使用,包括聚合、集合、时间、字符串、数学、排序、窗口、UDF等多类函数,是个十分齐全的百宝箱,灵活运用可以事半功倍。
用之前需要导入sql函数

import org.apache.spark.sql.functions._
自定义UDF函数

如果觉得百宝箱不够用,需要自己造个轮子,可以用udf来实现

// 自定义udf的函数
  val ageFiled = (age: String) => {
    val ageInt = age.toInt
    ageInt match {
      case age if age <= 12 => "1"
      case age if age >= 13 && age <= 17 => "2"
      case age if age >= 18 && age <= 24 => "3"
      case age if age >= 25 && age <= 30 => "4"
      case age if age >= 31 && age <= 35 => "5"
      case age if age >= 36 && age <= 40 => "6"
      case age if age >= 41 && age <= 50 => "7"
      case age if age >= 51 && age <= 60 => "8"
      case age if age >= 61 => "9"
    }
  }

在具体使用的时候,对这个自定义的udf函数进行声明后就可以和内置函数一样的方式使用啦,例如新增一列年龄段示例如下

val getAgeField = udf(ageFiled)
val result = df.withColumn("agefield", getAgeField($"age"))
窗口函数

窗口函数和Hive里面的一样,在Spark DataFrame中使用,主要有两个步骤。
第一,定义窗口的特征。首先定义分组(partitionBy),然后定义排序(orderBy),最后定义窗口大小(rowsBetween、rangeBetween)
第二,使用窗口函数。现阶段有2类函数可以作为窗口函数,聚集函数(max、min、avg…),排序函数(rank、dense_rank、percent_rank、ntile、row_number)

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Test {

  def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("Test")
          .master("local")
          .getOrCreate()

        import spark.implicits._
        val df = spark.createDataset(Seq(
          ("2010-07-22", "猪聪明", 1, "男", 51, 84, 73, "四川"),
          ("2010-07-23", "猪坚强", 2, "男", 89, 85, 88, "广东"),
          ("2010-07-24", "猪勇敢", 1, "男", 40, 86, 78, "广东"),
          ("2010-07-22", "猪能干", 2, "男", 81, 85, 56, "湖北"),
          ("2010-07-23", "猪豪杰", 3, "男", 77, 82, 93, "四川"),
          ("2010-07-21", "猪可爱", 3, "女", 11, 82, 70, "湖北"),
          ("2010-07-22", "猪温柔", 4, "女", 42, 86, 34, "湖北"),
          ("2010-07-23", "猪美丽", 4, "女", 54, 84, 92, "湖北"),
          ("2010-07-25", "猪优雅", 4, "女", 40, 91, 68, "四川"),
          ("2010-07-26", "猪大方", 4, "女", 23, 97, 68, "广东"))
        ).toDF("birthday", "name", "class", "sex", "math", "eng", "tech", "address")

    // 按男女分组的数学平均值
    val windowBySex = Window.partitionBy("sex")
    df.withColumn("sex_avg", avg("math").over(windowBySex)).show(false)

  }

}

输出结果

+----------+----+-----+---+----+---+----+-------+-------+
|birthday  |name|class|sex|math|eng|tech|address|sex_avg|
+----------+----+-----+---+----+---+----+-------+-------+
|2010-07-22|猪聪明 |1    |男  |51  |84 |73  |四川     |67.6   |
|2010-07-23|猪坚强 |2    |男  |89  |85 |88  |广东     |67.6   |
|2010-07-24|猪勇敢 |1    |男  |40  |86 |78  |广东     |67.6   |
|2010-07-22|猪能干 |2    |男  |81  |85 |56  |湖北     |67.6   |
|2010-07-23|猪豪杰 |3    |男  |77  |82 |93  |四川     |67.6   |
|2010-07-21|猪可爱 |3    |女  |11  |82 |70  |湖北     |34.0   |
|2010-07-22|猪温柔 |4    |女  |42  |86 |34  |湖北     |34.0   |
|2010-07-23|猪美丽 |4    |女  |54  |84 |92  |湖北     |34.0   |
|2010-07-25|猪优雅 |4    |女  |40  |91 |68  |四川     |34.0   |
|2010-07-26|猪大方 |4    |女  |23  |97 |68  |广东     |34.0   |
+----------+----+-----+---+----+---+----+-------+-------+

按地域分组求每组数据成绩最高值

// 按地域分组看每组英语最高值
    val windowByAddress = Window.partitionBy("address")
    df.withColumn("max_eng_addr", max("eng").over(windowByAddress)).show(false)

结果输出

+----------+----+-----+---+----+---+----+-------+------------+
|birthday  |name|class|sex|math|eng|tech|address|max_eng_addr|
+----------+----+-----+---+----+---+----+-------+------------+
|2010-07-23|猪坚强 |2    |男  |89  |85 |88  |广东     |97          |
|2010-07-24|猪勇敢 |1    |男  |40  |86 |78  |广东     |97          |
|2010-07-26|猪大方 |4    |女  |23  |97 |68  |广东     |97          |
|2010-07-22|猪能干 |2    |男  |81  |85 |56  |湖北     |86          |
|2010-07-21|猪可爱 |3    |女  |11  |82 |70  |湖北     |86          |
|2010-07-22|猪温柔 |4    |女  |42  |86 |34  |湖北     |86          |
|2010-07-23|猪美丽 |4    |女  |54  |84 |92  |湖北     |86          |
|2010-07-22|猪聪明 |1    |男  |51  |84 |73  |四川     |91          |
|2010-07-23|猪豪杰 |3    |男  |77  |82 |93  |四川     |91          |
|2010-07-25|猪优雅 |4    |女  |40  |91 |68  |四川     |91          |
+----------+----+-----+---+----+---+----+-------+------------+

按班级分组看每组数学成绩排序

// 按班级分组看每组数学成绩排序(需要先对window分组进行orderBy排序)
    val windowByClass = Window.partitionBy("class").orderBy($"math".desc)
    df.withColumn("rank_math_class", rank().over(windowByClass)).show(false)

结果输出

+----------+----+-----+---+----+---+----+-------+---------------+
|birthday  |name|class|sex|math|eng|tech|address|rank_math_class|
+----------+----+-----+---+----+---+----+-------+---------------+
|2010-07-22|猪聪明 |1    |男  |51  |84 |73  |四川     |1              |
|2010-07-24|猪勇敢 |1    |男  |40  |86 |78  |广东     |2              |
|2010-07-23|猪豪杰 |3    |男  |77  |82 |93  |四川     |1              |
|2010-07-21|猪可爱 |3    |女  |11  |82 |70  |湖北     |2              |
|2010-07-23|猪美丽 |4    |女  |54  |84 |92  |湖北     |1              |
|2010-07-22|猪温柔 |4    |女  |42  |86 |34  |湖北     |2              |
|2010-07-25|猪优雅 |4    |女  |40  |91 |68  |四川     |3              |
|2010-07-26|猪大方 |4    |女  |23  |97 |68  |广东     |4              |
|2010-07-23|猪坚强 |2    |男  |89  |85 |88  |广东     |1              |
|2010-07-22|猪能干 |2    |男  |81  |85 |56  |湖北     |2              |
+----------+----+-----+---+----+---+----+-------+---------------+

按照性别分组看每组数学成绩排序,新增一列按性别分组对每组进行分桶并编号

// 按照性别分组看每组数学成绩排序
    // 按性别分组对每组进行分桶并编号
    val windowBySex1 = Window.partitionBy("sex").orderBy($"math".desc)
    df.withColumn("dense_rank_math_class", dense_rank().over(windowBySex1))
      .withColumn("ntile_3", ntile(3).over(windowBySex1))
      .withColumn("math_no", row_number().over(windowBySex1))
      .withColumn("math_no_persent", percent_rank().over(windowBySex1))
      .show(false)

输出结果

+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
|birthday  |name|class|sex|math|eng|tech|address|dense_rank_math_class|ntile_3|math_no|math_no_persent|
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
|2010-07-23|猪坚强 |2    |男  |89  |85 |88  |广东     |1                    |1      |1      |0.0            |
|2010-07-22|猪能干 |2    |男  |81  |85 |56  |湖北     |2                    |1      |2      |0.25           |
|2010-07-23|猪豪杰 |3    |男  |77  |82 |93  |四川     |3                    |2      |3      |0.5            |
|2010-07-22|猪聪明 |1    |男  |51  |84 |73  |四川     |4                    |2      |4      |0.75           |
|2010-07-24|猪勇敢 |1    |男  |40  |86 |78  |广东     |5                    |3      |5      |1.0            |
|2010-07-23|猪美丽 |4    |女  |54  |84 |92  |湖北     |1                    |1      |1      |0.0            |
|2010-07-22|猪温柔 |4    |女  |42  |86 |34  |湖北     |2                    |1      |2      |0.25           |
|2010-07-25|猪优雅 |4    |女  |40  |91 |68  |四川     |3                    |2      |3      |0.5            |
|2010-07-26|猪大方 |4    |女  |23  |97 |68  |广东     |4                    |2      |4      |0.75           |
|2010-07-21|猪可爱 |3    |女  |11  |82 |70  |湖北     |5                    |3      |5      |1.0            |
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+

按照性别分组取每组数学成绩第一名和最后一名

val windowBySex2 = Window.partitionBy("sex").orderBy($"math".desc)
    df.withColumn("number_math_sex", row_number().over(windowBySex2))
      .filter($"number_math_sex" <= 1)
      .show(false)

    // 这里使用selectExpr完成同样的功能,按照性别分组求每组数学成绩最高值
    df.selectExpr("sex","max(math) over (partition by sex order by math desc )  as max_math")
      .distinct()
      .show(false)

输出结果

+----------+----+-----+---+----+---+----+-------+---------------+
|birthday  |name|class|sex|math|eng|tech|address|number_math_sex|
+----------+----+-----+---+----+---+----+-------+---------------+
|2010-07-23|猪坚强 |2    |男  |89  |85 |88  |广东     |1              |
|2010-07-23|猪美丽 |4    |女  |54  |84 |92  |湖北     |1              |
+----------+----+-----+---+----+---+----+-------+---------------+

+---+--------+
|sex|max_math|
+---+--------+
|男  |89      |
|女  |54      |
+---+--------+
结构化数据类型

对于结构化数据类型的访问示例,可以使用DataSet方式来处理

import org.apache.spark.sql.SparkSession

object Test {

  case class Score(math: Int, eng: Int, tech: Int)

  case class Info(name: String, birthday: String, classNo: Int, sec: String, score: Score, interest: Interest)

  case class Interest(list: Seq[String])

  case class AddressInfo(name: String, address: String)

  case class Student(name: String, address: String, birthday: String, classNo: Int, sec: String, score: Score, interest: Interest)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local")
      .getOrCreate()

    import spark.implicits._
    val df = spark.createDataFrame(Seq(
      Info("猪聪明", "2010-07-22", 1, "男", Score(51, 84, 73), Interest(Seq("弹琴", "跳舞", "画画"))),
      Info("猪坚强", "2010-07-23", 2, "男", Score(89, 85, 88), Interest(Seq("跳舞", "画画"))),
      Info("猪勇敢", "2010-07-24", 1, "男", Score(40, 86, 78), Interest(Seq("弹琴", "跳舞"))),
      Info("猪能干", "2010-07-22", 2, "男", Score(81, 85, 56), Interest(Seq("弹琴"))),
      Info("猪豪杰", "2010-07-23", 3, "男", Score(77, 82, 93), Interest(Seq("画画"))),
      Info("猪可爱", "2010-07-21", 3, "女", Score(11, 82, 70), Interest(Seq("画画"))),
      Info("猪温柔", "2010-07-22", 4, "女", Score(42, 86, 34), Interest(Seq("弹琴"))),
      Info("猪美丽", "2010-07-23", 4, "女", Score(54, 84, 92), Interest(Seq("跳舞"))),
      Info("猪优雅", "2010-07-25", 4, "女", Score(40, 91, 68), Interest(Seq("弹琴", "跳舞", "画画"))),
      Info("猪大方", "2010-07-26", 4, "女", Score(23, 97, 68), Interest(Seq("跳舞", "画画")))
    ))

    val address = spark.createDataFrame(
      Seq(
        AddressInfo("猪聪明", "四川"),
        AddressInfo("猪坚强", "广东"),
        AddressInfo("猪勇敢", "广东"),
        AddressInfo("猪能干", "湖北"),
        AddressInfo("猪豪杰", "四川"),
        AddressInfo("猪可爱", "湖北"),
        AddressInfo("猪温柔", "湖北"),
        AddressInfo("猪美丽", "湖北"),
        AddressInfo("猪优雅", "四川"),
        AddressInfo("猪大方", "广东")
      )
    )

    // 对于结构化数据类型的访问
    df.join(address, Seq("name"), "inner")
      .select("name", "score.math", "address")
      .show(false)

    // 可以直接转为DataSet,这样可以使用其属性
    val addressDS = address.as[AddressInfo]
    val ds = df.as[Info]
    ds.join(addressDS, Seq("name"), "inner")
      .as[Student] // 这样直接转为DataSet对象
      .map(x => (x.name, x.address))
      .show(false)
  }

}
参考
  1. org.apache.spark.sql.functions官网
  2. Spark API 全集(2):Spark SQL 函数全集
  3. Spark 实现简单移动平均值(SMA) - 窗口函数(Window Function)