A Summary of Spark DataFrame SQL Functions
Spark DataFrame ships with 200+ built-in functions, covering aggregates, collections, dates and times, strings, math, sorting, windows, UDFs, and more. It is a well-stocked toolbox, and using it flexibly can save a great deal of effort.
Before using them, import the sql functions:
import org.apache.spark.sql.functions._
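As a quick illustration, here is a minimal sketch using a few of the built-ins on made-up data (the column names and values here are arbitrary):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
val spark = SparkSession.builder().appName("Demo").master("local").getOrCreate()
import spark.implicits._
// Toy data: name, birthday, score
val demo = Seq(("alice", "2010-07-22", 3.14159), ("bob", "2010-07-25", 2.71828))
  .toDF("name", "birthday", "score")
demo.select(
  upper($"name").as("name_upper"),                      // string function
  datediff(current_date(), $"birthday").as("age_days"), // datetime function
  round($"score", 2).as("score_2dp")                    // math function
).show(false)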
Custom UDFs
If the toolbox is not enough and you need to roll your own, udf is the way to do it:
// Custom UDF: map an age (given as a string) to an age-band code
val ageField = (age: String) => {
  val ageInt = age.toInt
  ageInt match {
    case a if a <= 12 => "1"
    case a if a >= 13 && a <= 17 => "2"
    case a if a >= 18 && a <= 24 => "3"
    case a if a >= 25 && a <= 30 => "4"
    case a if a >= 31 && a <= 35 => "5"
    case a if a >= 36 && a <= 40 => "6"
    case a if a >= 41 && a <= 50 => "7"
    case a if a >= 51 && a <= 60 => "8"
    case a if a >= 61 => "9"
  }
}
Declare the UDF once and it can then be used just like a built-in function. For example, adding an age-band column:
val getAgeField = udf(ageField)
val result = df.withColumn("agefield", getAgeField($"age"))
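If you also want to call the UDF from SQL strings (selectExpr or spark.sql), register it under a name first. A minimal sketch reusing the ageField function above (the name age_band is our choice, and df is assumed to have name and age columns):
// Register the same function for use in SQL expressions
spark.udf.register("age_band", ageField)
df.selectExpr("name", "age_band(age) as agefield").show(false)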
Window functions
Window functions behave the same as in Hive. Using them on a Spark DataFrame takes two steps.
First, define the window specification: the grouping (partitionBy), then the ordering (orderBy), and finally the frame (rowsBetween, rangeBetween); a frame example follows the first output below.
Second, apply a function over that window. Two kinds of functions can currently be used as window functions: aggregate functions (max, min, avg, ...) and ranking functions (rank, dense_rank, percent_rank, ntile, row_number).
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local")
      .getOrCreate()
    import spark.implicits._
    val df = spark.createDataset(Seq(
      ("2010-07-22", "猪聪明", 1, "男", 51, 84, 73, "四川"),
      ("2010-07-23", "猪坚强", 2, "男", 89, 85, 88, "广东"),
      ("2010-07-24", "猪勇敢", 1, "男", 40, 86, 78, "广东"),
      ("2010-07-22", "猪能干", 2, "男", 81, 85, 56, "湖北"),
      ("2010-07-23", "猪豪杰", 3, "男", 77, 82, 93, "四川"),
      ("2010-07-21", "猪可爱", 3, "女", 11, 82, 70, "湖北"),
      ("2010-07-22", "猪温柔", 4, "女", 42, 86, 34, "湖北"),
      ("2010-07-23", "猪美丽", 4, "女", 54, 84, 92, "湖北"),
      ("2010-07-25", "猪优雅", 4, "女", 40, 91, 68, "四川"),
      ("2010-07-26", "猪大方", 4, "女", 23, 97, 68, "广东"))
    ).toDF("birthday", "name", "class", "sex", "math", "eng", "tech", "address")
    // Average math score within each sex group
    val windowBySex = Window.partitionBy("sex")
    df.withColumn("sex_avg", avg("math").over(windowBySex)).show(false)
  }
}
Output:
+----------+----+-----+---+----+---+----+-------+-------+
|birthday |name|class|sex|math|eng|tech|address|sex_avg|
+----------+----+-----+---+----+---+----+-------+-------+
|2010-07-22|猪聪明 |1 |男 |51 |84 |73 |四川 |67.6 |
|2010-07-23|猪坚强 |2 |男 |89 |85 |88 |广东 |67.6 |
|2010-07-24|猪勇敢 |1 |男 |40 |86 |78 |广东 |67.6 |
|2010-07-22|猪能干 |2 |男 |81 |85 |56 |湖北 |67.6 |
|2010-07-23|猪豪杰 |3 |男 |77 |82 |93 |四川 |67.6 |
|2010-07-21|猪可爱 |3 |女 |11 |82 |70 |湖北 |34.0 |
|2010-07-22|猪温柔 |4 |女 |42 |86 |34 |湖北 |34.0 |
|2010-07-23|猪美丽 |4 |女 |54 |84 |92 |湖北 |34.0 |
|2010-07-25|猪优雅 |4 |女 |40 |91 |68 |四川 |34.0 |
|2010-07-26|猪大方 |4 |女 |23 |97 |68 |广东 |34.0 |
+----------+----+-----+---+----+---+----+-------+-------+
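Step 1 mentioned the frame clause (rowsBetween / rangeBetween), which none of the examples in this post use, so here is a sketch: a 3-row moving average of math over the df defined above (the window and column names are our choice):
// Frame example: average over the current row and the two preceding rows,
// ordered by birthday within each sex group
val movingWindow = Window
  .partitionBy("sex")
  .orderBy("birthday")
  .rowsBetween(-2, Window.currentRow)
df.withColumn("math_sma3", avg("math").over(movingWindow)).show(false)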
Group by address and take the highest English score in each group:
// Highest eng score within each address group
val windowByAddress = Window.partitionBy("address")
df.withColumn("max_eng_addr", max("eng").over(windowByAddress)).show(false)
Output:
+----------+----+-----+---+----+---+----+-------+------------+
|birthday |name|class|sex|math|eng|tech|address|max_eng_addr|
+----------+----+-----+---+----+---+----+-------+------------+
|2010-07-23|猪坚强 |2 |男 |89 |85 |88 |广东 |97 |
|2010-07-24|猪勇敢 |1 |男 |40 |86 |78 |广东 |97 |
|2010-07-26|猪大方 |4 |女 |23 |97 |68 |广东 |97 |
|2010-07-22|猪能干 |2 |男 |81 |85 |56 |湖北 |86 |
|2010-07-21|猪可爱 |3 |女 |11 |82 |70 |湖北 |86 |
|2010-07-22|猪温柔 |4 |女 |42 |86 |34 |湖北 |86 |
|2010-07-23|猪美丽 |4 |女 |54 |84 |92 |湖北 |86 |
|2010-07-22|猪聪明 |1 |男 |51 |84 |73 |四川 |91 |
|2010-07-23|猪豪杰 |3 |男 |77 |82 |93 |四川 |91 |
|2010-07-25|猪优雅 |4 |女 |40 |91 |68 |四川 |91 |
+----------+----+-----+---+----+---+----+-------+------------+
Group by class and rank math scores within each group:
// Rank math within each class (the window must first be ordered with orderBy)
val windowByClass = Window.partitionBy("class").orderBy($"math".desc)
df.withColumn("rank_math_class", rank().over(windowByClass)).show(false)
Output:
+----------+----+-----+---+----+---+----+-------+---------------+
|birthday |name|class|sex|math|eng|tech|address|rank_math_class|
+----------+----+-----+---+----+---+----+-------+---------------+
|2010-07-22|猪聪明 |1 |男 |51 |84 |73 |四川 |1 |
|2010-07-24|猪勇敢 |1 |男 |40 |86 |78 |广东 |2 |
|2010-07-23|猪豪杰 |3 |男 |77 |82 |93 |四川 |1 |
|2010-07-21|猪可爱 |3 |女 |11 |82 |70 |湖北 |2 |
|2010-07-23|猪美丽 |4 |女 |54 |84 |92 |湖北 |1 |
|2010-07-22|猪温柔 |4 |女 |42 |86 |34 |湖北 |2 |
|2010-07-25|猪优雅 |4 |女 |40 |91 |68 |四川 |3 |
|2010-07-26|猪大方 |4 |女 |23 |97 |68 |广东 |4 |
|2010-07-23|猪坚强 |2 |男 |89 |85 |88 |广东 |1 |
|2010-07-22|猪能干 |2 |男 |81 |85 |56 |湖北 |2 |
+----------+----+-----+---+----+---+----+-------+---------------+
Group by sex, rank math scores within each group, and add columns that bucket each group (ntile) and number its rows (row_number). Note that rank leaves gaps after ties while dense_rank does not (a tiny demo follows the output below), and percent_rank computes (rank - 1) / (partition size - 1).
// Rank math scores within each sex group
// Bucket each group into 3 tiles and number the rows
val windowBySex1 = Window.partitionBy("sex").orderBy($"math".desc)
df.withColumn("dense_rank_math_sex", dense_rank().over(windowBySex1))
  .withColumn("ntile_3", ntile(3).over(windowBySex1))
  .withColumn("math_no", row_number().over(windowBySex1))
  .withColumn("math_no_percent", percent_rank().over(windowBySex1))
  .show(false)
Output:
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
|birthday  |name|class|sex|math|eng|tech|address|dense_rank_math_sex  |ntile_3|math_no|math_no_percent|
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
|2010-07-23|猪坚强 |2 |男 |89 |85 |88 |广东 |1 |1 |1 |0.0 |
|2010-07-22|猪能干 |2 |男 |81 |85 |56 |湖北 |2 |1 |2 |0.25 |
|2010-07-23|猪豪杰 |3 |男 |77 |82 |93 |四川 |3 |2 |3 |0.5 |
|2010-07-22|猪聪明 |1 |男 |51 |84 |73 |四川 |4 |2 |4 |0.75 |
|2010-07-24|猪勇敢 |1 |男 |40 |86 |78 |广东 |5 |3 |5 |1.0 |
|2010-07-23|猪美丽 |4 |女 |54 |84 |92 |湖北 |1 |1 |1 |0.0 |
|2010-07-22|猪温柔 |4 |女 |42 |86 |34 |湖北 |2 |1 |2 |0.25 |
|2010-07-25|猪优雅 |4 |女 |40 |91 |68 |四川 |3 |2 |3 |0.5 |
|2010-07-26|猪大方 |4 |女 |23 |97 |68 |广东 |4 |2 |4 |0.75 |
|2010-07-21|猪可爱 |3 |女 |11 |82 |70 |湖北 |5 |3 |5 |1.0 |
+----------+----+-----+---+----+---+----+-------+---------------------+-------+-------+---------------+
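The sample data happens to have no ties, so rank and dense_rank agree on it. A tiny sketch with made-up tied scores shows the difference:
// Made-up scores with a tie at 81; a single unpartitioned window is fine for demo data
val ties = Seq(("a", 89), ("b", 81), ("c", 81), ("d", 77)).toDF("name", "math")
val w = Window.orderBy($"math".desc)
ties.withColumn("rank", rank().over(w))           // 1, 2, 2, 4
  .withColumn("dense_rank", dense_rank().over(w)) // 1, 2, 2, 3
  .show(false)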
Group by sex and take the top math score in each group:
val windowBySex2 = Window.partitionBy("sex").orderBy($"math".desc)
df.withColumn("number_math_sex", row_number().over(windowBySex2))
  .filter($"number_math_sex" <= 1)
  .show(false)
// The same per-group maximum with selectExpr
// (no ORDER BY is needed here, since max is taken over the whole partition)
df.selectExpr("sex", "max(math) over (partition by sex) as max_math")
  .distinct()
  .show(false)
Output:
+----------+----+-----+---+----+---+----+-------+---------------+
|birthday |name|class|sex|math|eng|tech|address|number_math_sex|
+----------+----+-----+---+----+---+----+-------+---------------+
|2010-07-23|猪坚强 |2 |男 |89 |85 |88 |广东 |1 |
|2010-07-23|猪美丽 |4 |女 |54 |84 |92 |湖北 |1 |
+----------+----+-----+---+----+---+----+-------+---------------+
+---+--------+
|sex|max_math|
+---+--------+
|男 |89 |
|女 |54 |
+---+--------+
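For a plain per-group maximum like this, a window is not actually required; an ordinary aggregation produces the same two rows (a sketch on the same df):
// Equivalent aggregation without a window function
df.groupBy("sex").agg(max("math").as("max_math")).show(false)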
Structured data types
Nested (structured) column types can be accessed with dot notation, and the Dataset API gives typed access to them:
import org.apache.spark.sql.SparkSession

object Test {
  case class Score(math: Int, eng: Int, tech: Int)
  case class Interest(list: Seq[String])
  case class Info(name: String, birthday: String, classNo: Int, sex: String, score: Score, interest: Interest)
  case class AddressInfo(name: String, address: String)
  case class Student(name: String, address: String, birthday: String, classNo: Int, sex: String, score: Score, interest: Interest)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Test")
      .master("local")
      .getOrCreate()
    import spark.implicits._
    val df = spark.createDataFrame(Seq(
      Info("猪聪明", "2010-07-22", 1, "男", Score(51, 84, 73), Interest(Seq("弹琴", "跳舞", "画画"))),
      Info("猪坚强", "2010-07-23", 2, "男", Score(89, 85, 88), Interest(Seq("跳舞", "画画"))),
      Info("猪勇敢", "2010-07-24", 1, "男", Score(40, 86, 78), Interest(Seq("弹琴", "跳舞"))),
      Info("猪能干", "2010-07-22", 2, "男", Score(81, 85, 56), Interest(Seq("弹琴"))),
      Info("猪豪杰", "2010-07-23", 3, "男", Score(77, 82, 93), Interest(Seq("画画"))),
      Info("猪可爱", "2010-07-21", 3, "女", Score(11, 82, 70), Interest(Seq("画画"))),
      Info("猪温柔", "2010-07-22", 4, "女", Score(42, 86, 34), Interest(Seq("弹琴"))),
      Info("猪美丽", "2010-07-23", 4, "女", Score(54, 84, 92), Interest(Seq("跳舞"))),
      Info("猪优雅", "2010-07-25", 4, "女", Score(40, 91, 68), Interest(Seq("弹琴", "跳舞", "画画"))),
      Info("猪大方", "2010-07-26", 4, "女", Score(23, 97, 68), Interest(Seq("跳舞", "画画")))
    ))
    val address = spark.createDataFrame(Seq(
      AddressInfo("猪聪明", "四川"),
      AddressInfo("猪坚强", "广东"),
      AddressInfo("猪勇敢", "广东"),
      AddressInfo("猪能干", "湖北"),
      AddressInfo("猪豪杰", "四川"),
      AddressInfo("猪可爱", "湖北"),
      AddressInfo("猪温柔", "湖北"),
      AddressInfo("猪美丽", "湖北"),
      AddressInfo("猪优雅", "四川"),
      AddressInfo("猪大方", "广东")
    ))
    // Accessing a nested (struct) column with dot notation
    df.join(address, Seq("name"), "inner")
      .select("name", "score.math", "address")
      .show(false)
    // Converting to a Dataset gives typed access to the fields
    val addressDS = address.as[AddressInfo]
    val ds = df.as[Info]
    ds.join(addressDS, Seq("name"), "inner")
      .as[Student] // the joined rows back as a typed Dataset
      .map(x => (x.name, x.address))
      .show(false)
  }
}
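Array-typed nested fields can be unpacked with explode; a sketch over the df above (the hobby column name is our choice):
import org.apache.spark.sql.functions.explode
// One output row per (name, hobby) pair
df.select($"name", explode($"interest.list").as("hobby")).show(false)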