Table of Contents

  • 1. Spark SQL
  • 1.1 Creating a DataFrame
  • 1.1.1 From a case class
  • 1.1.2 From a JSON data file
  • 1.1.3 Converting RDD[Row] to a DataFrame (important)
  • 1.2 Creating a Dataset
  • 2. Spark SQL Syntax
  • 2.1 DSL-style programming
  • 2.2 SQL operations
  • 3. Loading and Saving Data
  • 3.1 Loading files with Spark SQL
  • 3.2 Saving files with Spark SQL
  • 4. UDF


1. Spark SQL

On top of RDDs, Spark SQL abstracts Dataset and DataFrame, which provide RDD-like processing capabilities. Data can be processed in two ways: strongly typed operations (on objects) and untyped operations (on SQL-style columns).

import org.apache.spark.sql.SparkSession
val spark = SparkSession
      .builder()
      .appName("Spark SQL data sources example")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

Differences between Dataset and DataFrame

DataFrame is a special form of Dataset. A Dataset can hold elements of any object type, whereas the elements of a DataFrame can only be of type Row. A Row resembles a database row, and querying it feels very much like working with a traditional ResultSet. Each Row is one element of the DataFrame, analogous to one record in a database table, and the values in a Row can be accessed by index or by column name.

RDD -> Dataset (an enhanced RDD) -> DataFrame (a special Dataset[Row])
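A minimal sketch of how the three abstractions convert into one another, assuming the SparkSession built above and a simple Person case class like the one defined in 1.1.1:

import spark.implicits._
case class Person(name: String, age: Long)

// RDD of case-class objects -> strongly typed Dataset[Person]
val personDS = spark.sparkContext
  .makeRDD(List(Person("zs", 20), Person("ls", 30)))
  .toDS()

// Dataset[Person] -> DataFrame (a Dataset[Row]); column names come from the case class fields
val personDF = personDS.toDF()

// DataFrame -> Dataset[Person] again, by binding the Row schema back to the case class
val personDS2 = personDF.as[Person]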

1.1 Creating a DataFrame

1.1.1 From a case class

case class Person(name: String, age: Long)   // define a case class; it serves as the table schema

import spark.implicits._   // needed for toDF()/toDS()

val peopleDF = spark.sparkContext
  .textFile("src/main/resources/people.txt")
  .map(_.split(","))
  //.map(x => (x(0), x(1).trim.toInt))
  //.toDF("name", "age")
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()
      
// Alternatively, build the DataFrame from an RDD of tuples
val rdd1 = sc.makeRDD(List(("1001","zhangsan","f"),("1002","lisi","f")))
val df = rdd1.toDF("id","name","gender")

1.1.2 From a JSON data file

val df = spark.read.json("src/main/resources/people.json")

1.1.3 Converting RDD[Row] to a DataFrame (important)

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

case class Person(id: Int, name: String)   // note: this Person has id and name, unlike the one in 1.1.1

val rdd = spark.sparkContext.makeRDD(List(Person(1, "zs"), Person(2, "ls")))
val rdd2: RDD[Row] = rdd.map(person => Row(person.id, person.name))

// the StructType defines the schema of the result table
val structType = new StructType()
  // column name + data type + nullable flag + column comment
  .add("id", IntegerType, true, "comment 1")
  .add(name = "name", StringType, true, "comment 2")

// RDD[Row]  ---> DataFrame
val df = spark.createDataFrame(rdd2, structType)
df.show()
df.printSchema()

1.2 Creating a Dataset

// 1. From a case class (the Person(name, age) case class from 1.1.1)
val caseClassDS = Seq(Person("Andy", 32)).toDS()
// 2. From a sequence of primitive values
val primitiveDS = Seq(1, 2, 3).toDS()
// 3. From JSON data, mapped onto the case class
val path = "src/main/resources/people.json"
val peopleDS = spark.read.json(path).as[Person]

2. Spark SQL Syntax

2.1 DSL-style programming

  1. The getAs and getString methods of DataFrame
{"name":"Michael","age":10, "adress": "beijin"}
{"name":"Andy", "age":30, "adress": "beijin"}
{"name":"Justin", "age":19, "adress": "beijin"}
import spark.implicits._   // encoders are needed to call map on a DataFrame
val peopleDF = spark.read.json("./people.json")
// pull out the adress column of the DataFrame
peopleDF.map(x => x.getAs[String]("adress")).show()
peopleDF.map(x => x.getString(0)).show()

The two calls above print the same result. When Spark infers a schema from JSON, the fields of the resulting DataFrame are ordered alphabetically by field name rather than by their position in the JSON text, so "adress" has fieldIndex 0, "age" has fieldIndex 1, and "name" has fieldIndex 2.
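This ordering can be checked directly against the schema; a quick verification using the peopleDF above:

// printSchema shows the inferred fields; for JSON they come out in alphabetical order
peopleDF.printSchema()
// root
//  |-- adress: string (nullable = true)
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// fieldIndex returns the positions used by getString(i) / getAs[T](i)
println(peopleDF.schema.fieldIndex("adress"))  // 0
println(peopleDF.schema.fieldIndex("age"))     // 1
println(peopleDF.schema.fieldIndex("name"))    // 2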

  2. Conditional filtering: filter | where (identical usage) | limit
df.where("name='zs' or salary > 18000")
df.orderBy($"salary" desc).limit(3).show()
  3. Projection: select | col | selectExpr (supports expressions such as arithmetic and aliases)
df.select("id","name").show()
df.selectExpr("id+10","name as username")
df.col("id")
  4. printSchema(): print the table's schema
  5. withColumn: add a column | withColumnRenamed: rename a column | drop: remove a column
df.withColumn("year_salary",$"salary"*12).show() 
df.withColumnRenamed("name","user_name")
df.drop("salary")
  6. Deduplication: dropDuplicates | distinct
df.dropDuplicates("sex")
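distinct, by contrast, only removes rows that are identical in every column; a quick sketch on the same df:

// distinct() compares all columns; it is equivalent to dropDuplicates() with no arguments
df.distinct().show()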
  7. Sorting the result: orderBy | sort
df.orderBy($"salary" desc ,$"id").show()
df.sort($"salary" desc ,$"id")
  8. groupBy for grouping | agg for aggregation
df.groupBy($"dept").max("salary").where("dept = 2") // where after groupBy acts like HAVING
  .show()

val df = List(
  (1, "zs", false, 1, 15000),
  (2, "ls", false, 1, 18000),
  (3, "ww", true, 2, 19000),
  (4, "zl", false, 1, 18000)
).toDF("id", "name", "sex", "dept", "salary")
import org.apache.spark.sql.expressions.scalalang.typed
df.groupByKey(row => row.getInt(3))
  .agg(typed.sumLong(row => row.getInt(4)))
  .show() // typed aggregations support count, sum and avg
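Alongside the typed aggregation above, the untyped agg method takes column-based aggregate functions from org.apache.spark.sql.functions; a short sketch against the same df:

import org.apache.spark.sql.functions._

df.groupBy($"dept")
  .agg(
    max("salary").as("max_salary"),
    avg("salary").as("avg_salary"),
    count("id").as("emp_count")
  )
  .show()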
  9. pivot: turning rows into columns
val df = List(
    (1, "math", 85),
    (1, "chinese", 80),
    (1, "english", 90),
    (1, "english", 99),
    (2, "math", 90),
    (2, "chinese", 80)
  ).toDF("id", "course", "score")
// Method 1: case ... when ... expressions
df
  // projection query based on expressions
  .selectExpr(
    "id",
    "case course when 'math' then score else 0 end as math",
    "case course when 'chinese' then score else 0 end as chinese",
    "case course when 'english' then score else 0 end as english"
  )
  .groupBy($"id")
  .max("math","chinese","english")
  .show()
// Method 2: the pivot method (simpler)
df
.groupBy("id")
.pivot("course") // 将课程字段math、chinese、english转换化结果表的字段
.max("score") // 保留每个科目分数最高的结果
.show()
  10. na: handling null values
df.groupBy("id").pivot("course").max("score") 
  //.na.fill(-1,Array[String]("english")) 
  .na.drop(Array[String]("math"))  
  .show()
  11. over: window functions
    For a given day, find each user's top-10 pages by visit count, as well as the number of times each user visited each page.
def m10(spark: SparkSession): Unit = {
  val rdd = spark.sparkContext.makeRDD(
    List(
      ("2018-01-01", 1, "www.baidu.com", "10:01"),
      ("2018-01-01", 2, "www.baidu.com", "10:01"),
      ("2018-01-01", 1, "www.sina.com", "10:01"),
      ("2018-01-01", 3, "www.baidu.com", "10:01"),
      ("2018-01-01", 3, "www.baidu.com", "10:01"),
      ("2018-01-01", 1, "www.sina.com", "10:01")
    ))
  import spark.implicits._
  val df = rdd.toDF("day", "user_Id", "page_id", "time")
  // 1. the number of times each user visits each page
  // select user_id,page_id,count(page_id) from t_log group by user_id,page_id

  // declare the window specification
  import org.apache.spark.sql.expressions.Window
  val w1 = Window
    .partitionBy("user_id") // rows with the same user_id fall into the same window
    .orderBy($"count" desc)

  import org.apache.spark.sql.functions._

  df
  .selectExpr("user_Id", "page_id")
  .groupBy("user_id", "page_id")
  .count()
  // attach a ranking column
  .withColumn("rank", rank() over (w1))
  .where("rank <= 1")
  .show()

  spark.stop()
}
  12. Table joins: join | union
import org.apache.spark.sql.SparkSession

object SparkSQLWordCount2 {
  def main(args: Array[String]): Unit = {
    //1. Build the SparkSession, the core entry point of Spark SQL
    val spark = SparkSession.builder().appName("wordcount").master("local[*]").getOrCreate()

    val rdd1 = spark.sparkContext.makeRDD(List((1, "zs", "A"), (2, "ls", "B"), (3, "ww", "C")))
    val rdd2 = spark.sparkContext.makeRDD(List(("A", "市场部"), ("B", "后勤部")))
    import spark.implicits._
    val df1 = rdd1.toDF("id", "name", "dept_id")
    val df2 = rdd2.toDF("d_id", "dept_name")

    df1
      // the default join type is inner
      //.join(df2, $"dept_id" === $"d_id", "left_outer")
      .join(df2, $"dept_id" === $"d_id")
      .show()

    spark.stop()
  }
}
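union, in contrast to join, simply stacks two DataFrames with the same schema on top of each other; a small sketch with two made-up DataFrames:

import spark.implicits._

val dfA = List((1, "zs"), (2, "ls")).toDF("id", "name")
val dfB = List((3, "ww"), (4, "zl")).toDF("id", "name")

// union resolves columns by position, so both sides must share the same schema
dfA.union(dfB).show()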



2.2 SQL operations

// Find each user's top-10 pages by visit count
val rdd3 = spark.sparkContext.makeRDD(
  List(
    ("2018-01-01", 1, "www.baidu.com", "10:01"),
    ("2018-01-01", 2, "www.baidu.com", "10:01"),
    ("2018-01-01", 1, "www.sina.com", "10:01"),
    ("2018-01-01", 3, "www.baidu.com", "10:01"),
    ("2018-01-01", 3, "www.baidu.com", "10:01"),
    ("2018-01-01", 1, "www.sina.com", "10:01")
  ))
import spark.implicits._
val df3 = rdd3.toDF("day", "user_Id", "page_id", "time")

df3.createOrReplaceTempView("t_log")

spark
.sql(
  """
          | select
          |   *
          | from
          |   (select
          |     user_id,
          |     page_id,
          |     num,
          |     rank() over (partition by user_id order by num desc) as rank
          |    from
          |       (select
          |         user_id,
          |         page_id,
          |         count(page_id) as num
          |        from t_log
          |        group by
          |         user_id,page_id
          |        )
          |   )
          | where rank <= 10
          |""".stripMargin)
.show()


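The next query assumes that the employee DataFrame from section 2.1 (id, name, sex, dept, salary) has been registered as a temporary view named t_user, for example:

// register the employee DataFrame from 2.1 so it can be queried by name
df.createOrReplaceTempView("t_user")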
spark
.sql(
  """
          | select
          |   id,
          |   name,
          |   sex,
          |   dept,
          |   salary,
          |   avg(salary) over(partition by dept order by salary desc rows between unbounded preceding and unbounded following) as avg_salary
          | from
          |   t_user
          |""".stripMargin)
.show()

3. Loading and Saving Data

3.1 Loading files with Spark SQL

// The load method reads parquet files by default.
var df = spark.read.load("data/users.parquet")

// To read data in another format, specify it with the format method
val df1: DataFrame = spark.read.format("csv").load("data/country.csv")
df1.toDF("id","name","code").show()

val df2: DataFrame = spark.read.format("json").load("data/emp.json")
df2.show()

val df3: DataFrame = spark.read.format("parquet").load("data/sqldf.parquet")
df3.show()

val df4: DataFrame = spark.read.format("orc").load("data/student.orc")
df4.show()

//Note: when reading a text file, Spark SQL parses each line into a single column.
val df5: DataFrame = spark.read.format("text").load("data/dailykey.txt")
df5.show()

//option sets reader options such as the separator; option("header","true") makes the first line of the file the header
val df6: DataFrame = spark.read.option("header","true").option("sep","|").format("csv").load("data/location-info.csv")
df6.show()
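DataFrameReader also offers format-specific shorthand methods equivalent to format(...).load(...); a brief sketch, reusing the paths above for illustration:

val csvDF  = spark.read.option("header", "true").csv("data/country.csv")
val jsonDF = spark.read.json("data/emp.json")
val orcDF  = spark.read.orc("data/student.orc")
val textDF = spark.read.text("data/dailykey.txt")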

3.2 Saving files with Spark SQL

val frame: DataFrame = spark.read.json("data/emp.json")
frame.write.format("csv").save("out/csv")  //保存成csv格式时,如果是null,则默认保存成空字符串

frame.write.format("json").save("out/json")

frame.write.format("orc").save("out/orc")

frame.write.format("parquet").save("out/parquet")

// When saving a DataFrame as text, all columns must first be combined into a single column, otherwise the write fails
val rdd: RDD[Row] = frame.rdd

val rdd2: RDD[String] = rdd.map(row => {
    row.get(0) + "," + row.get(1) + "," + row.get(2) + "," + row.get(3) + "," + row.get(4) + "," + row.get(5) + "," + row.get(6) + "," + row.get(7)
})
//Convert the new rdd2 back into a DataFrame and then save it.
rdd2.toDF().write.format("text").save("out/text")


// Shorthand save methods in Spark SQL
frame.write.json("out1/json")

frame.write.option("sep","|").csv("out1/csv")

frame.write.parquet("out1/parquet")
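By default a write fails if the target directory already exists; this behavior can be changed with a save mode (a small sketch, using the same frame as above):

import org.apache.spark.sql.SaveMode

// Overwrite replaces existing output; Append, Ignore and ErrorIfExists are the other options
frame.write.mode(SaveMode.Overwrite).parquet("out1/parquet")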

4. UDF
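A minimal sketch of registering a user-defined function and calling it from both SQL and the DSL (the function name my_upper and the sample data below are only for illustration):

import spark.implicits._
import org.apache.spark.sql.functions.udf

// register the function so it can be used inside spark.sql(...) statements
spark.udf.register("my_upper", (s: String) => s.toUpperCase)

val userDF = List((1, "zs"), (2, "ls")).toDF("id", "name")
userDF.createOrReplaceTempView("t_user_udf")
spark.sql("select id, my_upper(name) as upper_name from t_user_udf").show()

// the same logic used as a DSL column expression
val myUpper = udf((s: String) => s.toUpperCase)
userDF.select($"id", myUpper($"name").as("upper_name")).show()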