Table of Contents

  • I. Spark SQL Overview
  • 1. Definition
  • 2. DataFrame
  • 3. DataSet
  • II. Data Conversion
  • <1> RDD <-> DataFrame
  • RDD -> DataFrame
  • DataFrame -> RDD
  • <2> RDD <-> DataSet
  • RDD -> DataSet
  • DataSet -> RDD
  • <3> DataFrame <-> DataSet
  • DataFrame -> DataSet
  • DataSet -> DataFrame
  • III. SQL Execution Modes
  • 1. DSL mode
  • 2. SQL mode
  • IV. Some Simple Spark SQL Operations
  • V. User-Defined Functions
  • 1. UDF
  • 2. UDAF
  • 3. Window functions
  • VI. Integrating with Hive
  • 1. Using the built-in Hive
  • 2. Integrating an external Hive
  • VII. Spark SQL Input and Output
  • 1. Input
  • 2. Output


I. Spark SQL Overview

1. Definition

Spark SQL is a Spark module that can be mixed freely with RDD programming. It supports standard data sources, can integrate with or replace Hive, and can provide JDBC/ODBC server functionality.

(Figure: data abstractions in Spark SQL)

2. DataFrame

SQL is mostly about working with tables, and a table is data plus a schema, so a DataFrame is essentially a table: DataFrame = RDD + Schema.

A DataFrame executes more efficiently than an RDD, mainly thanks to its customized memory management and optimized execution engine.

A DataFrame is weakly typed: there is no compile-time type-safety check (errors only surface at runtime), much like java.sql.ResultSet, where concrete values can only be fetched with calls such as getString.
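
To make the java.sql.ResultSet analogy concrete, here is a minimal sketch (the sample data is made up for illustration) that could be run in spark-shell, pulling values out of untyped Rows by position:

import spark.implicits._

// Build a small DataFrame; each element returned by collect() is an untyped Row.
val df = Seq(("Michael", 29), ("Andy", 30)).toDF("name", "age")

df.collect().foreach { row =>
  val name = row.getString(0)   // position-based access, like ResultSet.getString
  val age  = row.getInt(1)      // the caller has to know the column's type
  println(s"$name is $age")
}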


3. DataSet

A Dataset has all the advantages of a DataFrame and, combined with case classes, adds strong typing; it also comes with its own serialization/deserialization support.
A DataFrame only knows the field names, not their types, so type errors cannot be caught at compile time; for example, arithmetic on a String column only fails when the job actually runs. A Dataset knows both the field names and their types, so it gives much stricter compile-time type checking.
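
The difference shows up in a small sketch (spark-shell style; the People case class and data are illustrative):

import spark.implicits._

case class People(name: String, age: Int)

val df = Seq(("Michael", 29), ("Andy", 30)).toDF("name", "age")
val ds = df.as[People]

// DataFrame: the column is referenced by a string, so a typo in the name or a
// type mistake only shows up when the query is analysed or executed.
df.select($"age" + 1).show()

// Dataset: fields go through the case class, so a misspelled field or a wrong
// type is rejected by the compiler before the job ever runs.
ds.map(p => p.age + 1).show()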

II. Data Conversion

<1> RDD <-> DataFrame
RDD -> DataFrame

1. Manual conversion from an RDD

scala> val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[4] at textFile at <console>:26

scala> val peopleFrame = peopleRDD.map{x=>val para = x.split(",");(para(0),para(1).trim.toInt)}
peopleFrame: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:28

scala> import spark.implicits._
import spark.implicits._

scala> peopleFrame.toDF("name", "age")
res4: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> res4.show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+

2. Conversion via reflection (case class)

scala> var peopleRDD = sc.textFile("File:///home/hadoop1/modules/spark-2.1.1-bin-hadoop2.7/examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = File:///home/hadoop1/modules/spark-2.1.1-bin-hadoop2.7/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24

scala> peopleRDD.collect
res0: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)

scala> case class people(name:String,age:Int)
defined class people

scala> peopleRDD.map{x=>val para = x.split(",");people(para(0),para(1).trim.toInt)}.toDF
res1: org.apache.spark.sql.DataFrame = [name: string, age: int]

3. Programmatically specifying the schema

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[18] at textFile at <console>:38

scala> val schema = StructType(StructField("name",StringType)::StructField("age",IntegerType)::Nil)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))

scala> import org.apache.spark.sql._
import org.apache.spark.sql._

scala> val data=peopleRDD.map{x => val para = x.split(",");Row(para(0),para(1).trim.toInt)}
data: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[19] at map at <console>:43

scala> spark.createDataFrame(data,schema)
res11: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> res11.show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+

DataFrame -> RDD

scala> import res4.sparkSession.implicits._
import res4.sparkSession.implicits._

scala> res4.rdd
res6: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] at rdd at <console>:39

scala> res4.show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+

scala> res6.map(_.getString(0)).collect
res8: Array[String] = Array(Michael, Andy, Justin)

<2> RDD <-> DataSet
RDD -> DataSet

scala> case class People(name:String, age:Int)
defined class People

scala> peopleRDD.map{x => val para = x.split(",");People(para(0),para(1).trim.toInt)}.toDS
res13: org.apache.spark.sql.Dataset[People] = [name: string, age: int]

scala> res13.show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+

DataSet -> RDD

Simply call dataset.rdd; this returns an RDD[People].

scala> res13.rdd
res15: org.apache.spark.rdd.RDD[People] = MapPartitionsRDD[28] at rdd at <console>:48

scala> res15.show
<console>:50: error: value show is not a member of org.apache.spark.rdd.RDD[People]
       res15.show
             ^

scala> res15.collect
res17: Array[People] = Array(People(Michael,29), People(Andy,30), People(Justin,19))

<3> DataFrame <-> DataSet
DataFrame -> DataSet

The schema comes from a case class (the DataFrame column names must match the case class field names).

scala> case class People(name:String, age:Int)
defined class People

scala> res4.as[People]
res18: org.apache.spark.sql.Dataset[People] = [name: string, age: int]

scala> res4.show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+

DataSet -> DataFrame

Simply call dataSet.toDF; the column names are reused from the case class field names.

scala> res18.toDF
res20: org.apache.spark.sql.DataFrame = [name: string, age: int]

III. SQL Execution Modes

1. DSL mode

Queries are expressed by calling DataFrame/Dataset methods.

scala> res20.select($"name", $"age"+1).show
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|       30|
|   Andy|       31|
| Justin|       20|
+-------+---------+

2. SQL mode

Queries are expressed by executing SQL statements.
First register the data as a temporary view. The view is visible only within its SparkSession and is dropped automatically once the session ends; it is referenced by its bare name, with no prefix.
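
The query below assumes a temporary view named people has already been registered; one way to do that, reusing the res20 DataFrame from the previous section, is:

scala> res20.createOrReplaceTempView("people")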

scala> val sqlDF = spark.sql("select * from people")
sqlDF: org.apache.spark.sql.DataFrame = [name: string, age: int]

scala> sqlDF.show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+

A global temporary view is visible across the whole application (all sessions); it is dropped when the SparkContext stops, and it must be referenced with the global_temp prefix, e.g. global_temp.people.

scala> res20.createGlobalTempView("people")

scala> spark.sql("select * from global_temp.people").show
+-------+---+
|   name|age|
+-------+---+
|Michael| 29|
|   Andy| 30|
| Justin| 19|
+-------+---+

IV. Some Simple Spark SQL Operations

Running in the IDEA IDE:

package sparkSession

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object spark_sql {

  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setAppName("spark_sql").setMaster("local[*]")

    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    // Placeholder path: point this at a JSON file (e.g. the employees.json used later in this article)
    val employee = spark.read.json("")
    employee.show()

    employee.select("name").show()

    employee.createOrReplaceTempView("ee")
    spark.sql("select * from ee").show()

    spark.stop()
  }
}

V. User-Defined Functions

1. UDF

Register a UDF with spark.udf.register("name", func), then call it directly in SQL, e.g. select add(name) from people (see the sketch after the snippet below).

scala> val df = spark.read.json("File:///home/hadoop1/modules/spark-2.1.1-bin-hadoop2.7/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
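
The snippet above only loads people.json; the registration and use of the UDF itself is not shown in the original, so here is a minimal sketch, assuming an illustrative UDF named add that simply prefixes each name:

scala> spark.udf.register("add", (name: String) => "Name:" + name)

scala> df.createOrReplaceTempView("people")

scala> spark.sql("select add(name) from people").show()
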
2. UDAF

<1> Untyped (weakly typed)
Register the UDAF, then use it just like a UDF.

package sparkSession

import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

//{"name":"Michael", "salary":3000}
//{"name":"Andy", "salary":4500}
//{"name":"Justin", "salary":3500}
//{"name":"Berta", "salary":4000}

class AverageSal extends UserDefinedAggregateFunction{

  // Input data type
  override def inputSchema: StructType = StructType(StructField("salary",LongType) ::Nil)

  // Buffer schema: the per-partition shared aggregation state
  override def bufferSchema: StructType = StructType(StructField("sum",LongType) :: StructField("count",IntegerType) ::Nil)

  // Output type of the UDAF
  override def dataType: DataType = DoubleType

  // Whether the same input always produces the same output
  override def deterministic: Boolean = true

  // Initialize the buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0
  }

  // Called for every row within a partition
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    // Add the row's salary to the running sum
    buffer(0) = buffer.getLong(0) + input.getLong(0)
    // Increment the row count
    buffer(1) = buffer.getInt(1) + 1
  }

  // Merge the buffers of the individual partitions
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getInt(1) + buffer2.getInt(1)
  }

  // Produce the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getLong(0).toDouble / buffer.getInt(1)
  }
}

object AverageSal{
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("average").setMaster("local[*]")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val employees = spark.read.json("/Users/xiaotongxin/IdeaProjects/spark2/sparksql/sparkhelloworld/src/main/resources/employees.json")
    // SQL mode
    employees.createOrReplaceTempView("employee")
    // Register the UDAF
    spark.udf.register("average",new AverageSal)
    spark.sql("select average(salary) from employee").show()
    spark.stop()
  }
}

<2> Typed (strongly typed)
Operates on a Dataset; the input, buffer (shared state), and output types are specified with case classes.

package sparkSession

import org.apache.spark.SparkConf
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator


case class Employee(name:String,salary:Long)
case class Aver(var sum:Long,var count:Int)

class Average extends Aggregator[Employee,Aver,Double]{

  // Initialize the buffer
  override def zero: Aver = Aver(0L,0)

  // Called for every row within a partition
  override def reduce(b: Aver, a: Employee): Aver = {
    b.sum += a.salary
    b.count += 1
    b
  }

  // Merge the buffers of the individual partitions
  override def merge(b1: Aver, b2: Aver): Aver = {
    b1.sum = b1.sum + b2.sum
    b1.count = b1.count + b2.count
    b1
  }

  // Produce the final result
  override def finish(reduction: Aver): Double = reduction.sum.toDouble / reduction.count

  // Encoder for the buffer type
  override def bufferEncoder: Encoder[Aver] = Encoders.product

  // Encoder for the output type
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object Average{
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("aver").setMaster("local[*]")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    import spark.implicits._
    val ee = spark.read.json("/Users/xiaotongxin/IdeaProjects/spark2/sparksql/sparkhelloworld/src/main/resources/employees.json").as[Employee]
    println("Original data")
    ee.show()
    // Wrap the aggregator as a typed column and select it
    val averageSalary = new Average().toColumn.name("average_salary")
    ee.select(averageSalary).show()
    spark.stop()
  }
}

3. Window functions

rank(): leaves gaps in the ranking; with two rows tied for second place, the next row is ranked fourth.
dense_rank(): no gaps; with two rows tied for second place, the next row is ranked third.
over(): an aggregate function collapses many rows into one, whereas a window function attaches a value to every row. With a plain aggregate, any extra column you want to display must be added to GROUP BY; with a window function you can show every column without GROUP BY. In effect, a window function appends the aggregate result as an extra column on each row.

Common usages:
  1. Show the aggregate over the whole result set on every row: aggregate() over()
  2. Show the per-group aggregate on every row: aggregate() over(partition by column) as alias -- the data is partitioned by the column and aggregated within each partition
  3. Combine with a ranking function: row_number() over(order by column) as alias

Common analytic functions (the ranking functions 1-3 are the most used):
  1. row_number() over(partition by ... order by ...)
  2. rank() over(partition by ... order by ...)
  3. dense_rank() over(partition by ... order by ...)
  4. count() over(partition by ... order by ...)
  5. max() over(partition by ... order by ...)
  6. min() over(partition by ... order by ...)
  7. sum() over(partition by ... order by ...)
  8. avg() over(partition by ... order by ...)
  9. first_value() over(partition by ... order by ...)
  10. last_value() over(partition by ... order by ...)
  11. lag() over(partition by ... order by ...)
  12. lead() over(partition by ... order by ...)

lag and lead return, for the current row, a column value from a row a given offset before or after it in the window's ordering, without self-joining the result set. lag looks backward, lead looks forward. Both take three arguments: the column name, the offset, and a default value used when the offset falls outside the window.

Usage is shown below:

package sparkSession

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Score(name:String,clazz:Int,score:Int)

object OverFunction {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("over").setMaster("local[*]")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    // Implicit conversions
    import spark.implicits._
    println("//***************  Original class-score table  ****************//")
    val scoreDF = spark.sparkContext.makeRDD(Array( Score("a", 1, 80),
                                                    Score("b", 1, 78),
                                                    Score("c", 1, 95),
                                                    Score("d", 2, 74),
                                                    Score("e", 2, 92),
                                                    Score("f", 3, 99),
                                                    Score("g", 3, 99),
                                                    Score("h", 3, 45),
                                                    Score("i", 3, 55),
                                                    Score("j", 3, 78))).toDF("name","class","score")
    scoreDF.createOrReplaceTempView("score")
    scoreDF.show()
    println("//***************  Top-scoring student in each class  ***************/")
    println("    /*******  With the window function  ********/")
    spark.sql("select name,class,score,rank() over(partition by class order by score desc) rank from score").show()

    println("/********  Final result  ************/")
    spark.sql("select * from(select name,class,score,rank() over(partition by class order by score desc) rank from score)as a where a.rank=1").show()

    println("/**************  Top-scoring student in each class (group by)  ***************/")
    spark.sql("select a.name,b.class,b.max from score as a,(select class,max(score) max from score group by class)as b where a.score=b.max").show()

    spark.stop()
  }
}
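
The lag/lead functions described above are not exercised by this example; a minimal sketch on the same score view (it could be added just before spark.stop() in the program above) might look like this:

// Hypothetical follow-up query: for each student, show the previous and next
// score within the same class, using lag/lead with a default of 0 when the
// neighbour falls outside the window.
spark.sql(
  """select name, class, score,
    |       lag(score, 1, 0)  over(partition by class order by score desc) as prev_score,
    |       lead(score, 1, 0) over(partition by class order by score desc) as next_score
    |from score""".stripMargin).show()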

VI. Integrating with Hive

1. Using the built-in Hive

If a metastore_db directory already exists on the master node, delete it, then start the client with --conf spark.sql.warehouse.dir=hdfs://hadoop1:9000/spark_warehouse.

Note: when using the built-in Hive (Spark 2.0+), spark.sql.warehouse.dir specifies the warehouse location. If you want the warehouse on HDFS, copy core-site.xml and hdfs-site.xml into Spark's conf directory; otherwise the warehouse directory is only created locally on the master node, and queries from other nodes fail with file-not-found errors. When switching to HDFS, delete the existing metastore and restart the cluster.
2. Integrating an external Hive
  • Symlink hive-site.xml from Hive's conf directory into Spark's conf directory.
  • Start spark-shell or spark-sql and it takes over Hive directly (a sketch of the programmatic equivalent follows this list).
  • The takeover works by sharing Hive's metastore, so the metastore database's JDBC driver must be available; whether the Hive service itself is running does not matter.
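
Beyond spark-shell/spark-sql, Hive support can also be enabled programmatically. A minimal sketch (object name is illustrative), assuming hive-site.xml has already been linked into Spark's conf directory as described above:

package sparkSession

import org.apache.spark.sql.SparkSession

object HiveDemo {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport() makes the session use the Hive metastore configured
    // in hive-site.xml, so existing Hive tables become queryable via spark.sql.
    val spark = SparkSession.builder()
      .appName("hiveDemo")
      .master("local[*]")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("show tables").show()
    spark.stop()
  }
}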

VII. Spark SQL Input and Output

1. Input

spark.read.csv(path)
spark.read.json(path)
spark.read.jdbc(url, table, properties)
spark.read.load(path)
spark.read.orc(path)
spark.read.parquet(path)
spark.read.text(path)
spark.read.table(tableName)
spark.read.textFile(path)

textFile returns a Dataset[String]; the other methods return a DataFrame.

spark.read.format("...").load(path) requires the format to be specified; if it is not, the default is parquet.
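
As a quick illustration of the generic form (reusing the people.json file from the earlier sections):

// The shortcut method and the generic format(...).load(...) form are equivalent;
// without format(...), load falls back to the parquet default.
val df1 = spark.read.json("examples/src/main/resources/people.json")
val df2 = spark.read.format("json").load("examples/src/main/resources/people.json")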

2. Output

dataFrame.write.csv(path)
dataFrame.write.json(path)
dataFrame.write.jdbc(url, table, properties)
dataFrame.write.save(path)
dataFrame.write.orc(path)
dataFrame.write.parquet(path)
dataFrame.write.text(path)
dataFrame.write.saveAsTable(tableName)

(DataFrameWriter has no textFile method; use text(path) for plain-text output.)
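
And a minimal write sketch (the output directory is an assumption for illustration), reusing df1 from the read sketch above; SaveMode controls what happens when the target already exists:

import org.apache.spark.sql.SaveMode

// Generic form: format(...) + mode(...) + save(path); without format(...) the
// default output format is parquet.
df1.write
  .format("json")
  .mode(SaveMode.Overwrite)   // other modes: ErrorIfExists (default), Append, Ignore
  .save("/tmp/people_json")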