Contents
- I. Spark SQL Overview
  - 1. Definition
  - 2. DataFrame
  - 3. DataSet
- II. Conversions
  - <1> RDD <-> DataFrame
    - RDD -> DataFrame
    - DataFrame -> RDD
  - <2> RDD <-> DataSet
    - RDD -> DataSet
    - DataSet -> RDD
  - <3> DataFrame <-> DataSet
    - DataFrame -> DataSet
    - DataSet -> DataFrame
- III. SQL Execution Modes
  - 1. DSL mode
  - 2. SQL mode
- IV. Some Simple Spark SQL Operations
- V. User-Defined Functions
  - 1. UDF
  - 2. UDAF
  - 3. Window Functions
- VI. Integrating Hive
  - 1. Using the built-in Hive
  - 2. Integrating an external Hive
- VII. Spark SQL Input and Output
  - 1. Input
  - 2. Output
I. Spark SQL Overview
1. Definition
Spark SQL is a Spark module that mixes naturally with RDD programming, supports standard data sources, can integrate with or replace Hive, and can expose JDBC/ODBC server functionality.
The data abstractions in Spark SQL:
2. DataFrame
SQL works on tables, and a table is data plus a schema, so a DataFrame is essentially a table: DataFrame = RDD + Schema.
A DataFrame executes more efficiently than a raw RDD, mainly thanks to customized memory management and an optimized execution engine.
A DataFrame is a weakly typed data abstraction: it lacks compile-time type checking, so type errors only surface at runtime. It behaves much like java.sql.ResultSet, where concrete values can only be fetched through calls such as getString.
3. DataSet
A DataSet has all the advantages of a DataFrame and, combined with a case class, adds strong typing; it also supports partial serialization and deserialization.
A DataFrame knows only the field names, not their types, so type errors cannot be caught at compile time: arithmetic on a String column, for example, fails only at execution time. A DataSet knows both the field names and their types, so type errors are caught earlier and checked much more strictly.
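The compile-time vs runtime distinction can be sketched in plain Scala, without Spark (the Person class and field names here are illustrative only):

```scala
// A typed collection stands in for a Dataset; an Array[Any] row stands in for
// a DataFrame's untyped Row. (Illustrative sketch only — no Spark involved.)
case class Person(name: String, age: Int)

val typed: Seq[Person] = Seq(Person("Michael", 29), Person("Andy", 30))

// Typed access: the compiler knows `age` is an Int, so type errors are
// caught at compile time.
val agesPlusOne = typed.map(p => p.age + 1)
// typed.map(p => p.age.trim)  // would not even compile: Int has no `trim`

// Untyped, Row-like access: fields come back as Any and must be cast;
// a wrong cast fails only when the line actually runs.
val rows: Seq[Array[Any]] = typed.map(p => Array[Any](p.name, p.age))
val ages = rows.map(r => r(1).asInstanceOf[Int])

println(agesPlusOne) // List(30, 31)
println(ages)        // List(29, 30)
```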
II. Conversions
<1> RDD <-> DataFrame
RDD -> DataFrame
1. Manual conversion
scala> val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[4] at textFile at <console>:26
scala> val peopleFrame = peopleRDD.map{x=>val para = x.split(",");(para(0),para(1).trim.toInt)}
peopleFrame: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[5] at map at <console>:28
scala> import spark.implicits._
import spark.implicits._
scala> peopleFrame.toDF("name", "age")
res4: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> res4.show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
2. By reflection
scala> var peopleRDD = sc.textFile("File:///home/hadoop1/modules/spark-2.1.1-bin-hadoop2.7/examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = File:///home/hadoop1/modules/spark-2.1.1-bin-hadoop2.7/examples/src/main/resources/people.txt MapPartitionsRDD[1] at textFile at <console>:24
scala> peopleRDD.collect
res0: Array[String] = Array(Michael, 29, Andy, 30, Justin, 19)
scala> case class people(name:String,age:Int)
defined class people
scala> peopleRDD.map{x=>val para = x.split(",");people(para(0),para(1).trim.toInt)}.toDF
res1: org.apache.spark.sql.DataFrame = [name: string, age: int]
3. By specifying the schema programmatically
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
peopleRDD: org.apache.spark.rdd.RDD[String] = examples/src/main/resources/people.txt MapPartitionsRDD[18] at textFile at <console>:38
scala> val schema = StructType(StructField("name",StringType)::StructField("age",IntegerType)::Nil)
schema: org.apache.spark.sql.types.StructType = StructType(StructField(name,StringType,true), StructField(age,IntegerType,true))
scala> import org.apache.spark.sql._
import org.apache.spark.sql._
scala> val data=peopleRDD.map{x => val para = x.split(",");Row(para(0),para(1).trim.toInt)}
data: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[19] at map at <console>:43
scala> spark.createDataFrame(data,schema)
res11: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> res11.show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
DataFrame -> RDD
scala> import res4.sparkSession.implicits._
import res4.sparkSession.implicits._
scala> res4.rdd
res6: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[12] at rdd at <console>:39
scala> res4.show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
scala> res6.map(_.getString(0)).collect
res8: Array[String] = Array(Michael, Andy, Justin)
<2> RDD <-> DataSet
RDD -> DataSet
scala> case class People(name:String, age:Int)
defined class People
scala> peopleRDD.map{x => val para = x.split(",");People(para(0),para(1).trim.toInt)}.toDS
res13: org.apache.spark.sql.Dataset[People] = [name: string, age: int]
scala> res13.show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
DataSet -> RDD
Simply call dataset.rdd; it returns an RDD[People]:
scala> res13.rdd
res15: org.apache.spark.rdd.RDD[People] = MapPartitionsRDD[28] at rdd at <console>:48
scala> res15.show
<console>:50: error: value show is not a member of org.apache.spark.rdd.RDD[People]
res15.show
^
scala> res15.collect
res17: Array[People] = Array(People(Michael,29), People(Andy,30), People(Justin,19))
<3> DataFrame <-> DataSet
DataFrame -> DataSet
The schema comes from a case class (the DataFrame's column names must match the case class field names):
scala> case class People(name:String, age:Int)
defined class People
scala> res4.as[People]
res18: org.apache.spark.sql.Dataset[People] = [name: string, age: int]
scala> res4.show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
DataSet -> DataFrame
Simply call dataSet.toDF; the column names are reused from the case class fields:
scala> res18.toDF
res20: org.apache.spark.sql.DataFrame = [name: string, age: int]
III. SQL Execution Modes
1. DSL mode
Queries are expressed by calling methods:
scala> res20.select($"name", $"age"+1).show
+-------+---------+
| name|(age + 1)|
+-------+---------+
|Michael| 30|
| Andy| 31|
| Justin| 20|
+-------+---------+
2. SQL mode
Queries are expressed by executing SQL statements.
First register the DataFrame as a temporary view, e.g. res20.createOrReplaceTempView("people"). The view is scoped to the SparkSession: once the session ends, it is dropped automatically, and it is referenced by its bare name with no prefix:
scala> val sqlDF = spark.sql("select * from people")
sqlDF: org.apache.spark.sql.DataFrame = [name: string, age: int]
scala> sqlDF.show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
A global temporary view is visible application-wide: once the SparkContext stops, it is dropped automatically, and it must be referenced with the global_temp prefix, i.e. global_temp.<view name>:
scala> res20.createGlobalTempView("people")
scala> spark.sql("select * from global_temp.people").show
+-------+---+
| name|age|
+-------+---+
|Michael| 29|
| Andy| 30|
| Justin| 19|
+-------+---+
IV. Some Simple Spark SQL Operations
Using Spark SQL from an IDE (e.g. IDEA):
package sparkSession

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object spark_sql {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("spark_sql").setMaster("local[*]")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val employee = spark.read.json("")
    employee.show()
    employee.select("name").show()
    employee.createOrReplaceTempView("ee")
    spark.sql("select * from ee").show()
    spark.stop()
  }
}
V. User-Defined Functions
1. UDF
Register with spark.udf.register("add", func).
Then call it directly in SQL: select add(name) from people
scala> val df = spark.read.json("File:///home/hadoop1/modules/spark-2.1.1-bin-hadoop2.7/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
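The transcript above stops before actually registering a UDF against this DataFrame. The pattern itself can be sketched as follows — the function name addName and its body are illustrative, not from the source; the function is plain Scala, while the commented-out lines show how it would be wired into a Spark session:

```scala
// A UDF body is just an ordinary Scala function.
val addName: String => String = name => "Name:" + name

// In a Spark session it would be registered and called like this:
//   spark.udf.register("addName", addName)
//   spark.sql("select addName(name) from people").show()

println(addName("Michael")) // prints Name:Michael
```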
2. UDAF
<1> Weakly typed
Register a UDAF and it can then be used just like a UDF:
package sparkSession

import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

//{"name":"Michael", "salary":3000}
//{"name":"Andy", "salary":4500}
//{"name":"Justin", "salary":3500}
//{"name":"Berta", "salary":4000}
class AverageSal extends UserDefinedAggregateFunction {
  // Input data type
  override def inputSchema: StructType = StructType(StructField("salary", LongType) :: Nil)
  // Schema of the shared buffer (per-partition intermediate state)
  override def bufferSchema: StructType = StructType(StructField("sum", LongType) :: StructField("count", IntegerType) :: Nil)
  // Output type of the UDAF
  override def dataType: DataType = DoubleType
  // true if the same input always produces the same output
  override def deterministic: Boolean = true
  // Initialize the buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0
  }
  // Called for each input row within a partition
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    // Add this row's salary to the running sum
    buffer(0) = buffer.getLong(0) + input.getLong(0)
    // Increment the row count
    buffer(1) = buffer.getInt(1) + 1
  }
  // Merge buffers from different partitions
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getInt(1) + buffer2.getInt(1)
  }
  // Produce the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getLong(0).toDouble / buffer.getInt(1)
  }
}

object AverageSal {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("average").setMaster("local[*]")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val employees = spark.read.json("/Users/xiaotongxin/IdeaProjects/spark2/sparksql/sparkhelloworld/src/main/resources/employees.json")
    // SQL mode
    employees.createOrReplaceTempView("employee")
    // Register the UDAF
    spark.udf.register("average", new AverageSal)
    spark.sql("select average(salary) from employee").show()
    spark.stop()
  }
}
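The buffer lifecycle of AverageSal can be traced without Spark: initialize sets (0L, 0), update folds each salary into the buffer, merge combines per-partition buffers, and evaluate divides sum by count. A plain-Scala walk-through over the four sample salaries (the two-partition split is arbitrary, for illustration):

```scala
// Simulate the UDAF buffer lifecycle on the sample salaries,
// pretending Spark split them across two partitions.
val partition1 = Seq(3000L, 4500L)
val partition2 = Seq(3500L, 4000L)

// initialize + update within one partition: buffer = (sum, count)
def fold(rows: Seq[Long]): (Long, Int) =
  rows.foldLeft((0L, 0)) { case ((sum, cnt), salary) => (sum + salary, cnt + 1) }

// merge the partition buffers, then evaluate
val (b1, b2) = (fold(partition1), fold(partition2))
val merged = (b1._1 + b2._1, b1._2 + b2._2)
val average = merged._1.toDouble / merged._2

println(average) // 3750.0
```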
<2> Strongly typed
Operates on a Dataset, with the input, buffer, and output types all declared explicitly via case classes:
package sparkSession

import org.apache.spark.SparkConf
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Long)
case class Aver(var sum: Long, var count: Int)

class Average extends Aggregator[Employee, Aver, Double] {
  // Initialize the buffer
  override def zero: Aver = Aver(0L, 0)
  // Called for each input row within a partition
  override def reduce(b: Aver, a: Employee): Aver = {
    b.sum += a.salary
    b.count += 1
    b
  }
  // Merge buffers from different partitions
  override def merge(b1: Aver, b2: Aver): Aver = {
    b1.sum = b1.sum + b2.sum
    b1.count = b1.count + b2.count
    b1
  }
  // Produce the final result
  override def finish(reduction: Aver): Double = reduction.sum.toDouble / reduction.count
  // Encoder for the buffer type
  override def bufferEncoder: Encoder[Aver] = Encoders.product
  // Encoder for the output type
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object Average {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("aver").setMaster("local[*]")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    import spark.implicits._
    val ee = spark.read.json("/Users/xiaotongxin/IdeaProjects/spark2/sparksql/sparkhelloworld/src/main/resources/employees.json").as[Employee]
    println("Original data")
    ee.show()
    // Use the aggregator as a typed column
    val averageSalary = new Average().toColumn.name("average_salary")
    ee.select(averageSalary).show()
    spark.stop()
  }
}
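The Aggregator's zero/reduce/merge/finish contract is plain Scala logic and can be exercised directly, without a SparkSession (the two-partition split below is again arbitrary):

```scala
case class Aver(var sum: Long, var count: Int)

// The same four callbacks the Aggregator defines, as standalone functions
def zero: Aver = Aver(0L, 0)
def reduce(b: Aver, salary: Long): Aver = { b.sum += salary; b.count += 1; b }
def merge(b1: Aver, b2: Aver): Aver = { b1.sum += b2.sum; b1.count += b2.count; b1 }
def finish(b: Aver): Double = b.sum.toDouble / b.count

// Fold each "partition", merge, then finish — Spark's call order
val p1 = Seq(3000L, 4500L).foldLeft(zero)(reduce)
val p2 = Seq(3500L, 4000L).foldLeft(zero)(reduce)
val result = finish(merge(p1, p2))

println(result) // 3750.0
```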
3. Window Functions
rank() skips ranks after ties: with two rows tied for 2nd place, the next row ranks 4th.
dense_rank() does not skip: with two rows tied for 2nd place, the next row still ranks 3rd.
The over() window clause:
- A plain aggregate collapses many rows into one, whereas a window function keeps one output row per input row, attaching the aggregate to each.
- With a plain aggregate, any other column you want displayed must appear in GROUP BY; with a window function no GROUP BY is needed, and all columns can be shown directly.
- A window function is suited to appending an aggregate result as an extra column on every row.
Common window usages:
1. Show the aggregate for every row: aggregate() over()
2. Show the per-group aggregate for every row: aggregate() over(partition by col) as alias — partition by the column, then aggregate within each partition
3. Combine with a ranking function: row_number() over(order by col) as alias
Common analytic functions (the ranking functions 1-3 are the most used):
1. row_number() over(partition by ... order by ...)
2. rank() over(partition by ... order by ...)
3. dense_rank() over(partition by ... order by ...)
4. count() over(partition by ... order by ...)
5. max() over(partition by ... order by ...)
6. min() over(partition by ... order by ...)
7. sum() over(partition by ... order by ...)
8. avg() over(partition by ... order by ...)
9. first_value() over(partition by ... order by ...)
10. last_value() over(partition by ... order by ...)
11. lag() over(partition by ... order by ...)
12. lead() over(partition by ... order by ...)
lag and lead fetch, within the window's ordering, a column value from a row a given offset before (lag) or after (lead) the current row, without self-joining the result set. They take three arguments: the column name, the offset, and a default value used when the offset runs past the edge of the window.
Usage:
package sparkSession

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

case class Score(name: String, clazz: Int, score: Int)

object OverFunction {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("over").setMaster("local[*]")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    // Implicit conversions
    import spark.implicits._
    println("//*************** Original score table ****************//")
    val scoreDF = spark.sparkContext.makeRDD(Array(
      Score("a", 1, 80),
      Score("b", 1, 78),
      Score("c", 1, 95),
      Score("d", 2, 74),
      Score("e", 2, 92),
      Score("f", 3, 99),
      Score("g", 3, 99),
      Score("h", 3, 45),
      Score("i", 3, 55),
      Score("j", 3, 78))).toDF("name", "class", "score")
    scoreDF.createOrReplaceTempView("score")
    scoreDF.show()
    println("//*************** Top-scoring student in each class ***************//")
    println("/******* With the window function ********/")
    spark.sql("select name, class, score, rank() over(partition by class order by score desc) rank from score").show()
    println("/******** Final result ************/")
    spark.sql("select * from (select name, class, score, rank() over(partition by class order by score desc) rank from score) as a where a.rank = 1").show()
    println("/************** Top-scoring student in each class (group by) ***************/")
    spark.sql("select a.name, b.class, b.max from score as a, (select class, max(score) max from score group by class) as b where a.score = b.max").show()
    spark.stop()
  }
}
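The rank() semantics used in the query above (and the rank vs dense_rank distinction from the notes) can be checked in plain Scala on class 3's scores:

```scala
// Class 3 scores, sorted descending as `order by score desc` would produce
val scores = Seq(99, 99, 78, 55, 45)

// rank(): 1 + index of the first row with the same value (gaps after ties)
val rank = scores.map(s => scores.indexWhere(_ == s) + 1)

// dense_rank(): number of distinct values >= the current value (no gaps)
val denseRank = scores.map(s => scores.distinct.count(_ >= s))

println(rank)      // List(1, 1, 3, 4, 5)
println(denseRank) // List(1, 1, 2, 3, 4)
```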
VI. Integrating Hive
1. Using the built-in Hive
If a metastore_db directory already exists on the master node, delete it, then start the client with --conf spark.sql.warehouse.dir=hdfs://hadoop1:9000/spark_warehouse
Note: when using the built-in Hive, spark.sql.warehouse.dir (introduced in Spark 2.0) specifies the warehouse location. To put the warehouse on HDFS, copy core-site.xml and hdfs-site.xml into Spark's conf directory; otherwise the warehouse directory is created only on the master node's local filesystem, and queries from other nodes fail with file-not-found errors. When switching to HDFS, delete the existing metastore and restart the cluster.
2. Integrating an external Hive
- Symlink Hive's hive-site.xml from its conf directory into Spark's conf directory.
- Start spark-shell or spark-sql; Spark then takes over Hive directly.
- The takeover works by sharing Hive's metastore database, so the metastore's JDBC driver must be present; whether Hive itself is running has no effect.
VII. Spark SQL Input and Output
1. Input
spark.read.csv(path)
spark.read.json(path)
spark.read.jdbc(url, table, properties)
spark.read.load(path)
spark.read.orc(path)
spark.read.parquet(path)
spark.read.text(path)
spark.read.table(tableName)
spark.read.textFile(path)
textFile returns a Dataset[String]; the other methods return a DataFrame.
spark.read.format("...").load(path) — the format must be specified explicitly; otherwise parquet is assumed by default.
2. Output
dataFrame.write.csv(path)
dataFrame.write.json(path)
dataFrame.write.jdbc(url, table, properties)
dataFrame.write.save(path)
dataFrame.write.orc(path)
dataFrame.write.parquet(path)
dataFrame.write.text(path)
dataFrame.write.saveAsTable(tableName)
The generic form is dataFrame.write.format("...").save(path); as with reads, parquet is the default format.