Spark SQL API (Chapter 4)
Spark SQL is a module for structured data processing. Unlike the Spark RDD API, the Spark SQL APIs give the Spark engine more information about the data and the computation, such as the data structure and the operators being applied, and Spark uses this extra information internally to optimize and tune the job. There are several ways to interact with Spark SQL, including the Dataset API and SQL, and the two can be mixed freely. One use of Spark SQL is to execute SQL queries; it can also read data from an existing Hive installation. When SQL is run from another programming language the results are returned as a Dataset/DataFrame, and you can also interact with the SQL interface through the command line or over JDBC/ODBC.
A Dataset is a distributed collection of data and a new interface added in Spark 1.6. It combines the advantages of RDDs (strong typing, the ability to use powerful lambda functions) with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated with functional transformations such as map, flatMap and filter. The Dataset API is currently available in Scala and Java; Python support for Datasets is still incomplete.
A DataFrame is a Dataset organized into named columns. It is conceptually equivalent to a table in a relational database. DataFrames can be built from many sources, such as structured data files, Hive tables, or external databases. Under the hood it is a Dataset of Row objects, so a DataFrame can simply be thought of as a Dataset[Row].
SparkSession
The entry point for all functionality in Spark is the SparkSession class. To create a basic SparkSession, simply use:
SparkSession.builder()
Add the dependencies:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<build>
<plugins>
<!-- Scala compiler plugin -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>4.0.1</version>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- fat-jar (shade) plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.4.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
</configuration>
</execution>
</executions>
</plugin>
<!-- Java compiler plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.2</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
<executions>
<execution>
<phase>compile</phase>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
Driver program
import org.apache.spark.sql.SparkSession

// case class matching the data below (same as the one defined in the Dataset section)
case class User(id: Int, name: String, deptNo: Int, salary: Double)

object SparkSqlHello01 {
  def main(args: Array[String]): Unit = {
    // Create the SparkSession
    val spark = SparkSession.builder()
      .appName("hello world")
      .master("local[*]")
      .getOrCreate()
    // Import the implicit conversions that turn collections or RDDs into Dataset/DataFrame
    import spark.implicits._
    // Set the log level
    spark.sparkContext.setLogLevel("FATAL")
    // Create a Dataset or DataFrame
    val users = List(User(1, "张三", 1, 1000),
      User(2, "李四", 2, 1500),
      User(3, "王五", 1, 2000))
    val userDataFrame = users.toDF()
    // Register a temp view so the data can be queried as a table
    userDataFrame.createOrReplaceTempView("t_user")
    // Use the Spark SQL operators or a SQL statement
    val resultFrame = spark.sql("select id,name,deptNo,salary,salary*12 annual_salary from t_user")
    // Write the SQL result to an external system (here just print it to the console)
    resultFrame.show()
    // Stop the session
    spark.stop()
  }
}
Dataset
A Dataset is similar to an RDD, but instead of using Java serialization or Kryo it uses a dedicated Encoder to serialize objects for processing or transmission over the network. Although both Encoders and standard serialization turn an object into bytes, Encoders are dynamically generated code and use a format that lets Spark perform many operations, such as filtering, sorting and hashing, without deserializing the bytes back into objects.
case class
case class User(id: Int, name: String, deptNo: Int, salary: Double)

val userDataSet: Dataset[User] = List(User(1, "张三", 1, 1000),
  User(2, "李四", 2, 1500),
  User(3, "王五", 1, 2000)).toDS()
// Register a temp view so the data can be queried as a table
userDataSet.createOrReplaceTempView("t_user")
// Use the Spark SQL operators or a SQL statement
val resultSet = spark.sql("select id,name,deptNo,salary,salary*12 annual_salary from t_user")
// Write the SQL result to an external system (here just print it to the console)
resultSet.show()
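The Encoder mentioned above can also be obtained explicitly. Below is a small sketch (not from the original; it assumes the spark session, the User case class above and import spark.implicits._ are in scope) that inspects the schema an Encoder encodes to and builds a Dataset through createDataset:

import org.apache.spark.sql.{Encoder, Encoders}

// Encoders.product derives the same Encoder that toDS() picks up implicitly for a case class
val userEncoder: Encoder[User] = Encoders.product[User]
println(userEncoder.schema) // the columnar layout the objects are encoded into

// createDataset relies on the same Encoder machinery as toDS()
val ds = spark.createDataset(List(User(1, "张三", 1, 1000), User(2, "李四", 2, 1500)))
ds.filter(_.salary > 1200).show() // typed lambda, no Row deserialization in user code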
Tuple
val dataset: Dataset[(Int,String,Int,Boolean)] = List((1,"zhangsan",18,true),(2,"wangwu",28,true)).toDS()
dataset.select($"_1",$"_2").show()
// or
dataset.selectExpr("_1 as id","_2 as name","(_3 * 10) as age").show()
JSON data
Create a file containing the following JSON data (one object per line):
{"name":"张三","age":18}
{"name":"lisi","age":28}
{"name":"wangwu","age":38}
case class Person(name: String, age: Long) // note: integer values in JSON are mapped to Long
val dataset = spark.read.json("D:///Persion.json").as[Person]
dataset.show()
rdd
Tuple
val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
userRDD.toDS().show()
case-class
val userRDD = spark.sparkContext.makeRDD(List(User(1,"张三",true,18,15000.0)))
userRDD.toDS().show()
DataFrame
As noted above, a DataFrame is a Dataset organized into named columns, conceptually equivalent to a relational table. It can be built from structured data files, Hive tables, or external databases, and is simply a Dataset[Row].
JSON file
val frame = spark.read.json("file:///f:/person.json")
frame.show()
case-class
List(Person("zhangsan",18),Person("王五",20)).toDF("uname","uage").show()
Tuple元组
List(("zhangsan",18),("王五",20)).toDF("name","age").show()
Converting from an RDD
Row
val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
.map(t=>Row(t._1,t._2,t._3,t._4,t._5))
var schema=new StructType()
.add("id","int")
.add("name","string")
.add("sex","boolean")
.add("age","int")
.add("salary","double")
spark.createDataFrame(userRDD,schema).show()
Javabean
val userRDD = spark.sparkContext.makeRDD(List(new User(1,"张三",true,18,15000.0)))
spark.createDataFrame(userRDD,classOf[User]).show()
Here User must be a JavaBean. If it is a plain Scala class, you would have to provide getXxx methods yourself (usually not worth the trouble).
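If you do want to stay in Scala, one hedged option (a sketch, not from the original; the BeanUser name is illustrative) is to annotate a plain class with @BeanProperty so the getters expected by createDataFrame(rdd, classOf[...]) are generated automatically:

import scala.beans.BeanProperty

// Bean-style class: @BeanProperty generates getId/setId, getName/setName, ...
class BeanUser(@BeanProperty var id: Int,
               @BeanProperty var name: String,
               @BeanProperty var sex: Boolean,
               @BeanProperty var age: Int,
               @BeanProperty var salary: Double) extends Serializable {
  def this() = this(0, "", false, 0, 0.0) // no-arg constructor, following the JavaBean convention
}

val beanRDD = spark.sparkContext.makeRDD(List(new BeanUser(1, "张三", true, 18, 15000.0)))
spark.createDataFrame(beanRDD, classOf[BeanUser]).show()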
case-class
val userRDD = spark.sparkContext.makeRDD(List(User(1,"张三",true,18,15000.0)))
spark.createDataFrame(userRDD).show() // equivalent to userRDD.toDF().show()
Tuple
val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
spark.createDataFrame(userRDD).show()
Summary:
any data --> RDD
elements inside the RDD --> tuples or case classes
toDF
Dataset/DataFrame API operations
Prepare the data
1,Michael,false,29,2000
5,Lisa,false,19,1000
3,Justin,true,19,1000
2,Andy,true,30,5000
4,Kaine,false,20,5000
Turn the text data above into a DataFrame:
case class User01(id: Int, name: String, sex: Boolean, age: Int, salary: Double)
// userLines holds the five CSV lines shown above
val userLines = spark.sparkContext.makeRDD(List("1,Michael,false,29,2000", "5,Lisa,false,19,1000",
  "3,Justin,true,19,1000", "2,Andy,true,30,5000", "4,Kaine,false,20,5000"))
val userRDD: RDD[User01] = userLines.map(_.split(","))
  .map(ts => User01(ts(0).toInt, ts(1), ts(2).toBoolean, ts(3).toInt, ts(4).toDouble))
val userDataFrame = userRDD.toDF()
printSchema
Prints the schema of the DataFrame.
userDataFrame.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- sex: boolean (nullable = false)
|-- age: integer (nullable = false)
|-- salary: double (nullable = false)
show
By default, prints the first 20 rows of a DataFrame or Dataset to the console; mostly used for testing.
userDataFrame.show()
+---+-------+-----+---+------+
| id| name| sex|age|salary|
+---+-------+-----+---+------+
| 1|Michael|false| 29|2000.0|
| 2| Andy| true| 30|5000.0|
| 3| Justin| true| 19|1000.0|
| 4| Kaine|false| 20|5000.0|
| 5| Lisa|false| 19|1000.0|
+---+-------+-----+---+------+
For example, to display only the first 2 rows: userDataFrame.show(2)
+---+-------+-----+---+------+
| id| name| sex|age|salary|
+---+-------+-----+---+------+
| 1|Michael|false| 29|2000.0|
| 2| Andy| true| 30|5000.0|
+---+-------+-----+---+------+
select
Equivalent to the SELECT clause of a SQL statement; used to project the required columns. You can pass plain column names directly, but that form does not support computations on the columns.
userDataFrame.select("id","name","sex","age","salary")
.show()
You can also pass Column objects to select, which allows simple computations on the columns:
userDataFrame.select(new Column("id"), new Column("name"), new Column("age"),
    new Column("salary"), new Column("salary").*(12))
  .show()
A shorter way to write the same thing:
userDataFrame.select($"id",$"name",$"age",$"salary",$"salary" * 12 as "annual_salary")
.show()
+---+-------+---+------+-------------+
| id| name|age|salary|annual_salary|
+---+-------+---+------+-------------+
| 1|Michael| 29|2000.0| 24000.0|
| 2| Andy| 30|5000.0| 60000.0|
| 3| Justin| 19|1000.0| 12000.0|
| 4| Kaine| 20|5000.0| 60000.0|
| 5| Lisa| 19|1000.0| 12000.0|
+---+-------+---+------+-------------+
selectExpr
Accepts column names directly and also lets you apply common SQL operators to them as expression strings.
userDataFrame.selectExpr("id","name || '用户'","salary * 12 as annal_salary").show()
+---+------------------+------------+
| id|concat(name, ⽤户)|annal_salary|
+---+------------------+------------+
| 1| Michael⽤户| 24000.0|
| 2| Andy⽤户| 60000.0|
| 3| Justin⽤户| 12000.0|
| 4| Kaine⽤户| 60000.0|
| 5| Lisa⽤户| 12000.0|
+---+------------------+------------+
where
Similar to WHERE in SQL; used to filter the query result. The operator accepts either a Column condition or a condition expression string.
userDataFrame.select($"id",$"name",$"age",$"salary",$"salary" * 12)
.where($"name" like "%a%")
.show()
Equivalent form:
userDataFrame.select($"id",$"name",$"age",$"salary",$"salary" * 12)
.where("name like '%a%'") // where expression (string form)
.show()
Note: avoid Chinese characters in column aliases in Spark; referencing a Chinese alias inside a where expression string triggers a bug.
userDataFrame.select($"id",$"name",$"age",$"salary",$"salary" * 12 as "annal_salary")
.where("(name like '%a%') and (annal_salary > 12000)" )
.show() //正常
userDataFrame.select($"id",$"name",$"age",$"salary",$"salary" * 12 as "年薪")
.where("(name like '%a%') and ('年薪' > 12000)" )
.show()//错误
userDataFrame.select($"id",$"name",$"age",$"salary",$"salary" * 12 as "年薪")
.where($"name" like "%a%" and $"年薪" > 12000 )
.show() //正常
withColumn
Adds a column to the DataFrame:
userDataFrame.select($"id",$"name",$"age",$"salary",$"sex")
.withColumn("年薪",$"salary" * 12)
.show()
+---+-------+---+------+-----+-------+
| id| name|age|salary| sex| 年薪|
+---+-------+---+------+-----+-------+
| 1|Michael| 29|2000.0|false|24000.0|
| 2| Andy| 30|5000.0| true|60000.0|
| 3| Justin| 19|1000.0| true|12000.0|
| 4| Kaine| 20|5000.0|false|60000.0|
| 5| Lisa| 19|1000.0|false|12000.0|
+---+-------+---+------+-----+-------+
When combining select and withColumn, note that once select has given a column an alias, any later expression (for example in withColumn) must refer to the alias rather than the original column name.
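A small illustration of that note (a sketch; the alias names are arbitrary):

// After select renames salary to "sal", later expressions must use "sal";
// referring to $"salary" at this point would fail to resolve.
userDataFrame.select($"id", $"name", $"salary" as "sal")
  .withColumn("annual_salary", $"sal" * 12)
  .show()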
withColumnRenamed
Renames an existing column:
userDataFrame.select($"id",$"name",$"age",$"salary",$"sex")
.withColumn("年薪",$"salary" * 12)
.withColumnRenamed("年薪","annal_salary")
.withColumnRenamed("id","uid")
.show()
+---+-------+---+------+-----+------------+
|uid| name|age|salary| sex|annal_salary|
+---+-------+---+------+-----+------------+
| 1|Michael| 29|2000.0|false| 24000.0|
| 2| Andy| 30|5000.0| true| 60000.0|
| 3| Justin| 19|1000.0| true| 12000.0|
| 4| Kaine| 20|5000.0|false| 60000.0|
| 5| Lisa| 19|1000.0|false| 12000.0|
+---+-------+---+------+-----+------------+
groupBy
Works exactly like GROUP BY in SQL and is usually combined with aggregate functions.
userDataFrame.select($"id",$"name",$"age",$"salary",$"sex")
.groupBy($"sex")
.mean("salary") //计算平均值等价avg
.show()
+-----+------------------+
| sex| avg(salary)|
+-----+------------------+
| true| 3000.0|
|false|2666.6666666666665|
+-----+------------------+
There are similar max, min, sum and avg operators, but each of them applies only one kind of aggregate after groupBy. To compute several aggregates at once, use the agg operator.
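For reference, a quick sketch of those single-purpose operators; each call applies only one kind of aggregate to the listed columns:

userDataFrame.groupBy($"sex").max("salary", "age").show()
userDataFrame.groupBy($"sex").sum("salary").show()
userDataFrame.groupBy($"sex").count().show()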
agg
Must follow groupBy; lets you call several aggregate functions at once to compute the sum, maximum, minimum, average and so on of the chosen columns.
import org.apache.spark.sql.functions._ // note the required import
userDataFrame.select($"id",$"name",$"age",$"salary",$"sex")
.groupBy($"sex")
.agg(sum("salary") as "sum", avg("salary") as "avg",max("salary") as
"max",min("salary") as "min")
.show()
+-----+------+------------------+------+------+
| sex| sum| avg| max| min|
+-----+------+------------------+------+------+
| true|6000.0| 3000.0|5000.0|1000.0|
|false|8000.0|2666.6666666666665|5000.0|1000.0|
+-----+------+------------------+------+------+
Or:
userDataFrame.select($"id",$"name",$"age",$"salary",$"sex")
.groupBy($"sex")
.agg("salary"->"sum","salary"->"avg","salary"->"max","salary"->"min")
.show()
Window functions
Use over to apply an aggregate or ranking function over a window:
import org.apache.spark.sql.functions._
val w = Window.partitionBy("sex")
.orderBy($"salary" desc)
.rowsBetween(Window.unboundedPreceding,Window.currentRow)
userDataFrame.select($"id",$"name",$"age",$"salary",$"sex")
.withColumn("salary_rank",dense_rank() over (w))
.show()
The equivalent SQL:
select id,name,...,dense_rank() over(partition by sex order by salary desc rows between unbounded preceding and current row) from t_user
cube
Performs multi-dimensional (cube) aggregation:
import org.apache.spark.sql.functions._
spark.sparkContext.makeRDD(List((110,50,80),(120,60,95),(120,50,96)))
.toDF("height","weight","score")
.cube($"height",$"weight")
.agg(avg("score"),max("score"))
.show()
+------+------+-----------------+----------+
|height|weight| avg(score)|max(score)|
+------+------+-----------------+----------+
| 110| 50| 80.0| 80|
| 120| null| 95.5| 96|
| 120| 60| 95.0| 95|
| null| 60| 95.0| 95|
| null| null|90.33333333333333| 96|
| 120| 50| 96.0| 96|
| 110| null| 80.0| 80|
| null| 50| 88.0| 96|
+------+------+-----------------+----------+
pivot
This operator comes from SQL Server and turns rows into columns.
pivot(value1, value2)
value1 == the column whose row values become the new columns
value2 == the collection of distinct values of that column
case class UserCost(id:Int,category:String,cost:Double)
var userCostRDD=spark.sparkContext.parallelize(List(
UserCost(1,"电⼦类",100),
UserCost(1,"电⼦类",20),
UserCost(1,"⺟婴类",100),
UserCost(1,"⽣活⽤品",100),
UserCost(2,"美⻝",79),
UserCost(2,"电⼦类",80),
UserCost(2,"⽣活⽤品",100)
))
var categories=userCostRDD.map(uc=>uc.category).distinct.collect()
userCostRDD.toDF("id","category","cost")
.groupBy("id")
.pivot($"category",categories)
.sum("cost")
.show()
+---+------+--------+------+----+
| id|⺟婴类|⽣活⽤品|电⼦类|美⻝|
+---+------+--------+------+----+
| 1| 100.0| 100.0| 120.0|null|
| 2| null| 100.0| 80.0|79.0|
+---+------+--------+------+----+
na
Provides automatic handling of null values (filling them with defaults or dropping them).
case class User01(id:Int,name:String,sex:Boolean,age:Int,salary:Double)
case class UserCost(id:Int,category:String,cost:Double)
var userCostRDD=spark.sparkContext.parallelize(List(
UserCost(1,"电⼦类",100),
UserCost(1,"电⼦类",20),
UserCost(1,"⺟婴类",100),
UserCost(1,"⽣活⽤品",100),
UserCost(2,"美⻝",79),
UserCost(2,"电⼦类",80),
UserCost(2,"⽣活⽤品",100)
))
var categories=userCostRDD.map(uc=>uc.category).distinct.collect()
userCostRDD.toDF("id","category","cost")
.groupBy("id")
.pivot($"category",categories)
.sum("cost")
.na.fill(0.0)
.show()
+---+------+--------+------+----+
| id|⺟婴类|⽣活⽤品|电⼦类|美⻝|
+---+------+--------+------+----+
| 1| 100.0| 100.0| 120.0| 0.0|
| 2| 0.0| 100.0| 80.0|79.0|
+---+------+--------+------+----+
Here, fill replaces null values with the given defaults.
var userCostRDD=spark.sparkContext.parallelize(List(
UserCost(1,"电⼦类",100),
UserCost(1,"电⼦类",20),
UserCost(1,"⺟婴类",100),
UserCost(1,"⽣活⽤品",100),
UserCost(2,"美⻝",79),
UserCost(2,"电⼦类",80),
UserCost(2,"⽣活⽤品",100)
))
var categories=userCostRDD.map(uc=>uc.category).distinct.collect()
userCostRDD.toDF("id","category","cost")
.groupBy("id")
.pivot($"category",categories)
.sum("cost")
.na.fill(Map("美⻝"-> -1,"⺟婴类"-> 1000))
.show()
+---+------+--------+------+----+
| id|⺟婴类|⽣活⽤品|电⼦类|美⻝|
+---+------+--------+------+----+
| 1| 100.0| 100.0| 120.0|-1.0|
| 2|1000.0| 100.0| 80.0|79.0|
+---+------+--------+------+----+
na can also be followed by the drop operator to remove rows containing null values:
var userCostRDD=spark.sparkContext.parallelize(List(
UserCost(1,"电⼦类",100),
UserCost(1,"电⼦类",20),
UserCost(1,"⺟婴类",100),
UserCost(1,"⽣活⽤品",100),
UserCost(2,"美⻝",79),
UserCost(2,"电⼦类",80),
UserCost(2,"⽣活⽤品",100)
))
var categories=userCostRDD.map(uc=>uc.category).distinct.collect()
userCostRDD.toDF("id","category","cost")
.groupBy("id")
.pivot($"category",categories)
.sum("cost")
//.na.drop(4)     // keep a row only if it has at least 4 non-null values
//.na.drop("any") // drop a row if any value is null; "all" drops it only when every value is null
.na.drop(List("美⻝","⺟婴类")) // drop a row if any of the listed columns is null
.show()
join
Works like a join in a relational database.
case class User01(id:Int,name:String,sex:Boolean,age:Int,salary:Double)
case class UserCost(id:Int,category:String,cost:Double)
var userCostRDD=spark.sparkContext.parallelize(List(
UserCost(1,"电脑配件",100),
UserCost(1,"⺟婴⽤品",100),
UserCost(1,"⽣活⽤品",100),
UserCost(2,"居家美⻝",79),
UserCost(2,"消费电⼦",80),
UserCost(2,"⽣活⽤品",100)
))
var userRDD=spark.sparkContext.parallelize(List(
User01(1,"张晓三",true,18,15000),
User01(2,"李晓四",true,18,18000),
User01(3,"王晓五",false,18,10000)
))
val categories = userCostRDD.map(_.category).distinct().collect()
userCostRDD.toDF("id","category","cost").groupBy("id")
.pivot($"category",categories)
.sum("cost")
.join(userRDD.toDF("id","name","sex","age","salary"),"id")
.na.fill(0.0)
.show()
// outer join
userCostRDD.toDF("id","category","cost").as("c").groupBy("id")
.pivot($"category",categories)
.sum("cost")
.join(userRDD.toDF("id","name","sex","age","salary").as("u"),$"c.id"===$"u.id",
"LEFT_OUTER")
.na.fill(0.0)
.show()
dropDuplicates
Removes duplicate records, similar to the DISTINCT keyword in SQL.
var userCostDF=spark.sparkContext.parallelize(List(
UserCost(1,"电脑配件",100),
UserCost(1,"母婴用品",100),
UserCost(1,"生活用品",100),
UserCost(2,"居家美食",79),
UserCost(2,"消费电子",80),
UserCost(2,"生活用品",100)
)).toDF().as("t_user_cost")
//userCostDF.dropDuplicates().show() // without arguments, a record is dropped only when all of its columns are duplicated
userCostDF.dropDuplicates("category").show() // e.g. deduplicate on the category column only
+---+--------+-----+
| id|category| cost|
+---+--------+-----+
| 2|居家美⻝| 79.0|
| 1|⺟婴⽤品|100.0|
| 1|⽣活⽤品|100.0|
| 2|消费电⼦| 80.0|
| 1|电脑配件|100.0|
+---+--------+-----+
drop
Removes the specified columns:
var userCostDF=spark.sparkContext.parallelize(List(
UserCost(1,"电脑配件",100),
UserCost(1,"⺟婴⽤品",100),
UserCost(1,"⽣活⽤品",100),
UserCost(2,"居家美⻝",79),
UserCost(2,"消费电⼦",80),
UserCost(2,"⽣活⽤品",100)
)).toDF()
userCostDF.drop("cost","id").dropDuplicates().show() //删除cost和id列,再对剩下的category列去重
orderBy
Like ORDER BY in SQL; specifies the sort columns:
val df=spark.sparkContext.parallelize(List((1,"TV,GAME"),(2,"SLEEP,FOOTBALL"))).toDF("id","hobbies")
df.orderBy($"id" asc)
.show()
+---+--------------+
| id| hobbies|
+---+--------------+
| 1| TV,GAME|
| 2|SLEEP,FOOTBALL|
+---+--------------+
limit
Similar to a database paging clause, but it can only cap the number of rows, like take(n) on an RDD: it returns the first n rows and cannot slice rows out of the middle (a workaround is sketched after the example below).
var userCostDF=spark.sparkContext.parallelize(List(
UserCost(1,"电脑配件",100),
UserCost(1,"⺟婴⽤品",100),
UserCost(1,"⽣活⽤品",100),
UserCost(2,"居家美⻝",79),
UserCost(2,"消费电⼦",80),
UserCost(2,"⽣活⽤品",100)
)).toDF()
userCostDF.orderBy($"id" asc).limit(3).show()
+---+--------+-----+
| id|category| cost|
+---+--------+-----+
| 1|电脑配件|100.0|
| 1|⺟婴⽤品|100.0|
| 1|⽣活⽤品|100.0|
+---+--------+-----+
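Because limit cannot slice from the middle, a common workaround (a sketch, reusing the window-function approach shown earlier; the rn column name is arbitrary) is to number the rows and filter on the number:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// keep only "rows 2 to 4" of the ordered result
val w = Window.orderBy($"id".asc, $"cost".desc)
userCostDF.withColumn("rn", row_number() over w)
  .where($"rn".between(2, 4))
  .drop("rn")
  .show()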
filter
Similar to where; drops the rows that do not meet the condition. You can pass an expression string, a Column condition, or a function.
var userCostDF=spark.sparkContext.parallelize(List(
UserCost(1,"电脑配件",100),
UserCost(1,"⺟婴⽤品",100),
UserCost(1,"⽣活⽤品",100),
UserCost(2,"居家美⻝",79),
UserCost(2,"消费电⼦",80),
UserCost(2,"⽣活⽤品",100)
)).toDF()
userCostDF.orderBy($"id" asc).limit(3).filter("category != '⺟婴⽤品' ").show()
//userCostDF.orderBy($"id" asc).limit(3).filter($"category" =!="⺟婴⽤品").show() 等价一
//userCostDF.orderBy($"id" asc).limit(3).filter(row=> ! row.getAs[String]("category").equals("⺟婴⽤品")).show() 等价二
+---+--------+-----+
| id|category| cost|
+---+--------+-----+
| 1|电脑配件|100.0|
| 1|⽣活⽤品|100.0|
+---+--------+-----+
map
Like map on an RDD; processes the Row objects of a DataFrame.
var userDF=spark.sparkContext.parallelize(List(
User01(1,"张晓三",true,18,15000),
User01(2,"李晓四",true,18,18000),
User01(3,"王晓五",false,18,10000)
)).toDF()
val dataset: Dataset[(Int,String,Double)] =
  userDF.map(row => (row.getAs[Int]("id"), row.getAs[String]("name"), row.getAs[Double]("salary")))
dataset.toDF("id","name","salary")
  .show()
+---+------+-------+
| id|  name| salary|
+---+------+-------+
|  1|张晓三|15000.0|
|  2|李晓四|18000.0|
|  3|王晓五|10000.0|
+---+------+-------+
rdd
Turns a Dataset[T] into an RDD[T], or a DataFrame into an RDD[Row]; a Row-based sketch follows the example below.
var userDF=spark.sparkContext.parallelize(List(
User01(1,"张晓三",true,18,15000),
User01(2,"李晓四",true,18,18000),
User01(3,"王晓五",false,18,10000)
)).toDF()
val dataset: Dataset[(Int,String,Double)] =
  userDF.map(row => (row.getAs[Int]("id"), row.getAs[String]("name"), row.getAs[Double]("salary")))
dataset.rdd.foreach(t=>println(t._1+" "+t._2+" "+t._3))
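The Row-based variant looks like this (a sketch using the same userDF): calling .rdd on the DataFrame itself yields an RDD[Row], whose fields are read by name with getAs (or positionally):

userDF.rdd.foreach(row =>
  println(row.getAs[Int]("id") + " " + row.getAs[String]("name") + " " + row.getAs[Double]("salary")))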
Dataset/DataFrame SQL operations
Prepare the data
t_user
Michael,29,20000,true,MANAGER,1
Andy,30,15000,true,SALESMAN,1
Justin,19,8000,true,CLERK,1
Kaine,20,20000,true,MANAGER,2
Lisa,19,18000,false,SALESMAN,2
t_dept
1,研发
2,设计
3,产品
Upload the data above to HDFS.
Build the SparkSession
//1. Create the SparkSession
val spark = SparkSession.builder()
  .appName("hellosql")
  .master("local[10]")
  .getOrCreate()
// Import the implicit conversions that turn collections/RDDs into DataFrame/Dataset
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")

// case classes matching the two data files (define them at top level, outside the method)
case class User(name: String, age: Int, salary: Double, sex: Boolean, job: String, deptNo: Int)
case class Dept(deptNo: Int, dname: String)

val userDF = spark.sparkContext.textFile("hdfs://centos:9000/demo/user")
  .map(line => line.split(","))
  .map(ts => User(ts(0), ts(1).toInt, ts(2).toDouble, ts(3).toBoolean, ts(4), ts(5).toInt))
  .toDF()
val deptDF = spark.sparkContext.textFile("hdfs://centos:9000/demo/dept")
  .map(line => line.split(","))
  .map(ts => Dept(ts(0).toInt, ts(1)))
  .toDF()
userDF.show()
deptDF.show()
// Close the SparkSession
spark.close()
Register the views so the data can be queried with SQL:
userDF.createOrReplaceTempView("t_user")
deptDF.createOrReplaceTempView("t_dept")
Execute SQL
var sql=
"""
select *, salary * 12 as annual_salary from t_user
"""
spark.sql(sql).show()
Single-table query
select *, salary * 12 as annual_salary from t_user
+-------+---+-------+-----+--------+------+-------------+
| name|age| salary| sex| job|deptNo|annual_salary|
+-------+---+-------+-----+--------+------+-------------+
|Michael| 29|20000.0| true| MANAGER| 1| 240000.0|
| Andy| 30|15000.0| true|SALESMAN| 1| 180000.0|
| Justin| 19| 8000.0| true| CLERK| 1| 96000.0|
| Kaine| 20|20000.0| true| MANAGER| 2| 240000.0|
| Lisa| 19|18000.0|false|SALESMAN| 2| 216000.0|
+-------+---+-------+-----+--------+------+-------------+
LIKE (fuzzy match)
select *, salary * 12 as annual_salary from t_user where name like '%a%'
+-------+---+-------+-----+--------+------+-------------+
| name|age| salary| sex| job|deptNo|annual_salary|
+-------+---+-------+-----+--------+------+-------------+
|Michael| 29|20000.0| true| MANAGER| 1| 240000.0|
| Kaine| 20|20000.0| true| MANAGER| 2| 240000.0|
| Lisa| 19|18000.0|false|SALESMAN| 2| 216000.0|
+-------+---+-------+-----+--------+------+-------------+
Sorting
select * from t_user order by deptNo asc,salary desc
+-------+---+-------+-----+--------+------+
| name|age| salary| sex| job|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER| 1|
| Andy| 30|15000.0| true|SALESMAN| 1|
| Justin| 19| 8000.0| true| CLERK| 1|
| Kaine| 20|20000.0| true| MANAGER| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2|
+-------+---+-------+-----+--------+------+
LIMIT
Can only return the first n rows; it cannot slice rows out of the middle of the result.
select * from t_user order by deptNo asc,salary desc limit 3
+-------+---+-------+----+--------+------+
| name|age| salary| sex| job|deptNo|
+-------+---+-------+----+--------+------+
|Michael| 29|20000.0|true| MANAGER| 1|
| Andy| 30|15000.0|true|SALESMAN| 1|
| Justin| 19| 8000.0|true| CLERK| 1|
+-------+---+-------+----+--------+------+
Grouping
select deptNo,avg(salary) avg from t_user group by deptNo
+------+------------------+
|deptNo| avg|
+------+------------------+
| 1|14333.333333333334|
| 2| 19000.0|
+------+------------------+
HAVING filter
select deptNo,avg(salary) avg from t_user group by deptNo having avg > 15000
+------+-------+
|deptNo| avg|
+------+-------+
| 2|19000.0|
+------+-------+
case-when
select deptNo,name,salary,sex,
(case sex when true then '男' else '⼥' end ) as user_sex,
(case when salary >= 20000 then '⾼' when salary >= 15000 then '中' else '低' end ) as level
from t_user
+------+-------+-------+-----+--------+-----+
|deptNo| name| salary| sex|user_sex|level|
+------+-------+-------+-----+--------+-----+
| 1|Michael|20000.0| true| 男| ⾼|
| 1| Andy|15000.0| true| 男| 中|
| 1| Justin| 8000.0| true| 男| 低|
| 2| Kaine|20000.0| true| 男| ⾼|
| 2| Lisa|18000.0|false| ⼥| 中|
+------+-------+-------+-----+--------+-----+
Note: column aliases must not contain Chinese characters.
Row to column (manual pivot)
val coursedf = spark.sparkContext.parallelize(List(
  (1, "语文", 100),
  (1, "数学", 100),
  (1, "英语", 100),
  (2, "数学", 79),
  (2, "语文", 80),
  (2, "英语", 100)
)).toDF("id","course","score")
coursedf.createOrReplaceTempView("t_course")
// Step 1
select id,
(case when course == '语文' then score else 0 end) Chanese,
(case when course == '数学' then score else 0 end) Math,
(case when course == '英语' then score else 0 end) Englist
from t_course
// Step 2
select id,
sum(case course when '语文' then score else 0 end) as chinese,
sum(case course when '数学' then score else 0 end) as math,
sum(case course when '英语' then score else 0 end) as english
from t_course
group by id
# Explanation
First drop the sum function and the group by and simply project id, chinese, math and english (step 1); then group by id and take the sum (or max) of each projected column (step 2).
Original table
+---+------+-----+
| id|course|score|
+---+------+-----+
| 1| 语文| 100|
| 1| 数学| 100|
| 1| 英语| 100|
| 2| 数学| 79|
| 2| 语文| 80|
| 2| 英语| 100|
+---+------+-----+
Result of step 1
+---+-------+----+-------+
| id|Chanese|Math|Englist|
+---+-------+----+-------+
| 1| 100| 0| 0|
| 1| 0| 100| 0|
| 1| 0| 0| 100|
| 2| 0| 79| 0|
| 2| 80| 0| 0|
| 2| 0| 0| 100|
+---+-------+----+-------+
Result of step 2
+---+-------+----+-------+
| id|chinese|math|english|
+---+-------+----+-------+
| 1| 100| 100| 100|
| 2| 80| 79| 100|
+---+-------+----+-------+
Row to column using pivot
select * from t_course pivot(max(score) for course in ('数学','语文','英语'))
Table join
select u.*,d.dname from t_user u left join t_dept d on u.deptNo=d.deptNo
+-------+---+-------+-----+--------+------+-----+
| name|age| salary| sex| job|deptNo|dname|
+-------+---+-------+-----+--------+------+-----+
|Michael| 29|20000.0| true| MANAGER| 1| 研发|
| Andy| 30|15000.0| true|SALESMAN| 1| 研发|
| Justin| 19| 8000.0| true| CLERK| 1| 研发|
| Kaine| 20|20000.0| true| MANAGER| 2| 设计|
| Lisa| 19|18000.0|false|SALESMAN| 2| 设计|
+-------+---+-------+-----+--------+------+-----+
Subquery
select *,(select sum(t1.salary) from t_user t1 where (t1.deptNo = t2.deptNo) group by t1.deptNo) as total from t_user t2 left join t_dept d on t2.deptNo=d.deptNo order by t2.deptNo asc,t2.salary desc
+-------+---+-------+-----+--------+------+------+-----+-------+
| name|age| salary| sex| job|deptNo|deptNo|dname| total|
+-------+---+-------+-----+--------+------+------+-----+-------+
|Michael| 29|20000.0| true| MANAGER| 1| 1| 研发|43000.0|
| Andy| 30|15000.0| true|SALESMAN| 1| 1| 研发|43000.0|
| Justin| 19| 8000.0| true| CLERK| 1| 1| 研发|43000.0|
| Kaine| 20|20000.0| true| MANAGER| 2| 2| 设计|38000.0|
| Lisa| 19|18000.0|false|SALESMAN| 2| 2| 设计|38000.0|
+-------+---+-------+-----+--------+------+------+-----+-------+
Window functions
select *,rank() over(partition by t2.deptNo order by t2.salary desc) as rank from t_user t2 left join t_dept d on t2.deptNo=d.deptNo
+-------+---+-------+-----+--------+------+------+-----+----+
| name|age| salary| sex| job|deptNo|deptNo|dname|rank|
+-------+---+-------+-----+--------+------+------+-----+----+
|Michael| 29|20000.0| true| MANAGER| 1| 1| 研发| 1|
| Andy| 30|15000.0| true|SALESMAN| 1| 1| 研发| 2|
| Justin| 19| 8000.0| true| CLERK| 1| 1| 研发| 3|
| Kaine| 20|20000.0| true| MANAGER| 2| 2| 设计| 1|
| Lisa| 19|18000.0|false|SALESMAN| 2| 2| 设计| 2|
+-------+---+-------+-----+--------+------+------+-----+----+
CUBE analysis
select deptNo,job,max(salary),avg(salary) from t_user group by deptNo,job with cube
+------+--------+-----------+------------------+
|deptNo| job|max(salary)| avg(salary)|
+------+--------+-----------+------------------+
| 1|SALESMAN| 15000.0| 15000.0|
| 1| null| 20000.0|14333.333333333334|
| null| null| 20000.0| 16200.0|
| null|SALESMAN| 18000.0| 16500.0|
| 1| CLERK| 8000.0| 8000.0|
| 2| MANAGER| 20000.0| 20000.0|
| 2| null| 20000.0| 19000.0|
| null| MANAGER| 20000.0| 20000.0|
| null| CLERK| 8000.0| 8000.0|
| 2|SALESMAN| 18000.0| 18000.0|
| 1| MANAGER| 20000.0| 20000.0|
+------+--------+-----------+------------------+
Equivalent form:
select deptNo,job,max(salary),avg(salary) from t_user group by cube(deptNo,job)
User-defined functions
Spark ships with many built-in functions, all defined in the org.apache.spark.sql.functions object. If they do not cover your needs, you can extend the Spark function library yourself.
Single-row functions
1. Define the functions
val sexFunction=(sex:Boolean)=> sex match {
case true => "男"
case false => "⼥"
case default => "未知"
}
val commFunction=(age:Int,salary:Double)=> {
if(age>=30){
salary+500
}else{
salary
}
}
2. Register the functions
spark.udf.register("sexFunction",sexFunction)
spark.udf.register("commFunction",commFunction)
3. Use the functions in a SQL query
select name,sexFunction(sex),age,salary,job,commFunction(age,salary) as comm from t_user
+-------+--------------------+---+-------+--------+-------+
| name|UDF:sexFunction(sex)|age| salary| job| comm|
+-------+--------------------+---+-------+--------+-------+
|Michael| 男| 29|20000.0| MANAGER|20000.0|
| Andy| 男| 30|15000.0|SALESMAN|15500.0|
| Justin| 男| 19| 8000.0| CLERK| 8000.0|
| Kaine| 男| 20|20000.0| MANAGER|20000.0|
| Lisa| ⼥| 19|18000.0|SALESMAN|18000.0|
+-------+--------------------+---+-------+--------+-------+
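The same functions can also be used from the DataFrame API without going through SQL, by wrapping them with functions.udf (a sketch; the sexUdf/commUdf names are arbitrary):

import org.apache.spark.sql.functions.udf

val sexUdf  = udf(sexFunction)   // Boolean => String
val commUdf = udf(commFunction)  // (Int, Double) => Double

userDF.select($"name", sexUdf($"sex") as "sex", $"age", $"salary",
    commUdf($"age", $"salary") as "comm")
  .show()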
Aggregate functions (for reference)
① Untyped User-Defined Aggregate Functions
1. Define the aggregate function
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, StructField, StructType}
object CustomUserDefinedAggregateFunction extends UserDefinedAggregateFunction {
  // The input data type accepted by the function
  override def inputSchema: StructType = {
    StructType(StructField("inputColumn", DoubleType) :: Nil)
  }
  // The type of the intermediate aggregation buffer
  override def bufferSchema: StructType = {
    StructType(StructField("count", IntegerType) :: StructField("total", DoubleType) :: Nil)
  }
  // The type of the final result
  override def dataType: DataType = DoubleType
  // Whether the function always returns the same output for the same input
  override def deterministic: Boolean = true
  // Initialize the aggregation buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0   // running count
    buffer(1) = 0.0 // running total
  }
  // Fold one input row into the buffer
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    var historyCount = buffer.getInt(0)
    var historyTotal = buffer.getDouble(1)
    if (!input.isNullAt(0)) {
      historyTotal += input.getDouble(0)
      historyCount += 1
      buffer(0) = historyCount
      buffer(1) = historyTotal
    }
  }
  // Merge two partial buffers
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getInt(0) + buffer2.getInt(0)
    buffer1(1) = buffer1.getDouble(1) + buffer2.getDouble(1)
  }
  // Compute the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getDouble(1) / buffer.getInt(0)
  }
}
2. Register the aggregate function
spark.udf.register("custom_avg",CustomUserDefinedAggregateFunction)
3. Use the aggregate function
select deptNo,custom_avg(salary) from t_user group by deptNo
+------+-------------------------------------------+
|deptNo|customuserdefinedaggregatefunction$(salary)|
+------+-------------------------------------------+
| 1| 14333.333333333334|
| 2| 19000.0|
+------+-------------------------------------------+
Type-Safe
1. Define the aggregator
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.expressions.Aggregator

// intermediate buffer type
case class Average(total: Double, count: Int)

object CustomAggregator extends Aggregator[GenericRowWithSchema, Average, Double] {
  // the initial (zero) value
  override def zero: Average = Average(0.0, 0)
  // fold one input row into the partial result
  override def reduce(b: Average, a: GenericRowWithSchema): Average = {
    Average(b.total + a.getAs[Double]("salary"), b.count + 1)
  }
  // merge two partial results
  override def merge(b1: Average, b2: Average): Average = {
    Average(b1.total + b2.total, b1.count + b2.count)
  }
  // compute the final result
  override def finish(reduction: Average): Double = {
    reduction.total / reduction.count
  }
  // Encoder for the intermediate result type
  override def bufferEncoder: Encoder[Average] = Encoders.product[Average]
  // Encoder for the final result type
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
2. Declare the aggregator as a column
val averageSalary = CustomAggregator.toColumn.name("average_salary")
3. Use the aggregation
import org.apache.spark.sql.functions._
userDF.select("deptNo","salary")
.groupBy("deptNo")
.agg(averageSalary,avg("salary"))
.show()
+------+------------------+------------------+
|deptNo| average_salary| avg(salary)|
+------+------------------+------------------+
| 1|14333.333333333334|14333.333333333334|
| 2| 19000.0| 19000.0|
+------+------------------+------------------+
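For comparison, a hedged sketch of the fully typed variant, where the aggregator consumes a case class directly instead of GenericRowWithSchema; it reuses the Average buffer class above and assumes a User case class with a salary field, like the one used to build userDF:

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

object TypedAverage extends Aggregator[User, Average, Double] {
  override def zero: Average = Average(0.0, 0)
  override def reduce(b: Average, u: User): Average = Average(b.total + u.salary, b.count + 1)
  override def merge(b1: Average, b2: Average): Average = Average(b1.total + b2.total, b1.count + b2.count)
  override def finish(r: Average): Double = r.total / r.count
  override def bufferEncoder: Encoder[Average] = Encoders.product[Average]
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// applied to a typed Dataset[User]
userDF.as[User].select(TypedAverage.toColumn.name("average_salary")).show()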
Load & save
Parquet files
Parquet is just a storage format; it is language- and platform-independent and not tied to any particular data processing framework. It is the default format Spark uses for load and save.
save
var sql=
"""
select * from t_user
"""
val result: DataFrame = spark.sql(sql)
result.write.save("hdfs://CentOS:9000/results/parquet")
load
val dataFrame = spark.read.load("hdfs://CentOS:9000/results/parquet")
dataFrame.printSchema()
dataFrame.show()
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- salary: double (nullable = true)
|-- sex: boolean (nullable = true)
|-- job: string (nullable = true)
|-- deptNo: integer (nullable = true)
+-------+---+-------+-----+--------+------+
| name|age| salary| sex| job|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER| 1|
| Andy| 30|15000.0| true|SALESMAN| 1|
| Justin| 19| 8000.0| true| CLERK| 1|
| Kaine| 20|20000.0| true| MANAGER| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2|
+-------+---+-------+-----+--------+------+
Equivalent form: spark.read.parquet("hdfs://CentOS:9000/results/parquet")
JSON format
save
var sql=
"""
select * from t_user
"""
val result: DataFrame = spark.sql(sql)
result.write
.format("json")
.mode(SaveMode.Overwrite) // other modes: Append, ErrorIfExists, Ignore
.save("hdfs://CentOS:9000/results/json")
load
val dataFrame = spark.read
.format("json")
.load("hdfs://CentOS:9000/results/json")
dataFrame.printSchema()
dataFrame.show()
You can also simply write spark.read.json("hdfs://CentOS:9000/results/json")
CSV format
save
var sql=
"""
select * from t_user
"""
val result: DataFrame = spark.sql(sql)
result.write
.format("csv")
.mode(SaveMode.Overwrite)      // save mode
.option("sep", ",")            // field separator
.option("inferSchema", "true") // infer column types from the data
.option("header", "true")      // write a header row
.save("hdfs://CentOS:9000/results/csv")
load
val dataFrame = spark.read
.format("csv")
.option("sep", ",")//指定分隔符
.option("inferSchema", "true")//参照表schema信息
.option("header", "true")//是否产⽣表头信息
.load("hdfs://CentOS:9000/results/csv")
root
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
|-- salary: double (nullable = true)
|-- sex: boolean (nullable = true)
|-- job: string (nullable = true)
|-- deptNo: integer (nullable = true)
+-------+---+-------+-----+--------+------+
| name|age| salary| sex| job|deptNo|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER| 1|
| Andy| 30|15000.0| true|SALESMAN| 1|
| Justin| 19| 8000.0| true| CLERK| 1|
| Kaine| 20|20000.0| true| MANAGER| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2|
+-------+---+-------+-----+--------+------+
ORC format
ORC (Optimized Row Columnar) is a columnar storage format from the Hadoop ecosystem. It appeared in early 2013, originating in Apache Hive, and was designed to reduce Hadoop storage space and speed up Hive queries.
save
var sql=
"""
select * from t_user
"""
val result: DataFrame = spark.sql(sql)
result.write
.format("orc")
.mode(SaveMode.Overwrite)
.save("hdfs://CentOS:9000/results/orc")
load
val dataFrame = spark.read
.orc("hdfs://CentOS:9000/results/orc")
dataFrame.printSchema()
dataFrame.show()
Reading files directly with SQL
val parqeutDF = spark.sql("SELECT * FROM parquet.`hdfs://CentOS:9000/results/parquet`")
val jsonDF = spark.sql("SELECT * FROM json.`hdfs://CentOS:9000/results/json`")
val orcDF = spark.sql("SELECT * FROM orc.`hdfs://CentOS:9000/results/orc/`")
//val csvDF = spark.sql("SELECT * FROM csv.`hdfs://CentOS:9000/results/csv/`")
//CSV is generally not read this way, because there is no way to control the header options
parqeutDF.show()
jsonDF.show()
orcDF.show()
// csvDF.show()
Reading and writing data via JDBC
Add the dependency:
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.48</version>
</dependency>
load
val dataFrame = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://CentOS:3306/test")
.option("dbtable", "t_user")
.option("user", "root")
.option("password", "root")
.load()
dataFrame.show()
+---+--------+-----+---+----------+
| id| name| sex|age| birthDay|
+---+--------+-----+---+----------+
| 0|zhangsan| true| 20|2020-01-11|
| 1| lisi|false| 25|2020-01-10|
| 3| wangwu| true| 36|2020-01-17|
| 4| zhao6|false| 50|1990-02-08|
| 5| win7| true| 20|1991-02-08|
| 6| win8|false| 28|2000-01-01|
+---+--------+-----+---+----------+
save
val props = new Properties()
props.put("user", "root")
props.put("password", "root")
result .write
.mode(SaveMode.Overwrite)
.jdbc("jdbc:mysql://CentOS:3306/test","t_user",props)
Spark creates the t_user table automatically.
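If you want to influence the schema of the auto-created table, the JDBC writer accepts a createTableColumnTypes option (a sketch; the column types shown are only an example):

result.write
  .mode(SaveMode.Overwrite)
  .option("createTableColumnTypes", "name VARCHAR(64), job VARCHAR(32)") // types for selected columns
  .jdbc("jdbc:mysql://CentOS:3306/test", "t_user", props)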
Spark & Hive integration
This is mostly used when an existing system already keeps its data in Hive; Spark can then read the Hive tables directly and run its computations on them.
Edit hive-site.xml
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://CentOS:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>root</value>
</property>
<!-- Enable the metastore service so that Spark can read Hive's metadata -->
<property>
<name>hive.metastore.uris</name>
<value>thrift://CentOS:9083</value>
</property>
Start the metastore service
[root@CentOS apache-hive-1.2.2-bin]# ./bin/hive --service metastore >/dev/null 2>&1 &
[1] 55017
Add the following dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.5</version>
</dependency>
Write the following code:
// Configure the SparkSession
val spark = SparkSession.builder()
.appName("Spark Hive Example")
.master("local[*]")
.config("hive.metastore.uris", "thrift://CentOS:9083")
.enableHiveSupport() // enable Hive support
.getOrCreate()
spark.sql("show databases").show()
spark.sql("use baizhi")
spark.sql("select * from t_emp").na.fill(0.0).show()
spark.close()
+-----+------+---------+----+-------------------+-------+-------+------+
|empno| ename| job| mgr| hiredate| sal| comm|deptno|
+-----+------+---------+----+-------------------+-------+-------+------+
| 7369| SMITH| CLERK|7902|1980-12-17 00:00:00| 800.00| 0.00| 20|
| 7499| ALLEN| SALESMAN|7698|1981-02-20 00:00:00|1600.00| 300.00| 30|
| 7521| WARD| SALESMAN|7698|1981-02-22 00:00:00|1250.00| 500.00| 30|
| 7566| JONES| MANAGER|7839|1981-04-02 00:00:00|2975.00| 0.00| 20|
| 7654|MARTIN| SALESMAN|7698|1981-09-28 00:00:00|1250.00|1400.00| 30|
| 7698| BLAKE| MANAGER|7839|1981-05-01 00:00:00|2850.00| 0.00| 30|
| 7782| CLARK| MANAGER|7839|1981-06-09 00:00:00|2450.00| 0.00| 10|
| 7788| SCOTT| ANALYST|7566|1987-04-19 00:00:00|1500.00| 0.00| 20|
| 7839| KING|PRESIDENT| 0|1981-11-17 00:00:00|5000.00| 0.00| 10|
| 7844|TURNER| SALESMAN|7698|1981-09-08 00:00:00|1500.00| 0.00| 30|
| 7876| ADAMS| CLERK|7788|1987-05-23 00:00:00|1100.00| 0.00| 20|
| 7900| JAMES| CLERK|7698|1981-12-03 00:00:00| 950.00| 0.00| 30|
| 7902| FORD| ANALYST|7566|1981-12-03 00:00:00|3000.00| 0.00| 20|
| 7934|MILLER| CLERK|7782|1982-01-23 00:00:00|1300.00| 0.00| 10|
+-----+------+---------+----+-------------------+-------+-------+------+