Spark SQL Application

Experiment Objectives

Gain a thorough understanding of common DataFrame operations and programming methods, and learn how to solve practical problems with Spark SQL.

Experiment Requirements

  1. Set up a Maven-based Scala and Spark SQL programming environment;
  2. Master DataFrame query methods.

Experiment Content

  1. Upload the Online Retail.csv file from Experiment 2 to HDFS.
  2. Configure the Spark SQL programming environment in Maven by adding the following to pom.xml:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.0.1</version>
</dependency>
If your installed Spark is not version 3.0.1, look up the spark-sql artifact version that matches your installation.
  3. Write code to read the CSV file into a DataFrame, with the Schema set as follows:
    Table 1: Schema settings
Column Name	Type	Nullable
InvoiceNo	StringType	false
StockCode	StringType	false
Description	StringType	false
Quantity	IntegerType	false
InvoiceDate	DateType	false
UnitPrice	DecimalType	false
CustomerID	StringType	false
Country	StringType	false
import java.text.SimpleDateFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val conf = new SparkConf().setAppName("Spark SQL").setMaster("local[*]")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().master("local[*]").appName("Spark SQL").getOrCreate()

// Read the CSV with a header row; DROPMALFORMED discards lines that cannot be parsed
val df = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "DROPMALFORMED")
  .load("hdfs://<hostname or IP>:<port>/<path to Online Retail.csv>")

// Drop rows that contain null values
val df_null = df.na.drop()

// Validation patterns for the integer, date (M/d/yyyy H:m) and floating-point columns
val intRegx = """^\d+$""".r
val timeRegx = """^\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{1,2}$""".r
val doubleRegx = """^\d+(\.\d+)?$""".r
val timeFormat = new SimpleDateFormat("M/d/yyyy H:m")

val rdd = df_null.rdd.map(x => (x.getString(0), x.getString(1), x.getString(2), x.getString(3), x.getString(4), x.getString(5), x.getString(6), x.getString(7)))
  .filter(x => !intRegx.findFirstIn(x._4).isEmpty && // filter Quantity by the integer pattern
    !timeRegx.findFirstIn(x._5).isEmpty && // filter InvoiceDate by the date pattern
    !doubleRegx.findFirstIn(x._6).isEmpty) // filter UnitPrice by the floating-point pattern
  .map(x => (x._1, x._2, x._3, x._4.toInt, new java.sql.Date(timeFormat.parse(x._5).getTime), x._6.toDouble, x._7, x._8))

// Target schema (DoubleType is used for UnitPrice here in place of the DecimalType listed in Table 1)
val schema = StructType(Array(
  StructField("InvoiceNo", StringType, false),
  StructField("StockCode", StringType, false),
  StructField("Description", StringType, false),
  StructField("Quantity", IntegerType, false),
  StructField("InvoiceDate", DateType, false),
  StructField("UnitPrice", DoubleType, false),
  StructField("CustomerID", StringType, false),
  StructField("Country", StringType, false)
))

val df_final = spark.createDataFrame(rdd.map(x => Row.fromTuple(x)), schema)
df_final.write.json("hdfs://<hostname or IP>:<port>/<output path>")
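Note that Spark writes the JSON output as a directory of part files rather than a single file. If a single file is preferred, the DataFrame can be coalesced to one partition first; the following is only a sketch under that assumption (the path is a placeholder):

df_final.coalesce(1) // merge into a single partition, and hence a single part file
  .write.mode("overwrite") // overwrite the target directory if it already exists
  .json("hdfs://<hostname or IP>:<port>/<output path>")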


(1) Requirement: no field may be empty. Rows that contain null values, or whose values cannot be converted to the target data type, are treated as invalid and removed.

val rdd = df_null.rdd.map(x => (x.getString(0), x.getString(1), x.getString(2), x.getString(3), x.getString(4), x.getString(5), x.getString(6), x.getString(7)))
  .filter(x => !intRegx.findFirstIn(x._4).isEmpty && // filter Quantity by the integer pattern
    !timeRegx.findFirstIn(x._5).isEmpty && // filter InvoiceDate by the date pattern
    !doubleRegx.findFirstIn(x._6).isEmpty) // filter UnitPrice by the floating-point pattern
  .map(x => (x._1, x._2, x._3, x._4.toInt, new java.sql.Date(timeFormat.parse(x._5).getTime), x._6.toDouble, x._7, x._8))

(2) Reading the file: either use Spark SQL's CSV reader to load the data directly into a DataFrame by column and then take its RDD, or read the CSV file into an RDD as plain text and split each line into an array on commas. With the second approach, watch out for fields that themselves contain commas or are wrapped in double quotes; a regular expression can be used to identify the columns correctly (see the sketch below).
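A minimal sketch of the second approach (illustrative only; the variable names and the lookahead regex are not part of the original lab code, and the path is a placeholder). It splits only on commas that are not inside double quotes:

val lines = sc.textFile("hdfs://<hostname or IP>:<port>/<path to Online Retail.csv>")
val header = lines.first() // header line to be skipped
val fields = lines.filter(_ != header)
  .map(_.split(""",(?=(?:[^"]*"[^"]*")*[^"]*$)""", -1) // split on commas outside double quotes
    .map(_.stripPrefix("\"").stripSuffix("\""))) // strip surrounding double quotes
  .filter(_.length == 8) // keep only rows with all 8 columns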

(3) Data conversion: use the RDD to perform the type conversions and drop invalid rows, then create a Schema and convert the RDD into a DataFrame according to that Schema.

val df_final = spark.createDataFrame(rdd.map(x => Row.fromTuple(x)), schema)
  4. In the program, save the converted DataFrame as JSON.
  5. Start spark-shell.
    (1) Load the file saved in step 4 into a DataFrame.
val df = spark.read.json("hdfs://主机名或ip地址:端口号/文件名")
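As a quick sanity check (not part of the original steps), the schema and a few rows of the reloaded DataFrame can be inspected; note that a JSON round trip does not preserve every type, e.g. InvoiceDate typically comes back as a string:

df.printSchema() // InvoiceDate may be read back as a string after the JSON round trip
df.show(5) // preview the first five rows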

(2) Run the following DataFrame queries (implement each both with transformation operators and with a SQL statement), and print the results with show.
① Query all products with a unit price below 0.2
First create a temporary view to run the SQL statements against:

df.createTempView("data") // createTempView returns Unit, so there is no value to assign

SQL

spark.sql("select Description, UnitPrice from data where UnitPrice<0.2").show()

Operators

df.selectExpr("Description", "UnitPrice").where("UnitPrice<0.2").show() // unit price below 0.2
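An equivalent form using typed Column expressions (an illustrative alternative, relying on the import spark.implicits._ that spark-shell provides automatically) is:

df.select($"Description", $"UnitPrice").where($"UnitPrice" < 0.2).show()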


② Query the customers for invoices 551845 to 551850
SQL

spark.sql("select CustomerID, InvoiceNo from data where InvoiceNo>=551845 and InvoiceNo<=551850").show()

Operators

df.selectExpr("CustomerID", "InvoiceNo").where("InvoiceNo>=551845 and InvoiceNo<=551850").show() // customers for invoices 551845 to 551850


③ Count how many distinct invoices (orders) the data contains
SQL

spark.sql("select count(distinct InvoiceNo) from data").show()

Operators

df.select(countDistinct("InvoiceNo")).show()
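The same number can also be obtained without an aggregate function (an illustrative alternative):

df.select("InvoiceNo").distinct().count()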


④ Find the maximum order amount, the largest number of items contained in a single order, and the largest number of distinct products contained in a single order
Maximum order amount (the line amounts are summed per invoice and sorted in descending order, so the first row of the result is the maximum)
SQL

spark.sql("select sum(Quantity*UnitPrice) as sumPrice from data group by InvoiceNo order by sum(Quantity*UnitPrice) desc").show()

Operators

df.groupBy("InvoiceNo").agg(sum(expr("Quantity*UnitPrice")).as("sumPrice")).orderBy(desc("sumPrice")).show() // per-order totals, largest first


Largest number of items in a single order
SQL

spark.sql("select InvoiceNo, sum(Quantity) as sumQuantity from data group by InvoiceNo order by sum(Quantity) desc").show()

Operators

df.groupBy("InvoiceNo").agg(sum("Quantity") as "sumQuantity").orderBy(desc("sumQuantity")).show()


Largest number of distinct products in a single order
SQL

spark.sql("select InvoiceNo, count(distinct StockCode) as productKinds from data group by InvoiceNo order by count(distinct StockCode) desc").show()

Operators

df.groupBy("InvoiceNo").agg(countDistinct("StockCode") as "productKinds").orderBy(desc("productKinds")).show()
