Reference: https://cloud.tencent.com/developer/article/1475487

Dataset: http://archive.ics.uci.edu/ml/datasets/Iris

1. Data Import

The original snippet was truncated (the `spark.read` call and the `df` definition were missing); a minimal reconstruction, assuming the Iris CSV sits at a local `iris.csv` path with a header row:

```scala
def createDFByCSV(spark: SparkSession): DataFrame = {
  val df = spark.read
    .option("header", true.toString)       // assumption: the file carries a header row
    .option("inferSchema", true.toString)  // automatically infer each column's data type
    .csv("iris.csv")                       // hypothetical local path to the dataset
  df.show()
  df
}
```

2. Computing Statistics with Spark SQL
2.1 Maximum and Minimum

```scala
import spark.implicits._

df.agg(max($"feature1") as "max_feature1",
       min($"feature2") as "min_feature2")
  .show()
```

Equivalently, using `col()` instead of the `$` interpolator:

```scala
df.agg(max(col("feature1")) as "max_feature1",
       min(col("feature2")) as "min_feature2")
  .show()
```

2.2 Mean

```scala
df.agg(mean($"feature1") as "mean_feature1",
       mean($"feature2") as "mean_feature2").show()
```

2.3 Sample & Population Standard Deviation

```scala
df.agg(stddev($"feature1") as "stddev_feature1",
       stddev_pop($"feature1") as "stddev_pop_feature1",
       stddev_samp($"feature1") as "stddev_samp_feature1").show()
```
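For reference, Spark's `stddev` is an alias for the sample standard deviation `stddev_samp` (divisor n − 1), while `stddev_pop` divides by n. A plain-Scala sketch of the two formulas, on illustrative values:

```scala
// Sample vs. population standard deviation (values are illustrative only)
val xs = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
val n = xs.length
val mean = xs.sum / n                           // 5.0
val ss = xs.map(x => math.pow(x - mean, 2)).sum // sum of squared deviations = 32.0

val stddevPop  = math.sqrt(ss / n)        // population: divide by n       -> 2.0
val stddevSamp = math.sqrt(ss / (n - 1))  // sample: divide by n - 1 (what stddev returns)

println(stddevPop)   // 2.0
println(stddevSamp)  // ≈ 2.138
```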

2.4 Median
Spark SQL also has no built-in function for the median, so we reuse the rank-and-interpolate approach from the previous article. A quick review:

1) Suppose n = 149: (n + 1) / 2 = 75, fractional part 0, so median = value at rank 75 × (1 − 0) + value at rank 76 × (0 − 0), i.e. exactly the value at rank 75.
2) Suppose n = 150: (n + 1) / 2 = 75.5, integer part 75, fractional part 0.5, so median = value at rank 75 × (1 − 0.5) + value at rank 76 × (0.5 − 0).
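The interpolation above can be sketched in plain Scala on a small sorted sample (the values and the `median` helper are illustrative, not part of the original article):

```scala
// Interpolated median: 1-based position pos = (n + 1) / 2,
// rank k = integer part of pos, f = fractional part,
// median = x(k) * (1 - f) + x(k + 1) * f
def median(sorted: Seq[Double]): Double = {
  val n = sorted.length
  val pos = (n + 1) / 2.0
  val k = pos.toInt            // rank of the lower neighbour
  val f = pos - k              // fractional part
  if (f == 0) sorted(k - 1)    // odd n: an exact element
  else sorted(k - 1) * (1 - f) + sorted(k) * f  // even n: interpolate
}

// n = 4 -> pos = 2.5 -> median = (2.0 + 3.0) / 2
println(median(Seq(1.0, 2.0, 3.0, 4.0)))  // 2.5
```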

First assign each row a rank ordered by the feature:

```scala
val windowFun = Window.orderBy(col("feature3").asc)
df.withColumn("rank", row_number().over(windowFun)).show(false)
```

Then compute the median's rank and fractional part:

```scala
val median_index = df.agg(
  ((count($"feature3") + 1) / 2).cast("int") as "rank",
  ((count($"feature3") + 1) / 2 % 1) as "float_part"
)

median_index.show()
```

Finally, join and interpolate. The interpolation needs the value at rank + 1, which `lead` over the same window supplies as `next_feature3`:

```scala
val median_index = df.agg(
  ((count($"feature3") + 1) / 2).cast("int") as "rank",
  ((count($"feature3") + 1) / 2 % 1) as "float_part"
)

val windowFun = Window.orderBy(col("feature3").asc)

df.withColumn("rank", row_number().over(windowFun))
  .withColumn("next_feature3", lead($"feature3", 1).over(windowFun))  // value at rank + 1
  .join(median_index, Seq("rank"), "inner")
  .withColumn("median", ($"float_part" - lit(0)) * $"next_feature3" + (lit(1) - $"float_part") * $"feature3")
  .show()
```

2.5 Quartiles

There are two common conventions for the quartile positions (1-based ranks). The first uses n + 1:

Q1 position = (n + 1) × 0.25
Q2 position = (n + 1) × 0.5
Q3 position = (n + 1) × 0.75

The second uses 1 + (n − 1):

Q1 position = 1 + (n − 1) × 0.25
Q2 position = 1 + (n − 1) × 0.5
Q3 position = 1 + (n − 1) × 0.75
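A plain-Scala sketch of the two position conventions side by side (the sample values and the `atPos` helper are illustrative only):

```scala
// Interpolated value at a 1-based position over a sorted sequence
def atPos(sorted: Seq[Double], pos: Double): Double = {
  val k = pos.toInt
  val f = pos - k
  if (f == 0) sorted(k - 1)
  else sorted(k - 1) * (1 - f) + sorted(k) * f
}

val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0)  // n = 7, already sorted
val n = xs.length

// Convention 1: position = (n + 1) * 0.25 = 2.0 -> Q1 = 2.0
println(atPos(xs, (n + 1) * 0.25))
// Convention 2: position = 1 + (n - 1) * 0.25 = 2.5 -> Q1 = 2.5
println(atPos(xs, 1 + (n - 1) * 0.25))
```

Note that the two conventions can genuinely disagree, as here, which is why the article implements both below.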

Q1 under the first convention, using the same rank-and-interpolate pattern as the median (again `lead` supplies `next_feature3`):

```scala
val q1_index = df.agg(
  ((count($"feature3") + 1) * 0.25).cast("int") as "rank",
  ((count($"feature3") + 1) * 0.25 % 1) as "float_part"
)

val windowFun = Window.orderBy(col("feature3").asc)

df.withColumn("rank", row_number().over(windowFun))
  .withColumn("next_feature3", lead($"feature3", 1).over(windowFun))  // value at rank + 1
  .join(q1_index, Seq("rank"), "inner")
  .withColumn("q1", ($"float_part" - lit(0)) * $"next_feature3" + (lit(1) - $"float_part") * $"feature3")
  .show()
```

And Q1 under the second convention:

```scala
val q1_index = df.agg(
  (((count($"feature3") - 1) * 0.25).cast("int") + 1) as "rank",
  ((count($"feature3") - 1) * 0.25 % 1) as "float_part"
)

val windowFun = Window.orderBy(col("feature3").asc)

df.withColumn("rank", row_number().over(windowFun))
  .withColumn("next_feature3", lead($"feature3", 1).over(windowFun))  // value at rank + 1
  .join(q1_index, Seq("rank"), "inner")
  .withColumn("q1", ($"float_part" - lit(0)) * $"next_feature3" + (lit(1) - $"float_part") * $"feature3")
  .show()
```

3. Pitfalls Summary