Spark's core data structure is the RDD, the abstraction for a distributed dataset. It is the low-level abstraction of the Spark engine and the foundation on which the other components of the Spark ecosystem are built. However, an RDD carries no metadata, which makes RDD programs hard to understand, inelegant, and dependent on hand-tuned optimization. To reduce that manual optimization effort, Spark SQL was introduced. Spark SQL's programming abstraction is the DataFrame, built on top of Spark Core; it attaches metadata (a schema) to RDDs, giving the distributed compute engine more opportunities to optimize programs automatically. This post covers only part of the material; later posts will flesh it out.
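As a quick illustration of the difference, here is a minimal sketch (it assumes a running SparkSession named spark, created in the setup code below; the data is made up): the RDD version works on opaque Python tuples the engine cannot inspect, while the DataFrame version carries named columns that Catalyst can use to optimize the plan automatically.

# RDD: the engine only sees opaque tuples and Python functions
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd_sums = rdd.reduceByKey(lambda x, y: x + y)

# DataFrame: named columns give the optimizer room to work
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], schema=["key", "value"])
df_sums = df.groupBy("key").sum("value")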

Spark SQL && Code

Spark SQL supports:
    SQL and SQL-like (HiveQL) queries; SQL can be used to run distributed data operations or large-scale analytics (see the sketch after this list);
    reading data from an existing Hive installation;
    integrating relational and procedural processing: implementing and optimizing complex logic, and scaling it out across a distributed compute setup;
    a DataFrames API that lets you focus on what the program should do rather than how;
    the Catalyst optimizer, the backbone of Spark SQL and DataFrames: it infers schemas for a variety of data formats, provides built-in compatibility, and controls optimized code generation.
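To make the first point concrete, here is a minimal sketch (the view name movies_view is illustrative; it assumes the movies DataFrame loaded later in this post): registering a DataFrame as a temporary view lets you query it with plain SQL, and explain() prints the plan Catalyst produced.

movies.createOrReplaceTempView("movies_view")
spark.sql("SELECT movie_title, produced_year FROM movies_view WHERE produced_year < 2000").show(5)
spark.sql("SELECT count(*) FROM movies_view").explain()  # show the physical plan chosen by Catalyst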

# Prerequisite: a cluster has been set up and Spark is running.
# Create the SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("spark://master:7077").appName("df_demo2").getOrCreate()

# Import the PySpark modules
from pyspark.sql.functions import *
from pyspark.sql.types import *
   
# Creating a DataFrame, approach 1
# Convert a list of tuples into a multi-column DataFrame
kvDF = spark.createDataFrame([(1,2),(3,4)], schema=["key","value"])

# kvDF.show()

# To list a DataFrame's column names, use the columns attribute
# print(kvDF.columns)

# Select specific columns (note the different ways to reference a column)
# kvDF.select("value").show()       # column referenced as a string
# kvDF.select(col("key")).show()    # column referenced as a Column
# kvDF.select(column("key")).show() # column referenced as a Column (column is an alias of col)

# type(kvDF.select(column("key")))
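Instead of a list of column names, createDataFrame also accepts an explicit schema, which is where the pyspark.sql.types import above comes in. A minimal sketch (the field names and types are illustrative):

kvSchema = StructType([
    StructField("key", IntegerType(), False),   # non-nullable integer column
    StructField("value", IntegerType(), True)   # nullable integer column
])
kvDF2 = spark.createDataFrame([(1, 2), (3, 4)], schema=kvSchema)
kvDF2.printSchema()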

# Creating a DataFrame, approach 2: reading a Parquet file
parquetFile = "/spark_demo/movies/movies.parquet"
movies = spark.read.parquet(parquetFile)
movies.printSchema()

movies.show(15, truncate=False)
# movies.show()   # defaults to 20 rows, with long values truncated
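Parquet is just one of the built-in data sources; the same reader interface also handles JSON, CSV, and others. A minimal sketch (the paths here are hypothetical):

# movies_json = spark.read.json("/spark_demo/movies/movies.json")
# movies_csv = spark.read.csv("/spark_demo/movies/movies.csv", header=True, inferSchema=True)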

# DataFrame transformation operations
# 1)select
# movies.select("movie_title","produced_year").show(5)
# Convert the production year into the decade it belongs to
movies2 = movies.select(col('movie_title'), (col('produced_year') - col('produced_year') % 10).alias("decade"))
movies2.show(5)

# 2)selectExpr
movies.selectExpr("*","(produced_year - (produced_year % 10)) as decade").show(5)

# Using SQL expressions and built-in functions
# movies.selectExpr("count(distinct(movie_title)) as movies","count(distinct(actor_name)) as actors").show(5)

# Convert the production year into the decade it belongs to
# movies2 = movies.selectExpr("movie_title", "(produced_year-produced_year%10) as produced_decade")
# movies2.show(5)

# 3)filter, where
# movies.filter('produced_year < 2000').show(5)
movies.where('produced_year < 2000').show(5)
# movies.filter('produced_year >= 2000').show(5)
# movies.where('produced_year >= 2000').show(5)

# Equality comparison
# movies.filter('produced_year = 2000').show(5)
# movies.where('produced_year = 2000').show(5)

# The operator for inequality comparison is !=
# movies.select("movie_title","produced_year").filter('produced_year != 2000').show(5)
# movies.select("movie_title","produced_year").where('produced_year != 2000').show(5)
# movies.select("movie_title","produced_year").filter(col('produced_year') != 2000).show(5)

# To combine comparison expressions, use the AND (&) and OR (|) operators on Column expressions
# movies.filter((col('produced_year') >= 2000) & (length('movie_title') < 5)).show(5)

# Another way to achieve the same result is to call filter twice
movies.filter('produced_year >= 2000').filter(length('movie_title') < 5).show(5)

# 4)distinct, dropDuplicates
# distinct considers all columns; dropDuplicates can deduplicate on a subset of columns
# movies.select("movie_title").distinct().selectExpr("count(movie_title) as movies").show()
movies.dropDuplicates(["movie_title"]).selectExpr("count(movie_title) as movies").show()

# 5)sort(columns), orderBy(columns)
movieTitles = movies.dropDuplicates(["movie_title"]) \
                    .selectExpr("movie_title", "length(movie_title) as title_length", "produced_year")
# movieTitles.show(5)

# movieTitles.sort('title_length').show(15)  # ascending by default

# movieTitles.sort(col('title_length').desc()).show(15, False)  # descending
movieTitles.sort(desc('title_length')).show(15, False)          # same as above, descending
                                                                
# descending
# movieTitles.orderBy(col('title_length').desc()).show(5)
                                                                
# Sort the two columns in different orders
# movieTitles.orderBy(col('title_length').desc(), 'produced_year').show(5)

# 6)limit(n)
# First create a DataFrame with actor names and the length of each name
actorNameDF = movies.select("actor_name") \
                    .distinct() \
                    .selectExpr("actor_name", "length(actor_name) as length")
# actorNameDF.show(5)         
    
# Sort the names by length and take the top 10
# actorNameDF.orderBy(col('length').desc()).limit(10).show(truncate=False)
actorNameDF.orderBy(desc('length')).limit(10).show(truncate=False)

from pyspark.sql import Row

# 7)union(otherDataFrame)
# We want to add a missing actor to the movie titled "12"
shortNameMovieDF = movies.where('movie_title == "12"')
# shortNameMovieDF.show()

# Create a DataFrame containing a single row
forgottenActor = [Row("Brychta, Edita", "12", 2007)]
forgottenActorRDD = spark.sparkContext.parallelize(forgottenActor)     # build an RDD

forgottenActorDF = spark.createDataFrame(forgottenActorRDD, shortNameMovieDF.schema)
forgottenActorDF.show()

# Now add the missing actor
completeShortNameMovieDF = shortNameMovieDF.union(forgottenActorDF)
completeShortNameMovieDF.show()
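Note that union matches columns by position, not by name, so both DataFrames must have the same column order. When the order may differ, unionByName (available since Spark 2.3) is a safer choice; a minimal sketch:

# completeShortNameMovieDF = shortNameMovieDF.unionByName(forgottenActorDF)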

# 8)withColumn(colName, column): add a new column to the DataFrame
# Add a new column computed from an expression over an existing column
# movies.withColumn("decade", (col('produced_year') - col('produced_year') % 10)).show(5)

# If the column already exists, withColumn replaces its values; here we overwrite produced_year
movies.withColumn("produced_year", (col('produced_year') - col('produced_year') % 10)).show(5)

# 9)withColumnRenamed(existingColName, newColName): rename a column (a no-op if the column does not exist)
movies.withColumnRenamed("actor_name", "actor") \
      .withColumnRenamed("movie_title", "title") \
      .withColumnRenamed("produced_year", "year") \
      .show(5)

# 10)drop(columnName1, columnName2): drop the specified columns
# Dropping a nonexistent column ("me" here) is silently ignored
movies.drop("actor_name", "me").printSchema()

movies.drop("actor_name", "me").show(5)

# 11)sample(withReplacement, fraction), sample(withReplacement, fraction, seed)
# Sampling without replacement, given a fraction
movies.sample(False, 0.0003).show(3)

# Sampling with replacement, given a fraction and a seed
movies.sample(True, 0.0003, 123456).show(3)

# 12)randomSplit(weights): randomly split the dataset
# The weights must be a list; they should sum to 1 (Spark normalizes them if they don't)
smallerMovieDFs = movies.randomSplit([0.6, 0.3, 0.1])
print(type(smallerMovieDFs))     # a Python list
print(type(smallerMovieDFs[0]))  # each element is a DataFrame

# Check that the per-part counts add up to the original row count
t1 = movies.count()           # 31393
print(t1)

t2 = smallerMovieDFs[0].count() # 18881
print(t2)

t3 = smallerMovieDFs[0].count() + smallerMovieDFs[1].count() + smallerMovieDFs[2].count() # 31393
print(t3)
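Because the split is random, the per-part counts vary from run to run, so the figures above are just one outcome. Passing a seed makes the split reproducible; a minimal sketch (the seed value is arbitrary):

# reproducibleSplits = movies.randomSplit([0.6, 0.3, 0.1], seed=123)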

# spark.stop()  # stop the Spark session