pyspark sample function, pyspark Column

Reposted

jimoshalengzhou 2023-08-10 13:13:15

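This section covers pyspark.sql.Column. The code in this post is based on Spark 2.4.4; functions may differ between versions, so refer to the official documentation for details. The data used in the examples can be downloaded by clicking here (extraction code: 2bd5). The dataset's column names are in Chinese and are kept as-is in the code: 对手 (opponent), 胜负 (win/loss), 主客场 (home/away), 命中 (shots made), 投篮数 (shot attempts), 投篮命中率 (field goal percentage), 3分命中率 (three-point percentage), 篮板 (rebounds), 助攻 (assists), 得分 (points).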

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('sparksqlColumn').getOrCreate()

# header=True treats the first line of the CSV as column names
df = spark.read.csv('../data/data.csv', header=True)

df.show(3)

+---+----+----+------+----+------+----------+-------------------+----+----+----+
|_c0|对手|胜负|主客场|命中|投篮数|投篮命中率|          3分命中率|篮板|助攻|得分|
+---+----+----+------+----+------+----------+-------------------+----+----+----+
|  0|勇士|  胜|    客|  10|    23|     0.435|              0.444|   6|  11|  27|
|  1|国王|  胜|    客|   8|    21|     0.381|0.28600000000000003|   3|   9|  27|
|  2|小牛|  胜|    主|  10|    19|     0.526|              0.462|   3|   7|  29|
+---+----+----+------+----+------+----------+-------------------+----+----+----+
only showing top 3 rows

df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- 对手: string (nullable = true)
 |-- 胜负: string (nullable = true)
 |-- 主客场: string (nullable = true)
 |-- 命中: string (nullable = true)
 |-- 投篮数: string (nullable = true)
 |-- 投篮命中率: string (nullable = true)
 |-- 3分命中率: string (nullable = true)
 |-- 篮板: string (nullable = true)
 |-- 助攻: string (nullable = true)
 |-- 得分: string (nullable = true)

from pyspark.sql.types import IntegerType, FloatType

# Every CSV column is read as a string, so cast the numeric ones explicitly
df = df.withColumn('命中', df['命中'].cast(IntegerType()))
df = df.withColumn('投篮数', df['投篮数'].cast(IntegerType()))
df = df.withColumn('投篮命中率', df['投篮命中率'].cast(FloatType()))
df = df.withColumn('3分命中率', df['3分命中率'].cast(FloatType()))
df = df.withColumn('篮板', df['篮板'].cast(IntegerType()))
df = df.withColumn('助攻', df['助攻'].cast(IntegerType()))
df = df.withColumn('得分', df['得分'].cast(IntegerType()))

df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- 对手: string (nullable = true)
 |-- 胜负: string (nullable = true)
 |-- 主客场: string (nullable = true)
 |-- 命中: integer (nullable = true)
 |-- 投篮数: integer (nullable = true)
 |-- 投篮命中率: float (nullable = true)
 |-- 3分命中率: float (nullable = true)
 |-- 篮板: integer (nullable = true)
 |-- 助攻: integer (nullable = true)
 |-- 得分: integer (nullable = true)

alias

Gives the column an alias.

df.select(df['对手'].alias('比赛对手')).show(3)

+--------+
|比赛对手|
+--------+
|    勇士|
|    国王|
|    小牛|
+--------+
only showing top 3 rows

asc

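Sorts a column in ascending order.

asc_nulls_first(): ascending, with null values first
asc_nulls_last(): ascending, with null values last

# Sort by 得分 (points) in ascending order and print the opponent and points for the first 5 rows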
df.select('对手', '得分').orderBy(df['得分'].asc()).show(5)

+------+----+
|  对手|得分|
+------+----+
|  灰熊|  20|
|  掘金|  21|
|  灰熊|  22|
|  鹈鹕|  26|
|步行者|  26|
+------+----+
only showing top 5 rows

astype()

Converts the data type; it is an alias for cast.

between

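A boolean expression that evaluates to true if the value of this expression lies between the given lower and upper bounds. It can be used to filter the rows that satisfy the condition.

# Select the rows whose 得分 (points) is between 15 and 20 (bounds included)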
df1 = df.select('对手', df['得分'].between(15, 20).alias('selected_df'))
df1.filter(df1['selected_df'] == True).show()

+----+-----------+
|对手|selected_df|
+----+-----------+
|灰熊|       true|
+----+-----------+

bitwiseAND: bitwise AND

bitwiseOR: bitwise OR

bitwiseXOR: bitwise XOR

contains(other)

Tests whether the column contains other. Returns a boolean column based on a string match.

df.filter(df['对手'].contains('小')).show()

+---+----+----+------+----+------+----------+---------+----+----+----+
|_c0|对手|胜负|主客场|命中|投篮数|投篮命中率|3分命中率|篮板|助攻|得分|
+---+----+----+------+----+------+----------+---------+----+----+----+
|  2|小牛|  胜|    主|  10|    19|     0.526|    0.462|   3|   7|  29|
+---+----+----+------+----+------+----------+---------+----+----+----+

desc()

desc_nulls_first()

desc_nulls_last()

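Sort in descending order. The nulls_first and nulls_last variants place null values first or last, respectively.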

df.select('对手', '得分').orderBy(df['得分'].desc()).show(5)

+------+----+
|  对手|得分|
+------+----+
|  爵士|  56|
|开拓者|  48|
|  太阳|  48|
|  猛龙|  38|
|  灰熊|  38|
+------+----+
only showing top 5 rows

endswith(other)

Returns a boolean column that is true where the string ends with other.

df.filter(df['对手'].endswith('熊')).show()

+---+----+----+------+----+------+----------+---------+----+----+----+
|_c0|对手|胜负|主客场|命中|投篮数|投篮命中率|3分命中率|篮板|助攻|得分|
+---+----+----+------+----+------+----------+---------+----+----+----+
|  3|灰熊|  负|    主|   8|    20|       0.4|     0.25|   5|   8|  22|
|  6|灰熊|  负|    客|   6|    19|     0.316|    0.222|   4|   8|  20|
| 12|灰熊|  胜|    主|  11|    25|      0.44|    0.429|   4|   8|  38|
| 16|灰熊|  胜|    客|   9|    20|      0.45|      0.5|   5|   7|  29|
+---+----+----+------+----+------+----------+---------+----+----+----+

eqNullSafe(other)

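Null-safe equality test against a given value or null: unlike ==, it treats null as an ordinary value, so comparing null with null yields true rather than null.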

from pyspark.sql import Row

df1 = spark.createDataFrame([Row(id=1, value='foo'), Row(id=2, value=None)])
df1.select('id', 'value', df1.value.eqNullSafe('foo'), df1.value.eqNullSafe(None)).show()

+---+-----+---------------+----------------+
| id|value|(value <=> foo)|(value <=> NULL)|
+---+-----+---------------+----------------+
|  1|  foo|           true|           false|
|  2| null|          false|            true|
+---+-----+---------------+----------------+

isNotNull()

Returns True if the current expression is not null.

df1.select('id', 'value', df1.value.isNotNull()).show()

+---+-----+-------------------+
| id|value|(value IS NOT NULL)|
+---+-----+-------------------+
|  1|  foo|               true|
|  2| null|              false|
+---+-----+-------------------+

isNull()

Returns True if the current expression is null.

df1.select('id', 'value', df1.value.isNull()).show()

+---+-----+---------------+
| id|value|(value IS NULL)|
+---+-----+---------------+
|  1|  foo|          false|
|  2| null|           true|
+---+-----+---------------+

isin()

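A boolean expression that evaluates to true if the value of this expression is contained in the evaluated values of the arguments.

# Select the rows whose 对手 (opponent) is one of ['灰熊', '76人', '骑士']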
df.filter(df['对手'].isin(['灰熊', '76人', '骑士'])).show()

+---+----+----+------+----+------+----------+---------+----+----+----+
|_c0|对手|胜负|主客场|命中|投篮数|投篮命中率|3分命中率|篮板|助攻|得分|
+---+----+----+------+----+------+----------+---------+----+----+----+
|  3|灰熊|  负|    主|   8|    20|       0.4|     0.25|   5|   8|  22|
|  4|76人|  胜|    客|  10|    20|       0.5|     0.25|   3|  13|  27|
|  6|灰熊|  负|    客|   6|    19|     0.316|    0.222|   4|   8|  20|
|  7|76人|  负|    主|   8|    21|     0.381|    0.429|   4|   7|  29|
| 11|骑士|  胜|    主|   8|    21|     0.381|    0.429|  11|  13|  35|
| 12|灰熊|  胜|    主|  11|    25|      0.44|    0.429|   4|   8|  38|
| 16|灰熊|  胜|    客|   9|    20|      0.45|      0.5|   5|   7|  29|
+---+----+----+------+----+------+----------+---------+----+----+----+

like(other)

Works like LIKE in SQL: returns a boolean column based on a SQL LIKE match.

# Return the rows whose 对手 (opponent) starts with '灰'
df.select('对手', '胜负', '主客场', '得分').where(df['对手'].like('灰%')).show()

+----+----+------+----+
|对手|胜负|主客场|得分|
+----+----+------+----+
|灰熊|  负|    主|  22|
|灰熊|  负|    客|  20|
|灰熊|  胜|    主|  38|
|灰熊|  胜|    客|  29|
+----+----+------+----+

otherwise(value)

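Evaluates a list of conditions and returns one of multiple possible result expressions. The value passed to otherwise() is used for rows that match none of the preceding when() conditions.

# Add a flag column: 1 when the opponent is 灰熊, 0 for every other opponent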
from pyspark.sql import functions as F
df.withColumn('flag', F.when(df['对手'] == '灰熊', 1).otherwise(0)).show(5)

+---+----+----+------+----+------+----------+---------+----+----+----+----+
|_c0|对手|胜负|主客场|命中|投篮数|投篮命中率|3分命中率|篮板|助攻|得分|flag|
+---+----+----+------+----+------+----------+---------+----+----+----+----+
|  0|勇士|  胜|    客|  10|    23|     0.435|    0.444|   6|  11|  27|   0|
|  1|国王|  胜|    客|   8|    21|     0.381|    0.286|   3|   9|  27|   0|
|  2|小牛|  胜|    主|  10|    19|     0.526|    0.462|   3|   7|  29|   0|
|  3|灰熊|  负|    主|   8|    20|       0.4|     0.25|   5|   8|  22|   1|
|  4|76人|  胜|    客|  10|    20|       0.5|     0.25|   3|  13|  27|   0|
+---+----+----+------+----+------+----------+---------+----+----+----+----+
only showing top 5 rows

rlike(other)

SQL RLIKE expression (LIKE with a regular expression). Returns a boolean column based on a regex match.

df.filter(df['对手'].rlike('熊$')).show()

+---+----+----+------+----+------+----------+---------+----+----+----+
|_c0|对手|胜负|主客场|命中|投篮数|投篮命中率|3分命中率|篮板|助攻|得分|
+---+----+----+------+----+------+----------+---------+----+----+----+
|  3|灰熊|  负|    主|   8|    20|       0.4|     0.25|   5|   8|  22|
|  6|灰熊|  负|    客|   6|    19|     0.316|    0.222|   4|   8|  20|
| 12|灰熊|  胜|    主|  11|    25|      0.44|    0.429|   4|   8|  38|
| 16|灰熊|  胜|    客|   9|    20|      0.45|      0.5|   5|   7|  29|
+---+----+----+------+----+------+----------+---------+----+----+----+

startswith(other)

Returns a boolean column that is True where the string starts with other.

df.select('对手', df['对手'].startswith('灰').alias('灰%')).show(5)

+----+-----+
|对手|  灰%|
+----+-----+
|勇士|false|
|国王|false|
|小牛|false|
|灰熊| true|
|76人|false|
+----+-----+
only showing top 5 rows

substr(startPos, length)

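Returns a Column that is a substring of this column. startPos is 1-based.

# Return the first character of the opponent name, aliased as '子串' (substring)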
df.select('对手', df['对手'].substr(1, 1).alias('子串')).show(3)

+----+----+
|对手|子串|
+----+----+
|勇士|  勇|
|国王|  国|
|小牛|  小|
+----+----+
only showing top 3 rows

when(condition, value)

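Evaluates a list of conditions and returns one of multiple possible result expressions. If Column.otherwise() is not invoked, None is returned for unmatched conditions.

# Mark opponents against whom 得分 (points) is greater than 25 with 1, otherwise 0; the flag column is named 'score_flag'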
from pyspark.sql import functions as F

df.select('对手','得分', F.when(df['得分'] > 25, 1).otherwise(0).alias('score_flag')).show(5)

+----+----+----------+
|对手|得分|score_flag|
+----+----+----------+
|勇士|  27|         1|
|国王|  27|         1|
|小牛|  29|         1|
|灰熊|  22|         0|
|76人|  27|         1|
+----+----+----------+
only showing top 5 rows

getItem(key)

An expression that gets the item at a given position out of a list, or gets an item by key out of a dict.

df1 = spark.createDataFrame([([1, 2], {'key': 'value'})], ['l', 'd'])
df1.select(df1.l.getItem(0), df1.d.getItem('key')).show()

+----+------+
|l[0]|d[key]|
+----+------+
|   1| value|
+----+------+

With thanks to the official documentation: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column
