Intro

    A look at pyspark join usage; honestly, the API doc may be all you need....

Building the test data

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def get_or_create(app_name):
    # Build (or reuse) a SparkSession with the configs used throughout this post.
    spark = (
        SparkSession.builder.appName(app_name)
        .config("spark.driver.maxResultSize", "10g")
        .config("spark.sql.execution.arrow.enabled", "true")
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.sql.crossJoin.enabled", "true")
        .config("spark.kryoserializer.buffer.max", "512m")
        .getOrCreate()
    )
    spark.sparkContext.setLogLevel("ERROR")
    return spark


# Two small pandas frames, converted to Spark DataFrames for the join examples.
df1 = pd.DataFrame({"name": ["A", "B"], "name1": ["A", "B"], "age": [10, 20]})
df2 = pd.DataFrame({"name": ["A"], "name1": ["A"], "sex": ["male"]})
spark = get_or_create("spark")
df_spark1 = spark.createDataFrame(df1)
df_spark2 = spark.createDataFrame(df2)

df_spark1.show(truncate=False)
+----+-----+---+
|name|name1|age|
+----+-----+---+
|A   |A    |10 |
|B   |B    |20 |
+----+-----+---+
df_spark2.show(truncate=False)
+----+-----+----+
|name|name1|sex |
+----+-----+----+
|A   |A    |male|
+----+-----+----+

join

The main point is how usage differs when the join column names are the same versus different. It is all in help(DataFrame.join); you can look it up yourself:

:param other: Right side of the join
:param on: a string for the join column name, a list of column names,
    a join expression (Column), or a list of Columns.
    If `on` is a string or a list of strings indicating the name of the join column(s),
    the column(s) must exist on both sides, and this performs an equi-join.
:param how: str, default ``inner``. Must be one of: ``inner``, ``cross``, ``outer``,
    ``full``, ``full_outer``, ``left``, ``left_outer``, ``right``, ``right_outer``,
    ``left_semi``, and ``left_anti``.
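
Side note (my addition, not in the original post): because get_or_create sets spark.sql.crossJoin.enabled to true, the ``cross`` strategy from the list above also works here. A minimal sketch:

# Cartesian product of the two frames: 2 rows x 1 row -> 2 rows, all columns kept.
df_spark1.crossJoin(df_spark2).show()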

When the join column names are the same

df_spark1.join(other=df_spark2, on=['name'], how='left').show()
+----+-----+---+-----+----+
|name|name1|age|name1| sex|
+----+-----+---+-----+----+
|   B|    B| 20| null|null|
|   A|    A| 10|    A|male|
+----+-----+---+-----+----+
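
Note that both inputs also carry a name1 column, so the result above ends up with two name1 columns. A minimal sketch (my addition, standard pyspark) of dropping the right-hand duplicate by column reference:

# Drop df_spark2's name1 from the joined result to avoid an ambiguous column.
df_spark1.join(df_spark2, on=['name'], how='left').drop(df_spark2.name1).show()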

When multiple join column names are the same

df_spark1.join(other=df_spark2, on=['name', 'name1'], how='left').show()
+----+-----+---+----+
|name|name1|age| sex|
+----+-----+---+----+
|   A|    A| 10|male|
|   B|    B| 20|null|
+----+-----+---+----+

When the join column names differ

df_spark1.join(other=df_spark2, on=[df_spark1.name == df_spark2.name1], how='left').show()
+----+-----+---+----+-----+----+
|name|name1|age|name|name1| sex|
+----+-----+---+----+-----+----+
|   B|    B| 20|null| null|null|
|   A|    A| 10|   A|    A|male|
+----+-----+---+----+-----+----+
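
Since both sides contribute identically named name and name1 columns here, referring to them after the join is ambiguous. A sketch (my addition, using standard DataFrame aliasing) that keeps only unambiguous columns:

from pyspark.sql import functions as F

# Alias each side so same-named columns can be addressed unambiguously.
joined = (
    df_spark1.alias("a")
    .join(df_spark2.alias("b"), on=F.col("a.name") == F.col("b.name1"), how="left")
    .select("a.name", "a.name1", "a.age", "b.sex")
)
joined.show()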

Multiple join conditions

df_spark1.join(other=df_spark2, on=[df_spark1.name == df_spark2.name1, df_spark1.name1 == df_spark2.name], how='left').show()
+----+-----+---+----+-----+----+
|name|name1|age|name|name1| sex|
+----+-----+---+----+-----+----+
|   A|    A| 10|   A|    A|male|
|   B|    B| 20|null| null|null|
+----+-----+---+----+-----+----+
df_spark1.join(other=df_spark2, on=[df_spark1.name == df_spark2.name1, df_spark1.name1 == df_spark2.name], how='outer').show()
+----+-----+---+----+-----+----+
|name|name1|age|name|name1| sex|
+----+-----+---+----+-----+----+
|   A|    A| 10|   A|    A|male|
|   B|    B| 20|null| null|null|
+----+-----+---+----+-----+----+
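
The docstring also lists ``left_semi`` and ``left_anti``. As a quick sketch (my addition, not in the original post): both only filter the left side and return no columns from the right.

# left_semi keeps left rows that match; left_anti keeps left rows that do not.
df_spark1.join(df_spark2, on=['name', 'name1'], how='left_semi').show()
df_spark1.join(df_spark2, on=['name', 'name1'], how='left_anti').show()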

That wraps up this brief introduction to join usage.

                                2022-08-04, Jiulong Lake, Jiangning District, Nanjing