pyspark AttributeError: 'NoneType' object has no attribute 'setCallSite'

bonelee 2023-05-31 10:28:09

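© Copyright belongs to the author. This is an original post by bonelee on the 51CTO blog; contact the author for permission before reposting, otherwise legal liability will be pursued.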

pyspark:

AttributeError: 'NoneType' object has no attribute 'setCallSite'

This turns out to be a pyspark bug. Workaround:

# Excerpt from a larger function: model, imsi_proc_df, spark_app and
# ConnectedGraph are defined elsewhere in the original program.
import time

from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

print("Approximately joining on distance smaller than 0.6:")
# Note: the threshold passed below is 1e6, not 0.6, so every candidate pair is kept.
distance_min = model.approxSimilarityJoin(imsi_proc_df, imsi_proc_df, 1e6, distCol="JaccardDistance") \
    .select(col("datasetA.id").alias("idA"),
            col("datasetB.id").alias("idB"),
            col("JaccardDistance"))  # .filter("idA=idB")
distance_min.show()  # show() prints the table itself and returns None
print("*" * 88)
imsi_proc_df.show()

key = Vectors.sparse(53, [1, 3], [1.0, 1.0])
model.approxNearestNeighbors(imsi_proc_df, key, 2).show()
print("start calculate find botnet!")
print("*" * 99)
print("time start:", time.time())
print(type(distance_min), dir(distance_min))
print(dir(distance_min.toLocalIterator))

# Workaround: rebind the DataFrame to the live session (spark_app)
# before iterating over it. These two lines fix the AttributeError.
distance_min.sql_ctx.sparkSession._jsparkSession = spark_app._jsparkSession
distance_min._sc = spark_app._sc

# Bound method, not yet called; the iterator is created in the loop below.
similarity_val_rdd = distance_min.toLocalIterator  # .collect()
print("time end:", time.time())
print(similarity_val_rdd)
print("*" * 99)
try:
    G = ConnectedGraph()
    ddos_ue_list = []
    for item in similarity_val_rdd():
        imsi, imsi2, jacard_similarity_val = item["idA"], item["idB"], item["JaccardDistance"]
        print("???", imsi, imsi2, jacard_similarity_val)

Description

Reproducing the bug from the example in the documentation:

import pyspark
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation
spark = pyspark.sql.SparkSession.builder.getOrCreate()
dataset = [[Vectors.dense([1, 0, 0, -2])],
 [Vectors.dense([4, 5, 0, 3])],
 [Vectors.dense([6, 7, 0, 8])],
 [Vectors.dense([9, 0, 0, 1])]]
dataset = spark.createDataFrame(dataset, ['features'])
df = Correlation.corr(dataset, 'features', 'pearson')
df.collect()

This produces the following stack trace:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-92-e7889fa5d198> in <module>()
     11 dataset = spark.createDataFrame(dataset, ['features'])
     12 df = Correlation.corr(dataset, 'features', 'pearson')
---> 13 df.collect()

/opt/spark/python/pyspark/sql/dataframe.py in collect(self)
    530         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    531         """
--> 532         with SCCallSiteSync(self._sc) as css:
    533             sock_info = self._jdf.collectToPython()
    534         return list(_load_from_socket(sock_info, BatchedSerializer(PickleSerializer())))

/opt/spark/python/pyspark/traceback_utils.py in __enter__(self)
     70     def __enter__(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
---> 72             self._context._jsc.setCallSite(self._call_site)
     73         SCCallSiteSync._spark_stack_depth += 1
     74 

AttributeError: 'NoneType' object has no attribute 'setCallSite'
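
The last frame of the trace shows the mechanism: `collect()` hands `self._sc` to `SCCallSiteSync`, whose `__enter__` calls `self._context._jsc.setCallSite(...)`. If `_jsc` is `None`, that attribute lookup fails. The sketch below imitates only this attribute chain with stand-in classes so it runs without Spark; `FakeContext`, `CallSiteSync` and `try_collect` are hypothetical names for illustration, not PySpark classes.

```python
class FakeContext:
    """Stand-in for a SparkContext; _jsc plays the role of the JVM handle."""

    def __init__(self, jsc=None):
        self._jsc = jsc


class CallSiteSync:
    """Simplified imitation of the context manager seen in the traceback."""

    def __init__(self, context, call_site="collect at <stdin>:1"):
        self._context = context
        self._call_site = call_site

    def __enter__(self):
        # Same attribute chain as traceback_utils.py line 72 in the trace.
        self._context._jsc.setCallSite(self._call_site)
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


def try_collect(context):
    """Return the error message raised on entry, or None if entry succeeds."""
    try:
        with CallSiteSync(context):
            pass
    except AttributeError as err:
        return str(err)
    return None


message = try_collect(FakeContext(jsc=None))
print(message)
```

Run as is, this prints the same `AttributeError` message as the trace, which suggests the DataFrame in the bug report is holding a context whose `_jsc` is `None`.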

Analysis:

Somehow the DataFrame's internal references `df.sql_ctx.sparkSession._jsparkSession` and `df._sc` do not match the ones held by the active Spark session (`spark._jsparkSession` and `spark._sc`).
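
One way to check for that mismatch is to compare the private handles directly. This is a minimal sketch, not an official diagnostic: `binding_report` is a hypothetical helper, the attributes it reads are private PySpark internals that may differ between versions, and the objects built at the bottom are `SimpleNamespace` stand-ins so the example runs without a cluster.

```python
from types import SimpleNamespace


def binding_report(df, spark):
    """Compare a DataFrame's private session handles with the live session's."""
    df_session = df.sql_ctx.sparkSession
    return {
        "same_jsparkSession": df_session._jsparkSession is spark._jsparkSession,
        "same_sc": df._sc is spark._sc,
        "df_jsc_is_none": getattr(df._sc, "_jsc", None) is None,
    }


# Stand-in objects (assumptions for illustration, not real Spark objects).
spark = SimpleNamespace(
    _jsparkSession=object(),
    _sc=SimpleNamespace(_jsc=object()),
)
df = SimpleNamespace(
    sql_ctx=SimpleNamespace(sparkSession=SimpleNamespace(_jsparkSession=object())),
    _sc=SimpleNamespace(_jsc=None),  # stale context with no JVM handle
)

report = binding_report(df, spark)
print(report)
```

With a real `df` and `spark`, a `False` in either of the first two entries or a `True` in the third would match the situation described above.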

The following code fixes the problem (I hope this helps you narrow down the root cause):

df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
df._sc = spark._sc

df.collect()

>>> [Row(pearson(features)=DenseMatrix(4, 4, [1.0, 0.0556, nan, 0.4005, 0.0556, 1.0, nan, 0.9136, nan, nan, 1.0, nan, 0.4005, 0.9136, nan, 1.0], False))]
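
If the workaround is needed in several places, the two assignments can be wrapped in a small function. This is a sketch under the same assumptions as the report: `rebind_to_session` is a hypothetical name, it touches private attributes that are not part of the public PySpark API, and the stand-in objects below replace a real `df` and `spark` only so the example runs on its own.

```python
from types import SimpleNamespace


def rebind_to_session(df, spark):
    """Point a DataFrame's private handles at the given session (workaround)."""
    df.sql_ctx.sparkSession._jsparkSession = spark._jsparkSession
    df._sc = spark._sc
    return df


# Stand-in objects (assumptions for illustration, not real Spark objects).
spark = SimpleNamespace(
    _jsparkSession=object(),
    _sc=SimpleNamespace(_jsc=object()),
)
df = SimpleNamespace(
    sql_ctx=SimpleNamespace(sparkSession=SimpleNamespace(_jsparkSession=None)),
    _sc=SimpleNamespace(_jsc=None),
)

rebind_to_session(df, spark)
rebound = df._sc is spark._sc and df._sc._jsc is not None
print(rebound)
```

With real objects the call would look like `rebind_to_session(df, spark).collect()`. Because it patches internals, it is best treated as a stopgap until the root cause is addressed.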