AttributeError: 'DataFrame' object has no attribute 'map'


I want to convert a Spark DataFrame using the following code:

from pyspark.mllib.clustering import KMeans
spark_df = sqlContext.createDataFrame(pandas_df)
rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

The detailed error message is:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-11-a19a1763d3ac> in <module>()
      1 from pyspark.mllib.clustering import KMeans
      2 spark_df = sqlContext.createDataFrame(pandas_df)
----> 3 rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data]))
      4 model = KMeans.train(rdd, 2, maxIterations=10, runs=30, initializationMode="random")

/home/edamame/spark/spark-2.0.0-bin-hadoop2.6/python/pyspark/sql/dataframe.pyc in __getattr__(self, name)
    842         if name not in self.columns:
    843             raise AttributeError(
--> 844                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
    845         jc = self._jdf.apply(name)
    846         return Column(jc)

AttributeError: 'DataFrame' object has no attribute 'map'

Does anyone know what I am doing wrong here? Thanks!

You can't map a DataFrame directly, but you can convert it to an RDD and map over that with spark_df.rdd.map(). Prior to Spark 2.0, spark_df.map was an alias for spark_df.rdd.map(). With Spark 2.0, you must explicitly call .rdd first.
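A minimal sketch of that fix applied to the code from the question, assuming Spark 2.x and the RDD-based pyspark.mllib API. The helper names row_to_floats and train_kmeans are made up for illustration. The runs argument from the question is left out here because, as far as I know, it is deprecated and has no effect since Spark 2.0.

```python
def row_to_floats(row):
    """Convert one row (any iterable of numeric values) to a list of floats."""
    return [float(c) for c in row]


def train_kmeans(spark_df, k=2, max_iterations=10):
    """Train an MLlib KMeans model from a DataFrame with numeric columns only.

    Hypothetical helper for illustration; spark_df is an existing DataFrame.
    """
    # Imports live inside the function so row_to_floats works without Spark.
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    # DataFrame.map is gone in Spark 2.0, so go through .rdd explicitly.
    rdd = spark_df.rdd.map(lambda row: Vectors.dense(row_to_floats(row)))
    return KMeans.train(rdd, k, maxIterations=max_iterations,
                        initializationMode="random")
```

With the spark_df from the question, this would be called as model = train_kmeans(spark_df).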

You can use df.rdd.map(), since DataFrame does not have map or flatMap, but be aware of the implications of using df.rdd:

Converting to an RDD breaks DataFrame lineage: there is no predicate pushdown, no column pruning, no SQL plan, and PySpark transformations are less efficient.

What should you do instead?

Keep in mind that the high-level DataFrame API comes with many alternatives. First, you can use select or selectExpr.
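As a rough sketch of the selectExpr route, the float(c) conversion from the question could be expressed as a cast on every column, which keeps the work inside the DataFrame API. The helper names double_cast_exprs and cast_all_to_double are hypothetical, and the sketch assumes every column can be cast to a double.

```python
def double_cast_exprs(columns):
    """Build one SQL expression per column that casts it to DOUBLE."""
    return ["CAST(`{0}` AS DOUBLE) AS `{0}`".format(c) for c in columns]


def cast_all_to_double(spark_df):
    """Return a new DataFrame with every column cast to double.

    Hypothetical helper for illustration; spark_df is an existing DataFrame.
    """
    # selectExpr accepts SQL expression strings and stays inside the
    # DataFrame API, so no conversion to an RDD is involved.
    return spark_df.selectExpr(*double_cast_exprs(spark_df.columns))
```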

Another example is using explode instead of flatMap (which exists on RDDs):

df.select($"name",explode($"knownLanguages"))
    .show(false)

Result:

+-------+------+
|name   |col   |
+-------+------+
|James  |Java  |
|James  |Scala |
|Michael|Spark |
|Michael|Java  |
|Michael|null  |
|Robert |CSharp|
|Robert |      |
+-------+------+

Depending on the use case, you can also use withColumn, a UDF, or other options in the DataFrame API.
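A small sketch of the withColumn plus UDF combination, assuming a DataFrame with a string column called name as in the explode example. The names name_length and add_name_length are invented for illustration, and for this particular task the built-in length function would normally be preferable to a UDF; the UDF is used only to show the mechanism.

```python
def name_length(name):
    """Plain Python logic for the UDF: length of a string, None for nulls."""
    return None if name is None else len(name)


def add_name_length(spark_df):
    """Return spark_df with an extra integer column `name_length`.

    Hypothetical helper for illustration; assumes a string column `name`.
    """
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import IntegerType

    # Wrap the Python function as a UDF with an explicit return type.
    name_length_udf = udf(name_length, IntegerType())
    return spark_df.withColumn("name_length", name_length_udf(col("name")))
```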