Read the content of Column<COLUMN-NAME> in pyspark
I am using Spark 1.5.0.
I created a DataFrame as shown below, and I am trying to read one of its columns:
>>> words = tokenizer.transform(sentenceData)
>>> words
DataFrame[label: bigint, sentence: string, words: array<string>]
>>> words['words']
Column<words>
I want to read out all the words (the vocab) from the sentences. How can I read them?
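For context: indexing the DataFrame with words['words'] only returns a Column expression, not the data itself; to see the values you have to run an action through select. A minimal sketch (assuming words is the tokenized DataFrame above):

words.select('words').show(truncate=False)  # print the token arrays per row
rows = words.select('words').collect()      # list of Row objects on the driver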
Edit 1: the error persists
I am now running into this error on Spark 2.0.0:
>>> wordsData.show()
+--------------------+--------------------+
| desc| words|
+--------------------+--------------------+
|Virat is good bat...|[virat, is, good,...|
| sachin was good| [sachin, was, good]|
|but modi sucks bi...|[but, modi, sucks...|
| I love the formulas|[i, love, the, fo...|
+--------------------+--------------------+
>>> wordsData
DataFrame[desc: string, words: array<string>]
>>> vocab = wordsData.select(explode('words')).rdd.flatMap(lambda x: x)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 305, in flatMap
    return self.mapPartitionsWithIndex(func, preservesPartitioning)
  File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 330, in mapPartitionsWithIndex
    return PipelinedRDD(self, f, preservesPartitioning)
  File "/opt/BIG-DATA/spark-2.0.0-bin-hadoop2.7/python/pyspark/rdd.py", line 2383, in __init__
    self._jrdd_deserializer = self.ctx.serializer
AttributeError: 'SparkSession' object has no attribute 'serializer'
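The traceback shows the failure happens inside the DataFrame-to-RDD conversion, where PipelinedRDD receives a SparkSession instead of a SparkContext; this looks like a bug in the 2.0.0 release rather than in the query itself. A possible workaround is to stay in the DataFrame API and avoid the .rdd conversion entirely. A sketch, assuming you only need the flattened words:

from pyspark.sql.functions import explode

# explode() already emits one row per array element, so no flatMap is needed
flat = wordsData.select(explode('words').alias('word'))
vocab = [r['word'] for r in flat.collect()]  # plain Python list of strings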
Edit: Resolution 1
You can do:
from pyspark.sql.functions import explode
# explode('words') turns each array element into its own row;
# flatMap then flattens the resulting Row objects into plain strings
words.select(explode('words')).rdd.flatMap(lambda x: x)
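That line only builds a transformation; to actually get the vocabulary back on the driver, finish with an action. A small sketch, assuming a Spark version where the .rdd conversion works and that you want unique words:

vocab = (words.select(explode('words'))
              .rdd.flatMap(lambda x: x)
              .distinct()   # drop repeated words
              .collect())
print(vocab)  # e.g. ['virat', 'is', 'good', ...]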