PySpark: create an RDD with the line number and the list of words in each line

Question

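I'm working with a plain text file and trying to create an RDD that consists of the line number and a list of the words contained in that line.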

I create the RDD as:

corpus = sc.textFile('article.txt')

Then I apply zipWithIndex and a map to get the line number and the text:

RDD2 = corpus.zipWithIndex().map(lambda x: (x[1], x[0]))
for element in RDD2.take(2):
    print(element)

This results in:

(0, 'This is the 100th Etext file presented by Project Gutenberg, and')
(1, 'is presented in cooperation with World Library, Inc., from their')

How can I convert the text into a list of words? Any suggestions would be appreciated.

Answer 1

You can split each line with split():

RDD2 = corpus.zipWithIndex().map(lambda x: (x[1], x[0].split()))
for element in RDD2.take(2):
    print(element)
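
To make the per-element transformation easier to check, here is a minimal sketch of the same logic as a plain Python function, run on an ordinary list instead of an RDD. The name `to_index_and_words` and the `sample_lines` list are hypothetical and only there for illustration; the list stands in for the (line, index) pairs that zipWithIndex() emits.

```python
def to_index_and_words(pair):
    """Map one (line, index) pair, as emitted by zipWithIndex(), to (index, [words])."""
    line, index = pair
    # str.split() with no argument splits on any run of whitespace
    # and never produces empty strings.
    return (index, line.split())


# Hypothetical sample data: the two lines shown in the question.
sample_lines = [
    'This is the 100th Etext file presented by Project Gutenberg, and',
    'is presented in cooperation with World Library, Inc., from their',
]

# Simulate zipWithIndex(): each element becomes (line, index).
pairs = [(line, i) for i, line in enumerate(sample_lines)]

result = [to_index_and_words(p) for p in pairs]
for element in result:
    print(element)
```

With Spark, the same function can be passed directly to map, as in corpus.zipWithIndex().map(to_index_and_words).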

Answer 2

If you want to use a DataFrame instead of an RDD:

from pyspark.sql import functions as sf
# RDD2 here is the (row_num, line) RDD from the question
df = spark.createDataFrame(RDD2, schema="row_num: int, line: string")  # convert to DataFrame
# split on whitespace (raw string avoids an invalid-escape warning) and drop the original line
df2 = df.withColumn("words", sf.split(df.line, r"\s+")).drop("line")
df2.show(10)
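
One thing to watch for: splitting on the regex `\s+` is not quite the same as Python's argument-less split(). If a line starts or ends with whitespace, a regex split keeps an empty string at that end. The sketch below uses Python's `re.split` as a stand-in to show the effect; that Spark's split behaves the same way on your data is an assumption worth verifying, and the sample `line` is made up.

```python
import re

# Hypothetical line with leading and trailing whitespace.
line = "  is presented in cooperation  "

# Regex split on whitespace runs: the leading and trailing
# whitespace each leave an empty string in the result.
regex_words = re.split(r"\s+", line)

# str.split() with no argument drops those empty strings.
plain_words = line.split()

# Stripping the line first makes the regex split agree with str.split().
trimmed_words = re.split(r"\s+", line.strip())

print(regex_words)
print(plain_words)
print(trimmed_words)
```

On the DataFrame side, wrapping the column in sf.trim(...) before splitting should remove leading and trailing spaces in the same spirit.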
