How to find the frequency of a word in a line, in a text file - Pyspark

Question

I have managed to build an RDD (in Pyspark) that looks like this:

[('This', (0, 1)), ('is', (0, 1)), ('the', (0, 1)), ('100th', (0, 1))...]

I used the following code:

RDD = sc.textFile(_filepath_)
test1 = RDD.zipWithIndex().flatMap(lambda x: ((i,(x[1],1)) for i in x[0].split(" ")))

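In effect, each element has the form (word, (line, freq)), so the words above come from the first line of the file (hence the 0), and freq is 1 for every word in the text. What I want is a count, across the whole RDD, of how many times each word appears on its line. I thought of .reduceByKey(lambda x, y: x + y), but when I follow it with an action such as .take(5), it freezes (Ubuntu terminal in Oracle VirtualBox with plenty of RAM and disk space, if that helps).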

What I basically need is this: if the word 'This' is on the first line and appears 7 times there, the result should be [('This', (0, 7)), ...]
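A note on the reduceByKey attempt: with the word alone as the key, the values being combined are (line, 1) tuples, and in Python `+` on two tuples concatenates them rather than adding element by element. The plain-Python sketch below reproduces only that combining step, using `functools.reduce` as a stand-in for what reduceByKey does to one key's values; the sample values are made up for illustration. Whether this is what caused the freeze is not established here, but it does show that the lambda cannot yield a per-line count.

```python
from functools import reduce

# Hypothetical values collected for the key 'This': (line_index, 1) pairs
values = [(0, 1), (0, 1), (3, 1)]

# Same lambda as in the question; on tuples, + concatenates
merged = reduce(lambda x, y: x + y, values)
print(merged)  # (0, 1, 0, 1, 3, 1)
```

The combined value grows with every occurrence of the word, and the line indexes are mixed together instead of being counted separately.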

Answer 1

Solved it, though the answer may not be optimal.

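RDD = sc.textFile(_filepath_)
# Emit (word, (line_index, 1)) for every word on every line
test1 = RDD.zipWithIndex().flatMap(lambda x: ((i,(x[1],1)) for i in x[0].split(" ")))
# Re-key by (word, line_index) so the counts are summed per word per line
test2 = test1.map(lambda x: ((x[0], x[1][0]), x[1][1])).reduceByKey(lambda x, y: x + y)
# Restore the (word, (line_index, count)) shape
Result_RDD = test2.map(lambda x: (x[0][0], (x[0][1], x[1])))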
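One possible alternative, offered as a sketch rather than a tested replacement: since each line arrives as a single RDD element, the per-line counting can be done inside the flatMap with `collections.Counter`, which would avoid the reduceByKey shuffle. The names `line_word_counts` and `word_frequency_per_line` are hypothetical helpers, not part of the original answer. Note also that this uses `split()` with no argument, which drops the empty strings that `split(" ")` produces for repeated spaces, so results can differ slightly from the code above.

```python
from collections import Counter

def line_word_counts(indexed_line):
    """Turn (line_text, line_index) into (word, (line_index, count)) pairs."""
    text, line_no = indexed_line
    # Count every word on this one line; split() ignores runs of whitespace
    counts = Counter(text.split())
    return [(word, (line_no, n)) for word, n in counts.items()]

def word_frequency_per_line(rdd):
    # Assumption: rdd is an RDD of lines, e.g. the result of sc.textFile(...)
    return rdd.zipWithIndex().flatMap(line_word_counts)

# The per-line helper can be checked without a Spark context
sample = line_word_counts(("This is This", 0))
print(sorted(sample))  # [('This', (0, 2)), ('is', (0, 1))]
```

With a Spark context available, `word_frequency_per_line(RDD).take(5)` should return elements in the same (word, (line, count)) shape as Result_RDD.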

python

word-count

apache-spark

rdd

pyspark