spark + python + filter issue

It would be much appreciated if someone could shed some light on the problem in the following code snippet.

lineStr = sc.textFile("/input/words.txt")
print (lineStr.collect())
['this file is created to count the no of texts', 'other wise i am just doing fine', 'lets see the output is there']

wc = lineStr.flatMap(lambda l: l.split(" ")).map(lambda x: (x, 1)).reduceByKey(lambda w, c: w + c)
print (wc.glom().collect())
[[('this', 1), ('there', 1), ('i', 1), ('texts', 1), ('just', 1), ('fine', 1), ('is', 2), ('other', 1), ('created', 1), ('count', 1), ('of', 1), ('am', 1), ('no', 1), ('output', 1)], [('lets', 1), ('see', 1), ('the', 2), ('file', 1), ('doing', 1), ('wise', 1), ('to', 1)]]
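
(For reference, the nested lists in that output come from glom(), which returns the RDD's elements grouped by partition; assuming the default partitioning of the text file, the RDD here has two partitions, which can be confirmed with:

print(wc.getNumPartitions())
# 2  -- one inner list per partition in the glom() output
)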

Now, when I try to filter the above dataset for counts greater than 1 using the following, I get an error:

s = wc.filter(lambda a,b:b>1)
print (s.collect())

error: vs = list(itertools.islice(iterator, batch))

TypeError: <lambda>() missing 1 required positional argument: 'b'

You cannot unpack a tuple in the parameter list of a lambda (Python 3 removed that syntax): lambda a, b: declares a function that takes two separate arguments, not a function that takes a single tuple as its argument.
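
The same error can be reproduced in plain Python, without Spark (a minimal sketch for illustration; the lambda and the values below are made up):

# Python 3 no longer allows tuple parameter unpacking (PEP 3113):
# f = lambda (a, b): b > 1      # SyntaxError in Python 3
f = lambda a, b: b > 1          # a function of two positional arguments
# f(('is', 2))                  # TypeError: <lambda>() missing 1 required positional argument: 'b'
print(f('is', 2))               # True -- works only when the two values are passed separately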

A simple fix is to accept the element as a single parameter and use indexing to access the second item of the tuple:

wc.filter(lambda t: t[1] > 1).collect()
# [('is', 2), ('the', 2)]
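
If you would rather keep the unpacking explicit, an ordinary function passed to filter works just as well (an illustrative alternative; the name has_count_above_one is made up):

def has_count_above_one(pair):
    word, count = pair          # unpack the (word, count) tuple inside the body
    return count > 1

s = wc.filter(has_count_above_one)
print(s.collect())
# [('is', 2), ('the', 2)]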