TypeError: <lambda>() missing 1 required positional argument: 'y'

Question

I am running the following Spark code in Jupyter and getting this error.

import re

def normalizewords(text):
    return re.compile(r'\W+',re.UNICODE).split(text.lower())

inputs = sc.textFile('Book.txt')
words = inputs.flatMap(normalizewords)
# wordscount = words.countByValue()
wordcount = words.map(lambda x :(x,1)).reduceByKey(lambda x,y : (x+y))
sortedwords = wordcount.map(lambda x,y: (y,x)).sortByKey()
sortedwords.collect()

The output of wordcount looks like this:

[('self', 111),
 ('employment', 75),
 ('building', 33),
 ('an', 178),
 ('internet', 26),
 ('business', 383),
 ('of', 970),
 ('one', 100)]

What I want to get first is the following:

[(111, 'self'),
 (75, 'employment')]

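I have tried every variation of lambda x,y : y,x I could think of, but nothing works. If I put the (x,y) on the right-hand side in parentheses, it gives an invalid syntax error.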

Answer 1

sortedwords = wordcount.sortByKey()

That is all you need, with no extra lambda.
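For context on the error itself: RDD.map passes each element to the function as a single argument, so a (word, count) pair arrives as one tuple and a two-parameter lambda is left without its y. Python 3 also does not allow unpacking a tuple in a lambda's parameter list. The sketch below reproduces this with the built-in map on a plain list, an assumption made so that it runs without Spark; the names pairs, error_message and swapped are illustrative only.

```python
# Plain-Python stand-in for the (word, count) pairs held in `wordcount`.
pairs = [('self', 111), ('employment', 75)]

# Like RDD.map, the built-in map() hands each element over as ONE argument,
# so a two-parameter lambda raises the TypeError from the question.
try:
    list(map(lambda x, y: (y, x), pairs))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Accept a single parameter and index into the tuple instead.
swapped = list(map(lambda kv: (kv[1], kv[0]), pairs))

print(error_message)
print(swapped)
```

On the RDD, the same idea would presumably read wordcount.map(lambda kv: (kv[1], kv[0])).sortByKey(), which yields (count, word) pairs ordered by count.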

UPD. I think you can use this. But why not use a DataFrame?

sortedwords = wordcount.sortBy(lambda x: (x[1]), ascending=False).sortBy(lambda x: (x[0]), ascending=True).map(lambda x: (x[1], x[0]))
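One caveat, as far as I can tell: each sortBy re-sorts the whole RDD, and the words are unique after reduceByKey, so the second sortBy alone decides the final order and the sort by count is lost. If the goal is to order by count with ties broken by word, a single composite key may be closer to the intent. The sketch below shows the idea with sorted() on a plain list so that it runs without Spark; the ('example', 75) entry is a made-up tie added for illustration, and the names ordered and result are hypothetical.

```python
# Sample (word, count) pairs from the question, plus one hypothetical
# entry ('example', 75) to show how a tie on the count is broken.
pairs = [('self', 111), ('employment', 75), ('building', 33),
         ('an', 178), ('example', 75)]

# One composite key: count descending (negated), then word ascending.
ordered = sorted(pairs, key=lambda kv: (-kv[1], kv[0]))

# Swap each pair into (count, word) form after sorting.
result = [(count, word) for word, count in ordered]

print(result)
```

With an RDD, the equivalent under the same assumption would be wordcount.sortBy(lambda kv: (-kv[1], kv[0])).map(lambda kv: (kv[1], kv[0])).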

lambda

apache-spark

rdd

pyspark