How to use map() to convert (key,values) pair to values only in Pyspark
I have this code in PySpark:
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']
wordsRDD = sc.parallelize(wordsList, 4)
wordPairs = wordsRDD.map(lambda w: (w, 1))  # pair each word with a count of 1
wordCounts = wordPairs.reduceByKey(lambda x, y: x + y)
print(wordCounts.collect())
# PRINTS --> [('rat', 2), ('elephant', 1), ('cat', 2)]
from operator import add
totalCount = (wordCounts
.map(<< FILL IN >>)
.reduce(<< FILL IN >>))
#SHOULD PRINT 5
# (wordCounts.values().sum()) does the trick, but I want to do this with map() and reduce()
I need to use a reduce() action to sum the counts in wordCounts and then divide by the number of unique words.
But first I need to map() the RDD wordCounts, which consists of (key, value) pairs, to an RDD of just the values.
This is where I'm stuck. I tried approaches like the ones below, but none of them work:
.map(lambda x:x.values())
.reduce(lambda x:sum(x)))
AND,
.map(lambda d:d[k] for k in d)
.reduce(lambda x:sum(x)))
Any help would be appreciated!
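For reference, the whole computation being asked for (sum the counts, then divide by the number of unique words) can be sketched in plain Python over the collected pairs, with no Spark needed:

```python
from functools import reduce

# Collected (key, value) pairs, as printed by wordCounts.collect()
word_counts = [('rat', 2), ('elephant', 1), ('cat', 2)]

# map() step: keep only the counts; reduce() step: sum them
total = reduce(lambda x, y: x + y, [pair[1] for pair in word_counts])

# Divide by the number of unique words to get the average count
average = total / len(word_counts)

print(total)    # 5
print(average)  # 1.6666666666666667
```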
Finally found the answer; here it is:
wordCounts
.map(lambda x: x[1])
.reduce(lambda x, y: x + y)
Yes, the lambda function in .map takes a tuple x as its argument and returns its second element via x[1] (index 1 of the tuple). You can also take the tuple's components as parameters and return the second one, like this:
.map(lambda (x,y) : y)
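Note that `lambda (x, y): y` relies on tuple-parameter unpacking, which was removed in Python 3 (PEP 3113); `operator.itemgetter` is a version-independent alternative. A plain-Python sketch of the same map/reduce over the collected pairs:

```python
from functools import reduce
from operator import itemgetter

pairs = [('rat', 2), ('elephant', 1), ('cat', 2)]

# itemgetter(1) extracts the second element of each tuple,
# replacing the Python 2-only lambda (x, y): y
values = list(map(itemgetter(1), pairs))   # [2, 1, 2]
total = reduce(lambda x, y: x + y, values)

print(total)  # 5
```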
Mr Tompsett, I also got this to work:
from operator import add
total = (wordCounts
         .map(lambda x: x[1])
         .reduce(add))
Instead of map-reduce, you can also use aggregate, which should be faster:
In [7]: x = sc.parallelize([('rat', 2), ('elephant', 1), ('cat', 2)])
In [8]: x.aggregate(0, lambda acc, value: acc + value[1], lambda acc1, acc2: acc1 + acc2)
Out[8]: 5
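The reason aggregate can be faster is that it first folds values inside each partition (the seqOp), then merges the per-partition results (the combOp). A plain-Python simulation of those two steps, assuming a hypothetical two-partition split of the pairs:

```python
from functools import reduce

# Hypothetical split of the pairs across two partitions
partitions = [[('rat', 2), ('elephant', 1)], [('cat', 2)]]

zero = 0
seq_op = lambda acc, value: acc + value[1]   # folds (key, value) pairs within a partition
comb_op = lambda acc1, acc2: acc1 + acc2     # merges the per-partition sums

per_partition = [reduce(seq_op, part, zero) for part in partitions]  # [3, 2]
result = reduce(comb_op, per_partition, zero)

print(result)  # 5
```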