Spark: find the total number of elements across all tuples, ignoring keys

Question

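I have tuples like the ones below and want to count the total number of elements. I know that countByKey() gives the number of elements per key, and that distinct().countByKey() gives the number of distinct elements per key.

But the answer I want is 5, because there are 5 elements in total across both value lists.

Is there a quick way to do this?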

('http://www.google.com/base/feeds/snippets/11448761432933644608', ['spanish', 'vocabulary']), 
('http://www.google.com/base/feeds/snippets/8175198959985911471', ['topics', 'presents', 'museums'])

Answer 1

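If your dataset is stored in an RDD, you only need two steps: a transformation and a reduction. In the code below, I use map to convert each tuple into an integer (the length of its value list), and then use reduce to sum those integers over all records.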

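# Assumes an existing SparkContext named sc (for example, the one the pyspark shell provides).
rdd = sc.parallelize([('http://www.google.com/base/feeds/snippets/11448761432933644608', ['spanish', 'vocabulary']),
                      ('http://www.google.com/base/feeds/snippets/8175198959985911471', ['topics', 'presents', 'museums'])])
# Map each (key, values) tuple to the length of its value list, then add the lengths together.
rdd.map(lambda x: len(x[1])).reduce(lambda x, y: x + y)
# returns 5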
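
As a possible variation on the map/reduce approach above, here is a minimal sketch of two alternatives that, as far as I know, also return 0 for an empty RDD, whereas reduce raises an error when there is nothing to reduce. The function names total_elements_sum and total_elements_flat are hypothetical helpers introduced only for illustration; both assume a pair RDD whose values are lists, like the one in the question.

```python
def total_elements_sum(rdd):
    """Sum the lengths of the value lists in a (key, list) pair RDD."""
    # map turns each (key, values) record into len(values);
    # RDD.sum() then adds the lengths and yields 0 when the RDD is empty.
    return rdd.map(lambda kv: len(kv[1])).sum()


def total_elements_flat(rdd):
    """Count every element of every value list, ignoring the keys."""
    # flatMap emits each list element as its own record;
    # count() then counts those records.
    return rdd.flatMap(lambda kv: kv[1]).count()


# Usage, assuming the rdd built in the answer above:
#   total_elements_sum(rdd)
#   total_elements_flat(rdd)
# Both should give 5 for the two sample tuples (2 + 3 elements).
```

The first version stays closest to the original answer and only swaps reduce for sum. The second avoids computing lengths at all, which may read more naturally if you later want to filter or deduplicate the individual elements before counting.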

python

tuples

key

apache-spark

pyspark