PySpark: length of an element and how to use it later

Question

I have a dataset of words, and I am trying to keep only the words that are longer than 6 characters:

data=dataset.map(lambda word: word,len(word)).filter(len(word)>=6)

Then:

print data.take(10)

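It returns all the words, including the first 3, whose length is less than 6. I don't actually want to print them; I want to keep working with the data whose length is greater than 6.

So, once I have the right dataset, I would like to be able to select the data I need, for example the words with a length of less than 15, and run computations on them.

Or even apply a function to "word".

Any ideas?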

Answer 1

What you want is something like this (untested):

data=dataset.map(lambda word: (word,len(word))).filter(lambda t : t[1] >=6)

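In the map, you return a tuple of (word, length of word), and the filter looks at the length (l) of each tuple and keeps only the (w, l) pairs where l is greater than or equal to 6.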
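For the follow-up part of the question (selecting lengths below 15 and applying a function to the word), here is a minimal sketch along the same lines. The helper names `with_length`, `keep_by_length` and `apply_to_words` are made up for illustration, and `rdd` is assumed to be an RDD of plain strings; only its `map` and `filter` methods are used. Note that both `map` and `filter` need a function as their argument, which is probably why the attempt in the question misbehaves: `lambda word: word, len(word)` is read as two separate arguments, and `len(word) >= 6` is evaluated once, right away, instead of once per element.

```python
def with_length(word):
    """Pair a word with its length, e.g. 'banana' -> ('banana', 6)."""
    return (word, len(word))


def keep_by_length(rdd, min_len=6, max_len=None):
    """Return (word, length) pairs with min_len <= length < max_len.

    `rdd` is assumed to be an RDD of strings. `max_len` is an optional,
    exclusive upper bound; None means no upper bound.
    """
    pairs = rdd.map(with_length)
    pairs = pairs.filter(lambda t: t[1] >= min_len)
    if max_len is not None:
        pairs = pairs.filter(lambda t: t[1] < max_len)
    return pairs


def apply_to_words(pairs, func):
    """Apply `func` to the word of each (word, length) pair."""
    return pairs.map(lambda t: func(t[0]))
```

With the `dataset` from the question, `keep_by_length(dataset, 6, 15)` should give the (word, length) pairs you can keep working on, and something like `apply_to_words(keep_by_length(dataset, 6, 15), str.upper).take(10)` would run a function over just the words. Nothing is computed until an action such as `take` or `collect` is called.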
