Idiomatic way to use Clojure to get the set of unique words in a vector of strings

Question

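I'm new to Clojure, so please forgive the stupidity below... but I'm trying to split a vector of strings on spaces and then collect all the unique strings from the resulting vector of vectors into a single sequence (I'm not picky about the type of sequence). Here's the code I tried.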

(require '[clojure.string :as str])
(require '[clojure.set :as set])
(def documents ["this is a cat" "this is a dog" "woof and a meow"])
(apply set/union (map #(str/split % #" ") documents))

I expected this to return a set of unique words, i.e.

#{"woof" "and" "a" "meow" "this" "is" "cat" "dog"}

But it returns a vector of non-unique words, i.e.

["woof" "and" "a" "meow" "this" "is" "a" "cat" "this" "is" "a" "dog"]

Eventually I just wrapped the whole thing in a call to set, i.e.

(set (apply set/union (map #(str/split % #" ") documents)))

and got what I wanted:

#{"dog" "this" "is" "a" "woof" "and" "meow" "cat"}

But I don't quite understand why this is the case. According to the docs, the union function returns a set. So why am I getting a vector?

Second question: an alternative approach is

(distinct (apply concat (map #(str/split % #" ") documents)))

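This also returns what I want, albeit as a list rather than a set. But some of the discussion on this prior SO question suggests that concat is unusually slow, possibly slower than the set operations (?).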

Is that so... and are there any other reasons to prefer one of these approaches (or a third approach)?

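I don't really care whether I get a vector or a set out the other end, but I will eventually care about performance. I'm trying to learn Clojure by actually producing something useful for my text-mining habit, so this code will ultimately become part of a workflow for processing large amounts of text data efficiently... which means the time to get it right, both performance-wise and in terms of just generally not being stupid, is now.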

Thanks!

Answer 1

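clojure.set/union operates on sets, but you've given it something else: the result of str/split is a vector of strings, not a set. union does not check the types of its arguments; it simply adds the elements of one collection to another, so when handed vectors it gives back a vector with the duplicates still in it.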
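To make the difference concrete, here is a small sketch comparing union on real sets with union on vectors. The input data is made up for illustration, and the vector result reflects how union happens to behave in current Clojure releases; passing non-sets is outside what the function is documented to support, so it should not be relied on.

```clojure
(require '[clojure.set :as set])

;; With real sets, union deduplicates as documented.
(def set-result (set/union #{"a" "cat"} #{"a" "dog"}))
(set? set-result)      ; => true
(count set-result)     ; => 3

;; With vectors, no error is raised, but the result is a vector
;; and the duplicate "a" survives.
(def vec-result (set/union ["a" "cat"] ["a" "dog"]))
(vector? vec-result)   ; => true
(count vec-result)     ; => 4

;; Turning each piece into a set first restores the expected behaviour.
(def fixed-result (apply set/union (map set [["a" "cat"] ["a" "dog"]])))
(set? fixed-result)    ; => true
```

This is also why wrapping the whole expression in set appeared to fix things in the question: the vector was deduplicated only at the very end.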

(set (mapcat #(str/split % #" ") documents)) should give you what you need.

mapcat performs a lazy "map and concatenate" operation. set then converts that sequence into a set, discarding duplicates along the way.
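As a runnable sketch of the suggestion above, using the documents vector from the question: the second form is not from the answer but a possible variant that uses the transducer arity of mapcat with into, which avoids building an intermediate lazy sequence. The names unique-words and unique-words-xf, and the #"\s+" pattern (which also tolerates runs of whitespace), are assumptions made for illustration.

```clojure
(require '[clojure.string :as str])

(def documents ["this is a cat" "this is a dog" "woof and a meow"])

;; mapcat splits each document and concatenates the pieces;
;; set then drops the duplicates.
(def unique-words
  (set (mapcat #(str/split % #" ") documents)))

;; Variant (assumption, not from the answer): a mapcat transducer
;; pours the words straight into a set.
(def unique-words-xf
  (into #{} (mapcat #(str/split % #"\s+")) documents))

(count unique-words)              ; => 8
(= unique-words unique-words-xf)  ; => true
```

Whether the transducer form is measurably faster depends on the data size, so it is worth benchmarking on real input before choosing between the two.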

vector

clojure

set