text2vec word embeddings: compound some tokens but not all

I am using {text2vec} word embeddings to build a dictionary of similar terms that belong to a particular semantic category.

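Is it possible to compound some tokens in the corpus, but not all? For example, I want to compute terms similar to "future generation" or "rising generation", but these collocations of course appear as separate tokens in the original corpus. I am wondering whether it is bad practice to gsub "rising generation" --> "rising_generation" without also joining all the other terms that frequently occur together (for example, "climate change").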

Thanks!

Yes, that's fine. It may or may not work exactly the way you want, but it's worth a try.

You may want to look at the collocations in text2vec, which can automatically detect and join phrases for you. You can certainly join phrases on top of that if you want. In Gensim in Python I would use the Phrases class for the same thing.
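
To make the manual join concrete, here is a minimal sketch of compounding only a hand-picked set of phrases. It is written in Python with the standard library only, and it is the same idea as the gsub call in the question rather than anything from text2vec or Gensim. The phrase list and the function name `compound_selected` are made up for illustration.

```python
import re

# Hypothetical list: the only phrases we want to treat as single tokens.
PHRASES = ["future generation", "rising generation"]


def compound_selected(text, phrases=PHRASES):
    """Join only the listed phrases with underscores; leave other text alone."""
    # Longest phrases first, so a longer phrase wins over a shorter overlap.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    # Replace the whitespace inside each matched phrase with an underscore.
    return pattern.sub(lambda m: re.sub(r"\s+", "_", m.group(0)), text)


doc = "The rising generation worries about climate change."
print(compound_selected(doc))
```

Phrases that are not in the list, such as "climate change", stay as two separate tokens. Run this before tokenizing, and make sure the tokenizer does not split on underscores afterwards.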

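Given that training word vectors usually doesn't take very long, it's best to try different techniques and see which one works better for your goal.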

nlp

tokenize

word-embedding

text2vec