Tokenizing the data properly in gensim

Question

I am a bit confused about how to properly tokenize data in gensim. I have a text file, myfile.txt, which contains the following text:

""" 
this is a very long string with a title


and some white space. Multiple sentences, too. This is nuts!
Yay! :):):) 
"""

I load this file in gensim with LineReader('myfile.txt') to train a word2vec model (of course, my data is much bigger than the example above).

But is this text properly tokenized? I ask because LineReader seems to be very specific:

The format of files (either text, or compressed text files) in the path is one sentence = one line, with words already preprocessed and separated by whitespace. see https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence

I am confused. Am I doing this right? How should I tokenize my text for LineReader?

Thanks!

Answer 1

That will work, but because Gensim's LineSentence class (which I assume is what you mean) splits tokens on whitespace, your line...

and some white space. Multiple sentences, too. This is nuts!

...will become the list of word-tokens:

['and', 'some', 'white', 'space.', 'Multiple', 
'sentences,', 'too.', 'This', 'is', 'nuts!']

That means tokens like 'space.', 'sentences,', and 'nuts!' will be treated as words, and may even receive trained word-vectors (if they appear at least min_count times).
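To see the effect without loading gensim, here is a minimal sketch that approximates the whitespace splitting with plain str.split(). It is a stand-in for illustration, not LineSentence itself, and the variable names (line, tokens, glued) are my own.

```python
# Approximate whitespace-only tokenization with str.split().
# This is an illustrative stand-in, not gensim's LineSentence.
line = "and some white space. Multiple sentences, too. This is nuts!"

tokens = line.split()

# Tokens that still carry punctuation glued to a word.
glued = [t for t in tokens if not t.isalnum()]

print(tokens)
print(glued)
```

Each entry in glued would count as its own vocabulary item, separate from the clean form of the same word.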

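That's probably not what you want, but it's also not necessarily a big problem. In a large enough corpus, all the words you care about will also appear many times without this attached-punctuation issue, so you may still get good vectors.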

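More commonly, though, you'd preprocess the text to either strip the punctuation or separate it from the words with extra space delimiters. (When you do the latter, the punctuation marks themselves become 'words' of a sort.)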
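Both preprocessing options can be sketched with the standard re module. The helper names strip_punctuation and split_off_punctuation are hypothetical, and the regular expressions are one simple choice among many (for example, \w also matches digits and underscores), so treat this as a starting point rather than a recommended pipeline.

```python
import re

def strip_punctuation(line):
    """Option 1: keep only runs of word characters, dropping punctuation."""
    return re.findall(r"\w+", line)

def split_off_punctuation(line):
    """Option 2: keep punctuation, but as tokens separate from the words."""
    return re.findall(r"\w+|[^\w\s]", line)

line = "and some white space. Multiple sentences, too. This is nuts!"

stripped = strip_punctuation(line)
separated = split_off_punctuation(line)

# Re-join with single spaces to get a whitespace-delimited line again.
print(" ".join(stripped))
print(" ".join(separated))
```

Writing the re-joined lines back to a text file, one sentence per line, gives input in the whitespace-delimited format that the LineSentence documentation describes.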

python

gensim