Sentence2vec and Word2vec involving stop words and Named Entities

Question

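I am working on an NLP project that involves sentence2vec. My plan is to use pre-trained word embeddings to convert tokens into vectors and then build sentence embeddings from them.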

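My sentences contain contractions such as can't, won't, and aren't, which NLTK's tokenizer splits into {ca, wo, are} + n't.
I can't reduce them like that, and I don't want to remove them as stop words, because the sentences below should get different embeddings.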

My name is Priyank
My name is not Priyank
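
To make the concern concrete, here is a minimal sketch, assuming the sentence embedding is simply the average of its word vectors. The three-dimensional vectors and the names `TOY_VECTORS`, `STOPLIST`, and `sentence_vector` are made up for illustration; they do not come from any pre-trained model.

```python
# Toy word vectors, invented for illustration only (not from a real model)
TOY_VECTORS = {
    "my": [0.1, 0.0, 0.2],
    "name": [0.3, 0.1, 0.0],
    "is": [0.0, 0.2, 0.1],
    "not": [-0.5, 0.4, 0.3],
    "priyank": [0.2, 0.6, 0.1],
}

# Hypothetical stoplist that (unhelpfully) contains the negation
STOPLIST = {"my", "is", "not"}


def sentence_vector(sentence, stoplist=frozenset()):
    """Average the vectors of the tokens that are not in the stoplist."""
    tokens = [t for t in sentence.lower().split() if t not in stoplist]
    vectors = [TOY_VECTORS[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


positive = "My name is Priyank"
negative = "My name is not Priyank"

# With "not" removed as a stop word, both sentences get the same vector
same = sentence_vector(positive, STOPLIST) == sentence_vector(negative, STOPLIST)

# With every token kept, the vectors differ
different = sentence_vector(positive) != sentence_vector(negative)

print(same, different)
```

Under this averaging assumption, dropping "not" leaves both sentences with the tokens "name" and "priyank", so their embeddings collapse into one.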

Another important question is how to incorporate named entities (for example, a person's name such as Mark K. Hogg) into my sentence vectors.
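
One possible approach, sketched here under assumptions, is to merge each multi-token name into a single token before looking up vectors. The `KNOWN_ENTITIES` list and the `merge_entities` helper are hypothetical; in practice the entity spans might come from a named entity recognizer.

```python
# Hypothetical list of known entities; in practice it might come from an NER tool
KNOWN_ENTITIES = [["Mark", "K.", "Hogg"]]


def merge_entities(tokens, entities=KNOWN_ENTITIES):
    """Join each known multi-token entity into one underscore-joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        for entity in entities:
            # Skip empty entries so the position always advances
            if entity and tokens[i:i + len(entity)] == entity:
                merged.append("_".join(entity))
                i += len(entity)
                break
        else:
            # No entity starts at this position; keep the token unchanged
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "I met Mark K. Hogg yesterday".split()
print(merge_entities(tokens))
```

A merged token such as `Mark_K._Hogg` may well be missing from a pre-trained vocabulary, so a fallback is still needed, for instance averaging the vectors of its parts or using a generic placeholder vector. Which fallback works better depends on the task.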

Answer 1

You can remove the words you don't want treated as stop words from this list.

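from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

# Assumption: the list above is NLTK's English stop word list; drop the negations you want to keep
keep = {'not', 'no', "aren't", "won't", "can't"}
stoplist = set(stopwords.words('english')) - keep

# Open a file and read it into memory
with open('words.txt') as file:
    text = file.read()

# Apply the stoplist to the text
clean = [word for word in text.split() if word.lower() not in stoplist]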
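
Before applying a stoplist like the one above, it may help to expand the contractions so that the negation survives as the separate word "not" instead of being split into fragments. The following is a rough sketch, not a complete solution: `IRREGULAR` and `expand_negations` are illustrative names, the mapping is hand-written, and only straight apostrophes are handled.

```python
import re

# Small, hand-written mapping for irregular contractions; extend as needed
IRREGULAR = {"can't": "can not", "won't": "will not", "shan't": "shall not"}


def expand_negations(text):
    """Rewrite n't contractions so the negation becomes the word 'not'."""
    def replace(match):
        # The matched word is lowercased, so "Can't" becomes "can not"
        lowered = match.group(0).lower()
        if lowered in IRREGULAR:
            return IRREGULAR[lowered]
        # Regular case: "aren't" -> "are not", "isn't" -> "is not"
        return lowered[:-3] + " not"

    return re.sub(r"\b\w+n't\b", replace, text)


print(expand_negations("I can't go and they aren't coming"))
```

With "not" kept out of the stoplist, the expanded text keeps its negation through the filtering step.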

python

nlp

word2vec

sentence-similarity