Text representations: How to differentiate between strings of similar topic but opposite polarities?

Question

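I have been clustering a corpus by computing the tf-idf of its sentences and grouping together those whose similarity weight, checked with a gensim model, exceeds some threshold.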

tfidf_dic = DocSim.get_tf_idf()
ds = DocSim(model,stopwords=stopwords, tfidf_dict=tfidf_dic)
sim_scores = ds.calculate_similarity(source_doc, target_docs)

The problem is that, even with a high threshold, sentences of similar topic but opposite polarity still get clustered together, like this:

Here is an example of the similarity weights obtained between "don't like it" & "i like it"

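Is there any other method, library, or alternative model that can differentiate polarity effectively, by assigning such pairs a very low similarity or opposite vectors?

The goal is for "i like it" and "dont like it" to end up in different clusters.

PS: Pardon me if there are any conceptual errors, as I am rather new to NLP. Thank you in advance!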

Answer 1

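The problem is in how you represent the documents. Tf-idf is good for representing long documents, where keywords play a more important role. Here, it is probably the idf part of tf-idf that disregards polarity: negation particles such as "no" or "not" appear in most documents, so they will always get a low weight.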
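To make the idf argument concrete, here is a minimal sketch. It assumes the textbook formula log(N / df) and a made-up toy corpus; gensim's own tf-idf weighting may use a different log base or normalization, so only the relative ordering of the weights matters here.

```python
import math

# Hypothetical toy corpus, invented purely for illustration.
corpus = [
    "i do not like it",
    "i like it",
    "it is not bad",
    "this is not what i ordered",
    "the battery life is great",
]

def idf(term, docs):
    """Textbook inverse document frequency: log(N / df)."""
    df = sum(1 for doc in docs if term in doc.split())
    return math.log(len(docs) / df)

# "not" occurs in 3 of 5 documents, "battery" in only 1,
# so the negation word receives the smaller weight.
for term in ("not", "battery"):
    print(term, idf(term, corpus))
```

The more sentences in the corpus contain a negation word, the closer its idf gets to zero, and the less it can push "i like it" and "dont like it" apart.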

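I would suggest trying some neural embeddings that might capture the polarity. If you want to stay with Gensim, you can try doc2vec, but you would need quite a lot of training data for that. If you don't have much data to estimate the representation, I would use some pre-trained embeddings.

One option is to average word embeddings (you can load FastText embeddings in Gensim). Alternatively, if you want a stronger model, you can try BERT or another large pre-trained model from the Transformers package.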
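A rough sketch of the averaging idea follows. The helper names `average_embedding` and `cosine` and the 3-dimensional toy vectors are all invented for illustration; in practice `vectors` would be a pre-trained lookup such as a Gensim `KeyedVectors` object, which supports the same `word in vectors` and `vectors[word]` access used here. Whether averaged embeddings actually separate the two polarities depends on the embeddings, so treat this as something to test rather than a guaranteed fix.

```python
import numpy as np

def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy 3-d vectors, invented for illustration only (not real embeddings).
toy_vectors = {
    "i": np.array([0.1, 0.0, 0.2]),
    "like": np.array([0.9, 0.1, 0.0]),
    "it": np.array([0.0, 0.2, 0.1]),
    "dont": np.array([-0.8, 0.3, 0.0]),
}

a = average_embedding("i like it".split(), toy_vectors, dim=3)
b = average_embedding("dont like it".split(), toy_vectors, dim=3)
print(cosine(a, b))
```

Unlike tf-idf, this representation does not down-weight a word just because it is frequent, so a negation word contributes to the sentence vector as much as any other token.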

Answer 2

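Unfortunately, simple text representations based only on bags of words don't distinguish this sort of grammar-driven reversal of meaning very well.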
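As a small illustration of that limitation, the sketch below computes cosine similarity over raw word counts (no tf-idf weighting, whitespace tokenization). The function name `bow_cosine` is made up for this example.

```python
from collections import Counter
import math

def bow_cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    sq_a = sum(v * v for v in ca.values())
    sq_b = sum(v * v for v in cb.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0

# Two of three words are shared, so the pair still looks similar.
print(bow_cosine("i like it", "dont like it"))

# Word order is discarded entirely: these two strings are indistinguishable.
print(bow_cosine("dog bites man", "man bites dog"))
```

A single differing token can only lower the score by a bounded amount, and any meaning carried by word order or syntax is invisible to the representation.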

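To model it, a method needs to be sensitive to meaningful phrases and to hierarchical, grammar-driven dependencies between words.

Deeper neural networks using convolutional/recurrent techniques, or methods that tree-model sentence structure, do better.

For ideas, see for example...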

"Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank"

...or a more recent summary presentation...

"Representations for Language: From Word Embeddings to Sentence Meanings"

nlp

cluster-analysis

similarity

gensim