How do I discover a list of words from a corpus that distinguish it from another corpus? Python

Question

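I have two lists of unstructured text input, and I want to find the words that distinguish listA from listB. For example, if listA were the text of "Harry Potter" and listB were the text of "Ender's Game", the distinguishing elements for listA would be [wand, magic, wizard, . . .] and the distinguishing elements for listB would be [ender, buggers, battle, . . .]

I've tried a bit with the python-nltk module, and am able to easily find the most common words in each list, but that is not exactly what I'm after.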

Answer 1

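You can do this with synsets. To get synsets, NLTK includes a very powerful library called wordnet.

WordNet is a large 'database' (for lack of a better word) of human language; it covers not only English but many other languages as well.

A synset is like the set of related ideas you get when you hear a term. It is almost a synonym, but not as strict. See the link for a better definition.

Synset closures are what will help you most. For example, 'bee' is an animal, an insect, a living thing; Harry Potter is fictional, human, a wizard.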

from nltk.corpus import wordnet as wn
dog = wn.synset('dog.n.01')
hyper = lambda s: s.hypernyms()
list(dog.closure(hyper))

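Here's a book that teaches the basics of NLTK. It is not very good, but it is a good place to start, along with the NLTK HOWTOs.

If you want something more in-depth I can't help you; I don't know most of the definitions and functions that NLTK provides, but synsets are a good place to start.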

Answer 2

I've tried a bit with the python-nltk, and am able to easily find the most common words in each list, but not exactly what I'm after

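I'm guessing you mean that it came up with words like "and", "the", "of" as the most frequent ones. Those words aren't very helpful; they are basically the glue that holds words together to form a sentence. You can remove them, but you'll need a list of "useless" words, called a stop word list. NLTK has such a list: from nltk.corpus import stopwords.

You might want to take a look at TF-IDF scoring. It gives a higher weight to words that are common in one document but uncommon in general. Usually you'd use a large corpus to calculate which words are common in general.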
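
As a rough illustration of the stop word idea, here is a minimal sketch that counts words after dropping the "glue". The tiny STOP_WORDS set, the sample sentence, and the helper name content_word_counts are made up for this example; with NLTK installed and its stopwords data downloaded, you would more likely build the set from stopwords.words('english').

```python
import re
from collections import Counter

# Hypothetical miniature stop word list, for illustration only.
# With NLTK you would probably use: set(stopwords.words('english'))
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "his", "at", "he"}

def content_word_counts(text):
    """Count lowercase word tokens that are not in STOP_WORDS."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

# Made-up sample text standing in for listA.
text_a = "The wizard raised his wand and the wand glowed with magic"
counts = content_word_counts(text_a)
print(counts.most_common(3))
```

This still only gives the most frequent words per text, so words that are frequent in both texts will show up in both lists.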
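
To make the TF-IDF idea concrete, here is a minimal sketch in plain Python that treats each text as one document. It uses the simplest textbook form (term frequency times log(N / document frequency)); the two sample sentences and the helper names tokenize, tf_idf, and top_words are assumptions for this example, not part of NLTK. With only two documents, any word that appears in both gets a score of 0, which is exactly the "not distinguishing" case.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(docs):
    """Return one {word: score} dict per document.

    tf  = count of the word in the document / number of tokens in the document
    idf = log(number of documents / number of documents containing the word)
    """
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per word
    scores = []
    for c in counts:
        total = sum(c.values())
        scores.append({w: (n / total) * math.log(n_docs / doc_freq[w])
                       for w, n in c.items()})
    return scores

def top_words(score, k=3):
    """Return the k highest-scoring words."""
    return sorted(score, key=score.get, reverse=True)[:k]

# Made-up sample texts standing in for listA and listB.
list_a = "the wizard waved the wand and the magic of the wand filled the room"
list_b = "the fleet began the battle and the buggers met the fleet in the battle room"
scores_a, scores_b = tf_idf([list_a, list_b])
print(top_words(scores_a), top_words(scores_b))
```

Note that function words appearing in only one of these tiny samples (such as "of" or "in") still receive a positive score, which is why a large background corpus, as suggested above, gives better estimates of which words are common in general.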


python

nlp

nltk