How do I pull key words (not most frequent words) out of a corpus using Python and NLTK?

Question

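I am trying to extract keywords from a text or corpus. These are not the most frequent words but the words that are most "about" the text. I have an example to compare against, and the list I generate is very different from the example list. Can you give me a pointer toward generating a good list of keywords, one that does not include low-meaning words like "thou" and "tis"?

I am using "Romeo and Juliet" as my text. My approach (see Scott & Tribble below) is to compare R&J with Shakespeare's complete plays and pull out the words that occur proportionally more often in R&J than in the complete plays. This should weed out common words like "the", but in my code it does not.

I get a lot of words like "thou", "she", and "tis", which don't appear on their list, and I don't get words like "banished" and "churchyard". I do get "romeo", "juliet", "capulet", and "nurse", so I'm at least close to the right track, if not actually on it.

Here is the function that pulls the words and percentages from the text: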

from nltk import FreqDist

def keywords(corpus, threshold=0):
    """Generate a list of possible keywords and their percentage of occurrences.

    corpus (list): tokens of a text or collection of texts
    threshold (int): min # of occurrences of a word in the corpus
        (target text uses threshold 3, reference corpus uses 0)
    return percentKW: list of tuples (word, percent), percent as a string
    """
    # frequency distribution: key is word, value is # of occurrences
    fdist = FreqDist(corpus)

    # total number of tokens
    n = len(corpus)

    # list of (word, count) tuples meeting the threshold, most common first
    t = [(k, v) for k, v in fdist.items() if v >= threshold]
    t = sorted(t, key=lambda tup: tup[1], reverse=True)

    # return list of tuples (word, percent of total tokens the word makes up)
    percentKW = [(k, '%.2f' % (100 * (v / n))) for k, v in t]
    return percentKW

Here is the key part of the calling code. targetKW is R&J and refcorpKWDict is the complete Shakespeare plays.

# iterate through text list of tuples
for w, p in targetKW:
    # for each word, store the percent in KWList
    targetPerc = float(p)
    refcorpPerc = float(refcorpKWDict.get(w, 0))
    # if % in text > % in reference corpus
    if (refcorpPerc or refcorpPerc == 0) and (targetPerc > refcorpPerc):
        diff = float('%.2f'%(targetPerc - refcorpPerc))
        # save result to KWList
        KWList.append((w, targetPerc, refcorpPerc, diff))

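Here is what I have tried so far:

Normalized all potential keywords to lower case (helped)
Created short custom lists of keywords (text and comparison text). Seemed to work, but didn't tell me anything
Compared R&J against an abridged list of the plays, the plays plus the sonnets, and the Brown corpus (didn't help)
Checked the percentages for potential keywords such as "banished". The percentages were much lower than expected. I am not sure how to interpret that.
Set a minimum length for potential keywords to eliminate words like "ll" and "is" (helped)
Googled the problem (couldn't find anything)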

I am using Python 3.5.2 on Windows 10, with IDLE version 3.5.2.

Source: In "Natural language processing with Python" (http://www.nltk.org/book/), exercise 4.24 is "Read up on 'keyword linkage' (chapter 5 of (Scott & Tribble, 2006)). Extract keywords from NLTK's Shakespeare Corpus and using the NetworkX package, plot keyword linkage networks." I am working through the book on my own for professional development at work. The 2006 book referred to is "Textual patterns: key words and corpus analysis in language education" (especially pages 58-60).

Thank you for your time.

Answer 1

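Two techniques that might be useful (and that might stray from the direction of the book) are term frequency-inverse document frequency (commonly TFIDF) weighted terms... and collocations.

TFIDF is used to determine the important words in a document compared with a larger corpus of similar documents. It is often used as a preliminary step in machine learning for automatic classification (sentiment analysis, etc.).

TFIDF essentially looks at the whole corpus of plays and assigns a value to each word on a per-play basis, according to how important the word is within that play, weighted by how important the term is across the whole corpus. So ideally you would 'fit' a TFIDF model to the entire corpus of Shakespeare's plays (including Romeo and Juliet) and then 'transform' Romeo and Juliet into a series of word scores. You would then look for the highest-scoring terms, which are the terms most important to Romeo and Juliet in the context of all of Shakespeare's plays.

Some TFIDF guides I found useful...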

https://buhrmann.github.io/tfidf-analysis.html

http://www.ultravioletanalytics.com/2016/11/18/tf-idf-basics-with-pandas-scikit-learn/

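Collocations are available in NLTK and are fairly easy to implement. Collocation looks for phrases and words that often occur together. These are also often useful for indicating what a text is 'about'. http://www.nltk.org/howto/collocations.html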
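
As a rough sketch of the collocation idea, the snippet below scores bigrams with NLTK's collocation finder. The token list is a made-up stand-in for a tokenized, lowercased play, and the frequency cutoff of 2 and the choice of PMI as the scoring measure are assumptions for illustration, not settings recommended by the book.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy token list standing in for a tokenized, lowercased play (an assumption).
tokens = ("friar laurence met romeo and friar laurence met juliet "
          "and the nurse met romeo").split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)

# Ignore bigrams that occur fewer than 2 times; rare pairs inflate PMI.
finder.apply_freq_filter(2)

# Keep the 3 bigrams with the highest pointwise mutual information.
top_bigrams = finder.nbest(bigram_measures.pmi, 3)
print(top_bigrams)
```

On a real play you would pass the full token list and probably raise the frequency cutoff, since PMI tends to favour pairs that are rare but always occur together.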

If you're interested in either technique, I'm happy to help with the code.

Answer 2

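I was already brushing up on TF-IDF for a project I'm working on, so here we go. The code itself needs essentially no Pandas or Numpy functions, though I highly recommend Pandas, since it is my go-to for managing data. You will need Scikit Learn for the TFIDF vectorization. If you don't have it yet, you will need to install it first. It looks like just using pip install scikit-learn[alldeps] should do the trick, but personally I use Anaconda, which has everything pre-installed, so I haven't dealt with that side of things. I've broken down the process of finding the important terms in Romeo and Juliet step by step. The walkthrough contains more steps than strictly necessary, to explain what each object contains, but the full code with only the necessary steps is at the bottom.

Step by step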

from sklearn.feature_extraction.text import TfidfVectorizer

# Two sets of documents
# plays_corpus contains all documents in your corpus *including Romeo and Juliet*
plays_corpus = ['This is Romeo and Juliet','this is another play','and another','and one more']

#romeo is a list that contains *just* the text for Romeo and Juliet
romeo = [plays_corpus[0]] # must be in a list even if only one object

# Initialise your TFIDF Vectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Now create a model by fitting the vectorizer to your main plays corpus. This is essentially an array of TFIDF scores.
model =  tfidf_vectorizer.fit_transform(plays_corpus)

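If you're curious, this is what the array looks like. Each row represents a document in the corpus, while each column is a unique term, in alphabetical order. In this case each row wraps across two lines, and the terms are ['and', 'another', 'is', 'juliet', 'more', 'one', 'play', 'romeo', 'this'].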

tfidf_vectorizer.fit_transform(plays_corpus).toarray()
array([[ 0.33406745,  0.        ,  0.41263976,  0.52338122,  0.        ,
         0.        ,  0.        ,  0.52338122,  0.41263976],
       [ 0.        ,  0.46580855,  0.46580855,  0.        ,  0.        ,
         0.        ,  0.59081908,  0.        ,  0.46580855],
       [ 0.62922751,  0.77722116,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.41137791,  0.        ,  0.        ,  0.        ,  0.64450299,
         0.64450299,  0.        ,  0.        ,  0.        ]])
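
To make those numbers less opaque, here is a minimal pure-Python sketch that recomputes the first row by hand. It assumes scikit-learn's default TfidfVectorizer settings (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row) and uses a simplified lowercase-and-split tokenizer, which happens to give the same tokens for this toy corpus. The helper name tfidf_row is made up for this example.

```python
import math
from collections import Counter

docs = ['This is Romeo and Juliet', 'this is another play',
        'and another', 'and one more']

# Simplified tokenizer (an assumption): lowercase, split on whitespace.
tokenized = [doc.lower().split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
       for term, count in df.items()}

def tfidf_row(tokens):
    """Return {term: weight} for one document, L2-normalized."""
    counts = Counter(tokens)
    raw = {term: counts[term] * idf[term] for term in counts}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

romeo_weights = tfidf_row(tokenized[0])
for term in sorted(romeo_weights):
    print(term, round(romeo_weights[term], 8))
```

The printed weights should agree with the non-zero entries in the first row of the array above. 'and' scores lowest because it appears in three of the four documents, while 'romeo' and 'juliet' appear in only one.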

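Next we create a list, terms, of all the unique terms (this is how I knew the unique terms above).

# scikit-learn >= 1.0; older releases used get_feature_names() instead
terms = tfidf_vectorizer.get_feature_names_out().tolist()

We now have our main model of tfidf scores, which scores each term in each document separately according to its importance within its immediate context (the document) and the larger context (the corpus).

To find the scores for the terms in Romeo and Juliet specifically, we now .transform that document using our model.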

romeo_scored = tfidf_vectorizer.transform(romeo) # note .transform NOT .fit_transform

This again creates an array, but with only one row (since there is only one document).

romeo_scored.toarray()
array([[ 0.33406745,  0.        ,  0.41263976,  0.52338122,  0.        ,
         0.        ,  0.        ,  0.52338122,  0.41263976]])

We can easily convert this array into a list of scores.

# we first view the object as an array, 
# then flatten it as the array is currently like a list in a list.
# Then we transform that array object into a simple list object.
scores = romeo_scored.toarray().flatten().tolist()

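We now have a list of the terms in the model and a list of scores for each term, specific to Romeo and Juliet. Helpfully, these are in the same order, which means we can put them together in a list of tuples.

data = list(zip(terms, scores))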

# Which looks like
[('and', 0.3340674500232949),
 ('another', 0.0),
 ('is', 0.41263976171812644),
 ('juliet', 0.5233812152405496),
 ('more', 0.0),
 ('one', 0.0),
 ('play', 0.0),
 ('romeo', 0.5233812152405496),
 ('this', 0.41263976171812644)]

Now we just need to sort by score to get the top items.

# Here we sort the data using 'sorted',
# we choose to provide a sort key,
# our key is lambda x: x[1]
# x refers to the object we're processing (data)
# and [1] specifies the second part of the tuple - the score.
# x[0] would sort by the first part - the term.
# reverse = True switches from Ascending to Descending order.

sorted_data = sorted(data, key=lambda x: x[1],reverse=True)

Which, after all that, finally gives us...

[('juliet', 0.5233812152405496),
 ('romeo', 0.5233812152405496),
 ('is', 0.41263976171812644),
 ('this', 0.41263976171812644),
 ('and', 0.3340674500232949),
 ('another', 0.0),
 ('more', 0.0),
 ('one', 0.0),
 ('play', 0.0)]

You can limit this to the top N terms by slicing the list.

sorted_data[:3]
[('juliet', 0.5233812152405496),
 ('romeo', 0.5233812152405496),
 ('is', 0.41263976171812644)]

Full code

from sklearn.feature_extraction.text import TfidfVectorizer

# Two sets of documents
# plays_corpus contains all documents in your corpus *including Romeo and Juliet*
plays_corpus = ['This is Romeo and Juliet','this is another play','and another','and one more']

#romeo is a list that contains *just* the text for Romeo and Juliet
romeo = [plays_corpus[0]] # must be in a list even if only one object

# Initialise your TFIDF Vectorizer object
tfidf_vectorizer = TfidfVectorizer()

# Now create a model by fitting the vectorizer to your main plays corpus, this creates an array of TFIDF scores
model = tfidf_vectorizer.fit_transform(plays_corpus)

romeo_scored = tfidf_vectorizer.transform(romeo) # note - .transform NOT .fit_transform

# scikit-learn >= 1.0; older releases used get_feature_names() instead
terms = tfidf_vectorizer.get_feature_names_out().tolist()

scores = romeo_scored.toarray().flatten().tolist()

data = list(zip(terms,scores))

sorted_data = sorted(data,key=lambda x: x[1],reverse=True)

sorted_data[:5]

Answer 3

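The problem with your code is that you are too permissive about what you accept as a "keyword": any word that occurs even slightly more often in your text than in the reference corpus will be treated as a keyword. Logically, this should net you about half of the words that have no special status in the text.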

if (refcorpPerc or refcorpPerc == 0) and (targetPerc > refcorpPerc):
    # accept it as a "key word"

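To make the test more selective, choose a larger threshold or use a smarter measure such as the "out of rank measure" (google it), and/or rank the candidate keywords and keep only the top of the list, i.e. the ones with the greatest increase in relative frequency.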
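
The ranking suggestion could be sketched roughly as follows. This is only one possible reading of it: the function name rank_keywords, the min_count and top_n parameters, the add-one smoothing for words missing from the reference corpus, and the toy token lists are all assumptions made for illustration, and the score is a plain ratio of relative frequencies rather than the "out of rank measure" mentioned above.

```python
from collections import Counter

def rank_keywords(target_tokens, reference_tokens, min_count=3, top_n=10):
    """Rank words by how much more frequent they are in the target text."""
    target_counts = Counter(target_tokens)
    reference_counts = Counter(reference_tokens)
    n_target = len(target_tokens)
    n_reference = len(reference_tokens)
    vocab_size = len(set(target_counts) | set(reference_counts))

    scored = []
    for word, count in target_counts.items():
        if count < min_count:
            continue  # too rare in the target to be judged reliably
        target_rate = count / n_target
        # Add-one smoothing so words absent from the reference
        # corpus do not cause a division by zero.
        reference_rate = (reference_counts[word] + 1) / (n_reference + vocab_size)
        scored.append((word, target_rate / reference_rate))

    # Largest relative increase first; keep only the top of the list.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Toy data standing in for R&J and the reference corpus.
target = "romeo loves juliet and juliet loves romeo and the nurse".split()
reference = ("the king and the queen and the fool and the duke "
             "romeo and the nurse and the friar").split()

top = rank_keywords(target, reference, min_count=2, top_n=3)
print(top)
```

Here "and" is frequent in the target but even more frequent in the reference, so its ratio falls below 1 and it drops out of the top of the list, which is the behaviour the original comparison was missing.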
