Python concordance command in NLTK

Question

I have a question about the concordance command in Python's NLTK. First, I tried a simple example:

from nltk.book import *

text1.concordance("monstrous")

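It works fine. Now I have my own .txt file and would like to run the same command on it. I have a list called "textList" and want to find the word "CNA", so I entered the command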

textList.concordance('CNA')

However, I got the error

AttributeError: 'list' object has no attribute 'concordance'.

Isn't text1 a list in the example? I would like to know what is going on here.

Answer 1

.concordance() is a special NLTK function, so you can't call it on just any Python object (like your list).

More specifically, .concordance() is a method of NLTK's Text class.

Basically, if you want to use .concordance(), you have to instantiate a Text object first and then call the method on that object.
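To make the difference concrete, here is a minimal sketch. The sample sentence is made up for illustration and stands in for the asker's textList; splitting on whitespace is just a shortcut to get a list of tokens without any extra NLTK data.

```python
from nltk.text import Text

# Hypothetical sample tokens standing in for the asker's textList.
tokens = "The CNA report says the CNA index rose again".split()

# A plain list has no concordance method; calling it is what raises
# the AttributeError from the question.
print(hasattr(tokens, "concordance"))

# Wrapping the same tokens in nltk.text.Text makes the method available.
text = Text(tokens)
text.concordance("CNA")
```

The list itself is unchanged; Text simply wraps a sequence of tokens and adds the analysis methods on top of it.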

Text

A Text is typically initialized from a given document or corpus. E.g.:
import nltk.corpus  
from nltk.text import Text  
moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))

.concordance()

concordance(word, width=79, lines=25)

Print a concordance for word with the specified context window. Word matching is not case-sensitive.
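To illustrate what the documentation above describes, here is a rough pure-Python sketch of a concordance. It is not NLTK's implementation: the function name simple_concordance and the way the context is trimmed are assumptions made for this example. It only mirrors the idea of the word, width and lines parameters and the case-insensitive matching.

```python
def simple_concordance(tokens, word, width=79, lines=25):
    """Return up to `lines` strings showing `word` with context on both sides."""
    target = word.lower()
    # Characters of context kept on each side of the matched word.
    half = max(1, (width - len(word) - 2) // 2)
    results = []
    for i, tok in enumerate(tokens):
        if tok.lower() != target:  # case-insensitive match
            continue
        left = " ".join(tokens[:i])[-half:]
        right = " ".join(tokens[i + 1:])[:half]
        # Right-align the left context so the matched words line up.
        results.append(f"{left:>{half}} {tok} {right}")
        if len(results) >= lines:
            break
    return results


sample = "A monstrous whale and a Monstrous wave".split()
for line in simple_concordance(sample, "monstrous", width=30):
    print(line)
```

Returning a list instead of printing directly is a choice made here so the result is easy to inspect; NLTK's concordance() prints its output.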

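So I imagine something like this would work (untested). Note that nltk.corpus.gutenberg only reads files that ship with the Gutenberg corpus, so for your own file you would have to read and tokenize it yourself before passing the tokens to Text.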

import nltk.corpus  
from nltk.text import Text  
textList = Text(nltk.corpus.gutenberg.words('YOUR FILE NAME HERE.txt'))
textList.concordance('CNA')
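For a file of your own, one possible approach is sketched below. It assumes a UTF-8 text file; the temporary file and its contents are invented here only so that the snippet runs on its own, and wordpunct_tokenize is chosen because it is regex-based and needs no additional NLTK data downloads (word_tokenize typically requires the punkt tokenizer data to be installed first).

```python
import os
import tempfile

from nltk.text import Text
from nltk.tokenize import wordpunct_tokenize

# Create a small throwaway file so the sketch is runnable; in practice,
# set `path` to the location of your own .txt file instead.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as f:
    f.write("The CNA index rose. Analysts expect the CNA to rise further.")
    path = f.name

# Read the raw text from disk.
with open(path, encoding="utf-8") as f:
    raw = f.read()

# Tokenize, wrap the tokens in a Text object, then print the concordance.
tokens = wordpunct_tokenize(raw)
text_list = Text(tokens)
text_list.concordance("CNA")

os.remove(path)  # clean up the throwaway file
```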

Answer 2

I got it working with this code:

import sys
from nltk.tokenize import word_tokenize
from nltk.text import Text

def main():
    # Exit early if no file path was given on the command line.
    if len(sys.argv) < 2:
        return
    # Read the whole text file.
    with open(sys.argv[1], "r") as f:
        text = f.read()
    # Split the raw text into tokens, then wrap them in a Text object.
    tokens = word_tokenize(text)
    textList = Text(tokens)
    textList.concordance('is')
    print(tokens)


if __name__ == '__main__':
    main()

Based on this site.

Answer 3

In a Jupyter notebook (or a Google Colab notebook), the full process is: MS Word file --> text file --> an NLTK object:

from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.text import Text

import docx2txt

myTextFile = docx2txt.process("/mypath/myWordFile")
tokens = word_tokenize(myTextFile)
print(tokens)
textList = Text(tokens)
textList.concordance('contract')

