Word Concordance Program

Question

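I originally posted this question here, but was then told to post it on Code Review; there, however, I was told that my question needed to be posted here. I will try to explain my problem better so that, hopefully, there is no confusion. I am trying to write a word-concordance program that will do the following: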

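1) Read the stop_words.txt file into a dictionary (use the same type of dictionary that you are timing) containing only stop words, called stopWordDict. (WARNING: strip the newline ('\n') character from the end of each stop word before adding it to stopWordDict.)

2) Process the WarAndPeace.txt file one line at a time to build the word-concordance dictionary (called wordConcordanceDict), containing the "main" words as keys, with a list of their associated line numbers as values.

3) Traverse wordConcordanceDict alphabetically by key to generate a text file containing the concordance words printed in alphabetical order along with their corresponding line numbers.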
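For reference, steps 1 and 3 above could be sketched roughly as follows. This is only a minimal sketch: the helper names `load_stop_words` and `write_concordance` are made up for illustration, and both take already-open file objects so the idea can be tried in memory without the real files.

```python
import io


def load_stop_words(stop_file):
    """Step 1: read one stop word per line into a dict, stripping the newline."""
    stop_word_dict = {}
    for line in stop_file:
        word = line.rstrip("\n").lower()
        if word:  # skip blank lines
            stop_word_dict[word] = None  # only the keys matter
    return stop_word_dict


def write_concordance(concordance, out_file):
    """Step 3: write each word and its line numbers, sorted by key."""
    for word in sorted(concordance):
        numbers = " ".join(str(n) for n in concordance[word])
        out_file.write(f"{word}: {numbers}\n")


# Tiny in-memory demonstration (hypothetical data, not the real files).
stop_words = load_stop_words(io.StringIO("a\nabout\nbe\n"))
out = io.StringIO()
write_concordance({"file": [1, 4], "data": [1, 4], "bigger": [4]}, out)
print(out.getvalue(), end="")
```

With real files, the same functions would presumably be called with the objects returned by `open("stop_words.txt")` and `open(..., "w")` for whatever output file name the assignment requires.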

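I tested my program on a small file with a short list of stop words, and it worked correctly (an example is provided below). The outcome was what I expected: a list of the main words with their line numbers, not including the words from the stop_words_small.txt file. The only difference between the small file I tested and the main file I am actually trying to process is that the main file is much longer and contains punctuation. The problem I am running into is that when I run my program on the main file, I get far more results than expected, because the punctuation is not being removed from the text.

For example, below is a section of the output where my code counted the word Dmitri as four separate words because of the different capitalization and the punctuation that follows the word. If my code removed the punctuation correctly, Dmitri would be counted as one word, followed by all the line numbers where it was found. My output also separates uppercase and lowercase words, so my code is not converting the text to lowercase either.

What my code currently displays: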

Dmitri : [2528, 3674, 3687, 3694, 4641, 41131]

Dmitri! : [16671, 16672]

Dmitri, : [2530, 3676, 3685, 13160, 16247]

dmitri : [2000]

What my code should display:

dmitri : [2000, 2528, 2530, 3674, 3676, 3685, 3687, 3694, 4641, 13160, 16671, 16672, 41131]

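Words are defined to be sequences of letters delimited by any non-letter. There should also be no distinction between uppercase and lowercase letters, but my program splits those apart as well. Blank lines, however, are to be counted in the line numbering.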
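This definition maps fairly directly onto a regular expression. A minimal sketch, assuming that only ASCII letters count as letters (the helper name `tokenize` is for illustration and is not part of the assignment):

```python
import re


def tokenize(line):
    """Return the lowercase words in a line, where a word is a run of letters."""
    # Lowercase first so "Dmitri" and "dmitri" become the same key, then
    # keep only runs of a-z; anything else (spaces, punctuation, digits)
    # acts as a separator.
    return re.findall(r"[a-z]+", line.lower())


print(tokenize('"Dmitri!" cried Dmitri, (to dmitri).'))
# ['dmitri', 'cried', 'dmitri', 'to', 'dmitri']
print(tokenize("your word-concordance program."))
# ['your', 'word', 'concordance', 'program']
```

Under this reading, a contraction such as "don't" splits into "don" and "t", and accented letters act as separators; whether that is acceptable depends on how strictly the definition is meant to be applied.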

Below is my code. I would appreciate it if anyone could take a look at it and give me feedback on what I am doing wrong. Thank you in advance.

import re

def main():
    stopFile = open("stop_words.txt","r")
    stopWordDict = dict()

    for line in stopFile:
        stopWordDict[line.lower().strip("\n")] = []

    hwFile = open("WarAndPeace.txt","r")
    wordConcordanceDict = dict()
    lineNum = 1

    for line in hwFile:
        wordList = re.split(" |\n|\.|\"|\)|\(", line)
        for word in wordList:
            word.strip(' ')
            if (len(word) != 0) and word.lower() not in stopWordDict:
                if word in wordConcordanceDict:
                    wordConcordanceDict[word].append(lineNum)
                else:
                    wordConcordanceDict[word] = [lineNum]
        lineNum = lineNum + 1

    for word in sorted(wordConcordanceDict):
        print (word," : ",wordConcordanceDict[word])


if __name__ == "__main__":
    main()

As another example and for reference, here is the small file I tested with a short list of stop words, which worked perfectly.

stop_words_small.txt file

a, about, be, by, can, do, i, in, is, it, of, on, the, this, to, was

small_file.txt

This is a sample data (text) file to
be processed by your word-concordance program.

The real data file is much bigger.

Correct output

bigger: 4

concordance: 2

data: 1 4

file: 1 4

much: 4

processed: 2

program: 2

real: 4

sample: 1

text: 1

word: 2

your: 2

Answer 1

You can do it like this:

import re
from collections import defaultdict

wordConcordanceDict = defaultdict(list)

with open('stop_words_small.txt') as sw:
    words = (line.strip() for line in sw)
    stop_words = set(words)

with open('small_file.txt') as f:
    for line_number, line in enumerate(f, 1):
        words = (re.sub(r'[^\w\s]','',word).lower() for word in line.split())
        good_words = (word for word in words if word not in stop_words)
        for word in good_words:
            wordConcordanceDict[word].append(line_number)

for word in sorted(wordConcordanceDict):
    print('{}: {}'.format(word, ' '.join(map(str, wordConcordanceDict[word]))))

Output:

bigger: 4
data: 1 4
file: 1 4
much: 4
processed: 2
program: 2
real: 4
sample: 1
text: 1
wordconcordance: 2
your: 2
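
Note that this output has `wordconcordance` as a single entry, whereas the expected output in the question lists `concordance` and `word` separately: removing the hyphen from `word-concordance` glues the two halves together. A possible variation, sketched under the assumptions that a word is a run of ASCII letters and that the stop word file holds one word per line, extracts the letter runs directly. The function name `build_concordance` and the in-memory sample are for illustration only:

```python
import io
import re
from collections import defaultdict


def build_concordance(lines, stop_words):
    """Map each non-stop word to the list of line numbers it appears on."""
    concordance = defaultdict(list)
    for line_number, line in enumerate(lines, 1):  # blank lines still count
        # Lowercase the line, then treat every non-letter as a separator.
        for word in re.findall(r"[a-z]+", line.lower()):
            if word not in stop_words:
                concordance[word].append(line_number)
    return concordance


# In-memory copy of small_file.txt from the question.
sample = io.StringIO(
    "This is a sample data (text) file to\n"
    "be processed by your word-concordance program.\n"
    "\n"
    "The real data file is much bigger.\n"
)
stop_words = {"a", "about", "be", "by", "can", "do", "i", "in", "is", "it",
              "of", "on", "the", "this", "to", "was"}

concordance = build_concordance(sample, stop_words)
for word in sorted(concordance):
    print("{}: {}".format(word, " ".join(map(str, concordance[word]))))
```

On the sample this prints `concordance: 2` and `word: 2` as separate entries. A word that occurs twice on the same line gets that line number recorded twice; if that is unwanted, the append could be skipped when the last stored number already equals `line_number`.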

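I will add an explanation tomorrow; it is already late here ;). In the meantime, feel free to ask in the comments if any part of the code is unclear.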


Word & Line Concordance Program

python

punctuation