Text Comparing Program

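I'm writing a program that compares text files by returning a list of all the words that appear in the files, along with how many times each one appears. I have to ignore a list of words called stop words, so their occurrences are not counted. For the first part, I need to check whether a word is in the stop words. If it is, I don't count it. If it isn't, I create a new row for that word in the DataFrame (assuming it isn't already there) and increment its frequency by 1. There is one column for each text file. However, I'm a bit stuck on this part. I already have some code, but I need to fill in the gaps. Here is what I have so far: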

from tkinter.filedialog import askdirectory
import glob

import os 
import pandas as pd


def main():
    df = pd.DataFrame(columns =["TEXT FILE NAMES HERE..."])
    data_directory = askdirectory(initialdir = "/School_Files/CISC_121/Assignments/Assignment3/Data_Files")
    stopwords = open(os.getcwd() + "/" + "StopWords.txt") 



    text_files = glob.glob(data_directory + "/" + "*.txt")



    for f in text_files:
        infile = open(f, "r", encoding = "UTF-8")
        #now read the file and do all the word-counting etc...
        lines = infile.readlines()
        for line in lines:
            x = 0
            words = line.split()
            while (x < len(words)):
                """
                Check if the word is in the stopwords
                If it isn't, then add the word into a row in a dataframe, for the first occurrence, then
                increment the value by 1
                Have a column for each book 
                """
                for line in infile:
                    if word in line:
                        found = True
                        word +=1 
                    else:
                        found = False

                x = x+1

main()

I would really appreciate it if someone could help me complete this part. Please show the changes to the code. Thanks in advance!

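It looks like you just want to count word occurrences. For that, you can use a dictionary instead of a DataFrame.

For the stop words, read them into a list.

Try the code below.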

import os

stopwords = []
count_dictionary = {}

with open(os.getcwd() + "/" + "StopWords.txt") as f:
    stopwords = f.read().splitlines()

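# your code (opening each file and splitting each line into `words`)

while x < len(words):
    word = words[x]  # the word at the current position
    if word not in stopwords:
        if word in count_dictionary:
            count_dictionary[word] += 1
        else:
            count_dictionary[word] = 1
    x += 1  # advance to the next word, otherwise the loop never ends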
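
Since the question asks for one column per text file, here is a rough sketch of how the dictionary idea could be extended to keep a separate count per file. The function names (`load_stopwords`, `count_words`, `build_table`) are made up for illustration, and lowercasing every word before counting is my own assumption, not something the assignment states.

```python
from collections import Counter
from pathlib import Path


def load_stopwords(path):
    """Read one stop word per line into a set for fast membership tests."""
    with open(path, encoding="utf-8") as f:
        # Assumption: stop words are compared case-insensitively.
        return {line.strip().lower() for line in f if line.strip()}


def count_words(path, stopwords):
    """Count the non-stop words in one file, splitting on whitespace."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Assumption: words are lowercased before counting.
            for word in line.lower().split():
                if word not in stopwords:
                    counts[word] += 1
    return counts


def build_table(paths, stopwords):
    """Map each file name to its own {word: count} dictionary."""
    return {Path(p).name: dict(count_words(p, stopwords)) for p in paths}
```

With the names from the question, this could be called as `build_table(text_files, load_stopwords("StopWords.txt"))`. If a DataFrame is still required, `pd.DataFrame(table).fillna(0).astype(int)` should turn the nested dictionary into one row per word and one column per file, with 0 where a word does not occur in a file.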