How to lemmatize a .txt file rather than a sentence with pywsd.utils?

Question

I'm quite new to Python and am trying to learn basic text analysis, topic modeling, and so on.

I wrote the following code to clean my text file. I prefer the lemmatize_sentence() function from pywsd.utils to NLTK's WordNetLemmatizer() because it produces cleaner text. The following code works for sentences:

from nltk.corpus import stopwords
from pywsd.utils import lemmatize_sentence
import string

s = "Dew drops fall from the leaves. Mary leaves the room. It's completed. Hello. This is trial. We went home. It was easier. We drank tea. These are Demo Texts. Right?"

lemm = lemmatize_sentence(s)
print (lemm)

stopword = stopwords.words('english') + list(string.punctuation)
removingstopwords = [word for word in lemm if word not in stopword]
print (removingstopwords, file=open("cleaned.txt","a"))

What I have not managed to do is lemmatize a raw text file in a directory. I guess lemmatize_sentence() only accepts strings?

I managed to read the contents of a file with

with open ('a.txt',"r+", encoding="utf-8") as fin:
    lemm = lemmatize_sentence(fin.read())
print (lemm)

But this time the code failed to remove some tokens such as "n't", "'ll", "'s" or "‘", as well as punctuation, so the text is left uncleaned.

1) What am I doing wrong? Should I tokenize first? (I also failed to feed the tokenized result to lemmatize_sentence().)

2) How can I get the output file content without any formatting (words without single quotes and brackets)?

Any help is much appreciated. Thanks in advance.

Answer 1

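Simply apply the lemmatizer to the file line by line and append each result, followed by a newline, to a string. Essentially this does the same thing as your code, except that each line is processed separately and added to a temporary string, which is printed once the loop finishes. You can use that temporary string as your final output.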

my_temp_string = ""
with open ('a.txt',"r+", encoding="utf-8") as fin:
    for line in fin:
        lemm = lemmatize_sentence(line)
        my_temp_string += f'{lemm} \n'
print (my_temp_string)
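
The loop above still formats each result as a Python list, so the brackets and quotes from the second question remain, and leftover fragments such as "n't" or "'s" are not filtered. Below is a minimal sketch of one way to handle both, not a definitive fix. The names clean_file and EXTRA_NOISE are hypothetical, the list of fragments is an assumption about what the tokenizer leaves behind, and the lemmatizer is passed in as an argument (it is assumed to return a list of strings, as lemmatize_sentence does in the question).

```python
import string
from typing import Callable, Iterable, List

# Fragments a tokenizer tends to leave behind from contractions and curly
# quotes. This set is an assumption for illustration; extend it as needed.
EXTRA_NOISE = {"n't", "'ll", "'s", "'re", "'ve", "'d", "'m",
               "‘", "’", "“", "”", "``", "''"}


def clean_file(
    in_path: str,
    out_path: str,
    lemmatize: Callable[[str], List[str]],
    stopword: Iterable[str],
) -> None:
    """Lemmatize in_path line by line and write plain space-separated words."""
    # Lower-case the filter so the comparison below is case-insensitive.
    unwanted = {w.lower() for w in stopword} | set(string.punctuation) | EXTRA_NOISE
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # Map curly single quotes to straight ones so that "don’t"
            # is tokenized the same way as "don't".
            line = line.replace("’", "'").replace("‘", "'")
            if not line.strip():
                continue  # skip blank lines
            lemmas = lemmatize(line)  # expected to return a list of strings
            kept = [w for w in lemmas if w.lower() not in unwanted]
            if kept:
                # Joining the list avoids the brackets and quotes that
                # printing a list would produce.
                fout.write(" ".join(kept) + "\n")
```

With the names from the question, the call would look like clean_file('a.txt', 'cleaned.txt', lemmatize_sentence, stopwords.words('english')). Passing the lemmatizer in keeps the sketch independent of any one library, so it can be tried with a simple stand-in function first.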

python

normalizing

nltk

lemmatization

data-cleaning