Write a text file and go to a new line when there is punctuation

Question

I have a file named input.txt with the following structure:

Hi Mark, my name is Lucas! I was born in Paris in 1998.
I currently live in Berlin.

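My goal is to lowercase the text, remove digits and punctuation and replace them with \n (collapsing any repeated newlines), remove the stopwords, and write everything to a new file named output.txt. So, if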

stopwords = ['my', 'is', 'i', 'was', 'in'],

output.txt should be

hi mark
name lucas
born paris
currently live berlin

But if I use the code below

import re
import string

stopwords = ['my', 'is', 'i', 'was', 'in']
with open('input.txt', 'r', encoding='utf-8-sig') as file:
    new_file = open('output.txt', 'w', encoding='utf-8-sig')
    for line in file:
        corpus = line.lower()
        corpus = corpus.strip().replace('’', '\'')
        corpus = re.compile('[0-9{}]'.format(re.escape(string.punctuation))).sub('\n', corpus).replace('\n ', '\n').replace(' \n', '\n')
        corpus = re.sub(r'\n+', '\n', corpus).strip()
        corpus = ' '.join(w for w in corpus.split() if w not in stopwords) # (1)
        new_file.write(corpus)
        new_file.write('\n')
    new_file.close()

I get

hi mark name lucas born paris
currently live berlin

How can I fix this, ideally by changing only line (1) of the code?

Thanks for your help.

Answer 1

This code should do what you asked for:

import re


# STOPWORDS
stopwords = ["my", "is", "i", "was", "in"]

# The comprehension below builds one regex pattern per stopword.
# Each word must be bounded by whitespace or by the start/end of
# the text, which prevents matching letters contained within
# other words (for example the "i" in "paris").
stopwords = [rf"(?<!\S){re.escape(stopword)}(?!\S)" for stopword in stopwords]

# GET INPUT FROM FILE
with open("input.txt", "r", encoding="utf-8-sig") as input_txt:
    text = input_txt.read()

# FORMAT TEXT
text = text.replace("’", "'").lower()
# Everything that is not a letter or a space becomes a newline
text = re.sub(r"[^a-z\u00E0-\u00FF ]", "\n", text)
text = re.sub("|".join(stopwords), "", text)
text = re.sub(r"[ ]+", " ", text)
# Drop spaces on either side of a newline
text = re.sub(r"[ ]*\n[ ]*", "\n", text)
text = re.sub(r"\n+", "\n", text).strip()

# WRITE OUTPUT TO FILE
with open("output.txt", "w", encoding="utf-8-sig") as output_txt:
    output_txt.write(text)

Output

hi mark
name lucas
born paris
currently live berlin

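As you requested, this should now also handle the case where sentences are separated by a newline but by no punctuation or digits. The earlier problem was that the .split() we used to separate the words also removed the newlines. To get around that I had to use quite a few regular expressions, so this answer became more complex.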
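To make the .split() problem concrete, here is a minimal sketch that works on an in-memory string instead of the files. The sample `corpus` value is what the question's loop holds for the first input line just before line (1). The names `flattened`, `segments` and `fixed` are only illustrative. The second half shows one possible replacement for line (1) that keeps the loop as it is; it is a suggestion, not the only way to do it.

```python
stopwords = ['my', 'is', 'i', 'was', 'in']

# State of `corpus` for the first input line, right before line (1):
# punctuation and digits have already been turned into single '\n'.
corpus = 'hi mark\nmy name is lucas\ni was born in paris in'

# str.split() with no argument splits on ANY whitespace, newlines included,
# so re-joining with ' ' flattens all segments onto one line.
flattened = ' '.join(w for w in corpus.split() if w not in stopwords)
print(repr(flattened))

# Possible replacement for line (1): filter each '\n'-separated segment on
# its own, drop segments that end up empty, and re-join with '\n'.
segments = (
    ' '.join(w for w in segment.split() if w not in stopwords)
    for segment in corpus.split('\n')
)
fixed = '\n'.join(s for s in segments if s)
print(repr(fixed))
```

The first print shows everything on one line, the second keeps one segment per line, which is the layout the question asks for.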

Still, I hope it does what you need. Let me know whether it works, or if you need help understanding it.
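In case the stopword patterns are the unclear part, the sketch below tries the idea on its own. Lookarounds check the neighbouring character without consuming it, so a stopword is removed only when it stands alone. The negative forms `(?<!\S)` and `(?!\S)` used here are an assumption of this sketch: they behave like "whitespace on both sides" but also accept the very start and end of the text. The sample `text` is made up for illustration.

```python
import re

stopwords = ["my", "is", "i", "was", "in"]

# (?<!\S): not preceded by a non-whitespace character
# (?!\S):  not followed by a non-whitespace character
pattern = re.compile(
    "|".join(rf"(?<!\S){re.escape(word)}(?!\S)" for word in stopwords)
)

text = "i was born in paris in\ni live in berlin"

# Only standalone words match; the "is" in "paris" and the "in" in
# "berlin" are left alone because they have letters next to them.
matches = pattern.findall(text)
print(matches)

# Removing the matches leaves extra spaces behind, which is why the
# full answer collapses spaces and newlines afterwards.
removed = pattern.sub("", text)
print(repr(removed))
```

Here `matches` contains only the standalone stopwords, and "paris", "live" and "berlin" survive intact in `removed`.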
