Write a text file and start a new line wherever there is punctuation

I have a file named input.txt with the following content:

Hi Mark, my name is Lucas! I was born in Paris in 1998.
I currently live in Berlin.

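My goal is to lowercase the text, replace digits and punctuation with \n (collapsing any repeated newlines), remove the stopwords, and write the result to a new file named output.txt. So, if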

stopwords = ['my', 'is', 'i', 'was', 'in'],

then output.txt should be

hi mark
name lucas
born paris
currently live berlin

However, if I use the code below

import re
import string

stopwords = ['my', 'is', 'i', 'was', 'in']
with open('input.txt', 'r', encoding='utf-8-sig') as file:
    new_file = open('output.txt', 'w', encoding='utf-8-sig')
    for line in file:
        corpus = line.lower()
        corpus = corpus.strip().replace('’', '\'')
        corpus = re.compile('[0-9{}]'.format(re.escape(string.punctuation))).sub('\n', corpus).replace('\n ', '\n').replace(' \n', '\n')
        corpus = re.sub(r'\n+', '\n', corpus).strip()
        corpus = ' '.join(w for w in corpus.split() if w not in stopwords) # (1)
        new_file.write(corpus)
        new_file.write('\n')
    new_file.close()

I get

hi mark name lucas born paris
currently live berlin

How can I fix this, ideally by changing only line (1)?

Thanks for your help.

This code should do what you need:

import re


# STOPWORDS
stopwords = ["my", "is", "i", "was", "in"]

# The comprehension below builds a regex pattern for each word.
# A word only matches when it is not touching a non-whitespace
# character on either side, so stopwords at the very start or end
# of the text still match, while short stopwords contained within
# longer words (the "i" in "paris") do not.
stopwords = [rf"(?<!\S){re.escape(stopword)}(?!\S)" for stopword in stopwords]

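# GET INPUT FROM FILE
# utf-8-sig matches the encoding used in the question
with open("input.txt", "r", encoding="utf-8-sig") as input_txt:
    text = input_txt.read()

# FORMAT TEXT
text = re.sub("’", "'", text).lower()
# Anything that is not a letter or a space becomes a newline
text = re.sub(r"[^a-z\u00E0-\u00FF ]", "\n", text)
text = re.sub("|".join(stopwords), "", text)
# Collapse runs of spaces, then drop spaces on either side of a newline
text = re.sub(r"[ ]+", " ", text)
text = re.sub(r"[ ]*\n[ ]*", "\n", text)
text = re.sub(r"\n+", "\n", text).strip()

# WRITE OUTPUT TO FILE
with open("output.txt", "w", encoding="utf-8-sig") as output_txt:
    output_txt.write(text)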

Output

hi mark
name lucas
born paris
currently live berlin
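
The lookarounds around each stopword matter because a bare alternation would also delete letters inside longer words. Here is a minimal sketch of the difference. The sample string and the helper name `remove_stopwords` are made up for illustration and are not part of the code above; the pattern is similar in spirit to the one used there.

```python
import re

stopwords = ["my", "is", "i", "was", "in"]
sample = "i was born in paris"  # illustrative input, already lowercased

# Naive approach: a bare alternation also matches inside longer words,
# so "paris" and "in" get partially eaten.
naive = re.sub("|".join(stopwords), "", sample)

def remove_stopwords(text, words):
    """Hypothetical helper: remove only whitespace-delimited stopwords."""
    # A word matches only when no non-whitespace character touches it
    # on either side (start and end of the string also qualify).
    pattern = "|".join(rf"(?<!\S){re.escape(w)}(?!\S)" for w in words)
    return re.sub(pattern, "", text)

bounded = remove_stopwords(sample, stopwords)

print(repr(naive))
print(repr(bounded))
```

The bounded version leaves extra spaces where words were removed, which is why the space-collapsing steps still have to run afterwards.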

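As you requested, this should now also handle the case where sentences are separated by a newline but contain no punctuation or digits. The earlier problem was that the .split() used to separate the words also removes the newlines. To work around that I had to use quite a few regular expressions, so this answer has become more complex.

Still, I hope it meets your needs. Let me know whether it works, or if you need help understanding it.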
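
The `.split()` behaviour is easy to check, and it also suggests a smaller change to line (1) of the question's code: split on `\n` first, then filter the stopwords within each segment. This is a rough sketch, not the approach used in the code above. It assumes the string has already been lowercased and had its punctuation and digits replaced by `\n`, and the helper name `filter_segments` is made up for illustration.

```python
stopwords = ["my", "is", "i", "was", "in"]

# Assumed state of `corpus` just before line (1) in the question's loop
corpus = "hi mark\nmy name is lucas\ni was born in paris in"

# str.split() with no argument splits on any whitespace, newlines included,
# so joining with " " flattens everything onto one line.
flat = " ".join(w for w in corpus.split() if w not in stopwords)

def filter_segments(text, words):
    """Hypothetical helper: drop stopwords but keep the newline structure."""
    kept = []
    for segment in text.split("\n"):
        tokens = [w for w in segment.split() if w not in words]
        if tokens:  # skip segments that contained only stopwords
            kept.append(" ".join(tokens))
    return "\n".join(kept)

fixed = filter_segments(corpus, stopwords)

print(flat)
print(fixed)
```

In the question's loop, `corpus = filter_segments(corpus, stopwords)` would take the place of line (1).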