Write a text file and start a new line wherever there is punctuation
I have a file named input.txt with the following structure:
Hi Mark, my name is Lucas! I was born in Paris in 1998.
I currently live in Berlin.
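My goal is to lowercase the text, replace digits and punctuation with \n (dropping the redundant newlines), remove the stop words, and write the result to a new file named output.txt.
So, if stopwords = ['my', 'is', 'i', 'was', 'in'], output.txt should be: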
hi mark
name lucas
born paris
currently live berlin
But if I use the code below:
import re
import string

stopwords = ['my', 'is', 'i', 'was', 'in']
with open('input.txt', 'r', encoding='utf-8-sig') as file:
    new_file = open('output.txt', 'w', encoding='utf-8-sig')
    for line in file:
        corpus = line.lower()
        corpus = corpus.strip().replace('’', '\'')
        corpus = re.compile('[0-9{}]'.format(re.escape(string.punctuation))).sub('\n', corpus).replace('\n ', '\n').replace(' \n', '\n')
        corpus = re.sub(r'\n+', '\n', corpus).strip()
        corpus = ' '.join(w for w in corpus.split() if w not in stopwords) # (1)
        new_file.write(corpus)
        new_file.write('\n')
    new_file.close()
I get:
hi mark name lucas born paris
currently live berlin
How can I fix this, ideally by changing only line (1)?
Thanks for your help.
This code should do what you need:
import re

# STOPWORDS
stopwords = ["my", "is", "i", "was", "in"]
# The comprehension below builds a regex pattern for each word that
# only matches when the word has whitespace (or the start/end of the
# text) on both sides. This prevents matching lone letters contained
# within other words.
stopwords = [rf"(?<!\S){re.escape(stopword)}(?!\S)" for stopword in stopwords]

# GET INPUT FROM FILE
# utf-8-sig matches the encoding used in the question and drops a leading BOM
with open("input.txt", "r", encoding="utf-8-sig") as input_txt:
    text = input_txt.read()

# FORMAT TEXT
text = text.replace("’", "'").lower()
# Replace everything except letters and spaces with a newline
text = re.sub(r"[^a-z\u00E0-\u00FF ]", "\n", text)
# Remove the stop words
text = re.sub("|".join(stopwords), "", text)
# Collapse runs of spaces
text = re.sub(" +", " ", text)
# Collapse runs of newlines together with the spaces around them
text = re.sub(r" *\n[ \n]*", "\n", text).strip()

# WRITE OUTPUT TO FILE
with open("output.txt", "w", encoding="utf-8") as output_txt:
    output_txt.write(text)
Output:
hi mark
name lucas
born paris
currently live berlin
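The reason for guarding each stop word with lookarounds can be checked in isolation. The sketch below is illustrative only: the sample string and the names `bare` and `guarded` are assumptions, not part of the answer. It compares a bare alternation of the stop words with one that only matches a stop word standing on its own.

```python
import re

stopwords = ["my", "is", "i", "was", "in"]
sample = "this is in berlin"  # made-up sample text

# Bare alternation: also hits "is" inside "this" and "i" inside "berlin".
bare = re.compile("|".join(map(re.escape, stopwords)))

# Guarded alternation: a stop word only matches when it is not glued to
# another non-whitespace character on either side.
guarded = re.compile(
    "|".join(rf"(?<!\S){re.escape(w)}(?!\S)" for w in stopwords)
)

bare_result = bare.sub("", sample)
# Collapse the leftover spaces so the result is easy to read.
guarded_result = " ".join(guarded.sub("", sample).split())

print(repr(bare_result))     # the ordinary words get damaged
print(repr(guarded_result))  # 'this berlin'
```

The bare pattern damages ordinary words, while the guarded one removes only the standalone "is" and "in".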
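As you requested, this now also handles sentences that are separated by a newline with no punctuation or digits between them. The earlier problem was that .split(), which was used to separate the words, also discards the newlines. To get around that I had to use quite a few regular expressions, so this answer became more complex.
Still, I hope it meets your needs. Let me know whether it works or if you need help understanding it.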
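Since the question asks whether changing only line (1) would be enough, one possible alternative is to split on newlines first and filter the stop words inside each piece, so that `.split()` never sees a newline. This is a sketch under that assumption, not the method used in the answer above. The function name `drop_stopwords_per_line` is made up for illustration.

```python
stopwords = ["my", "is", "i", "was", "in"]


def drop_stopwords_per_line(corpus, stopwords):
    """Remove stop words from each newline-separated piece of corpus."""
    kept_lines = []
    for piece in corpus.split("\n"):
        # str.split() with no argument splits on any run of whitespace
        words = [w for w in piece.split() if w not in stopwords]
        if words:  # skip pieces that contained only stop words
            kept_lines.append(" ".join(words))
    return "\n".join(kept_lines)


# The value of corpus just before line (1) for the first input line.
corpus = "hi mark\nmy name is lucas\ni was born in paris in"

# Original line (1): split() treats "\n" like any other whitespace,
# so the line structure is lost.
flattened = " ".join(w for w in corpus.split() if w not in stopwords)

# Per-line variant: the newlines survive.
per_line = drop_stopwords_per_line(corpus, stopwords)

print(flattened)
print(per_line)
```

In the loop from the question, line (1) would then become `corpus = drop_stopwords_per_line(corpus, stopwords)`, with everything else left unchanged.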