How to reformat a string of sentences to be one sentence per line in Python
I have a file that is just one big string. In that string, some sentences end with 3 numbers, like this:
sees mouse . 1980 1 1 sheep erythrocytes mouse 1980 6 5 seen mouse 1980 8 8
I would like to change it so that the file/output looks like this:
sees mouse . 1980 1 1
sheep erythrocytes mouse 1980 6 5
seen mouse 1980 8 8
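This is the code I have been using to try to solve the problem:
with open('ngram_test') as f:
    for line in f:
        #print(line)
        for word in line.split():
            print(word)
However, this just prints every word in the string followed by a newline. Any help would be greatly appreciated!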
Using a regular expression, you can add a newline (\n) after each occurrence of the pattern:
import re
s = "sees mouse . 1980 1 1 sheep erythrocytes mouse 1980 6 5 seen mouse 1980 8 8"
pattern = r"(\d{4}\s\d{1,2}\s\d{1,2})"
# A single substitution handles every match, including repeated triples
s = re.sub(pattern, r"\1\n", s)
Output:
'sees mouse . 1980 1 1\n sheep erythrocytes mouse 1980 6 5\n seen mouse 1980 8 8\n'
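Note that each line after the first in the output above still starts with a space. A minimal sketch of one way to avoid that, assuming every sentence ends with a 4-digit year followed by two numbers of 1 or 2 digits (the function name `one_sentence_per_line` is hypothetical, not from the answer):

```python
import re

# Assumption: a sentence ends with "<4-digit year> <1-2 digits> <1-2 digits>".
# The trailing \s* swallows the whitespace that separated two sentences.
TRIPLE = re.compile(r"(\d{4}\s\d{1,2}\s\d{1,2})\s*")

def one_sentence_per_line(text):
    """Insert a newline after each numeric triple and drop the separator space."""
    return TRIPLE.sub(r"\1\n", text)

s = "sees mouse . 1980 1 1 sheep erythrocytes mouse 1980 6 5 seen mouse 1980 8 8"
print(one_sentence_per_line(s), end="")
```

The result could then be written to a file with an ordinary `open(..., 'w')` and `write()` call.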
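You need to use a regexp to find the indices of the desired substrings and then cut them out afterwards.
import re
s = "sees mouse . 1980 1 1 sheep erythrocytes mouse 1980 6 5 seen mouse 1980 8 8"
pattern = re.compile(r'[a-zA-Z.\s]+\d{4}\s+?\d{1,2}\s+?\d{1,2}')
# (start, end) index pair of each sentence
print([(m.start(0), m.end(0)) for m in pattern.finditer(s)])
This will work as long as the input is limited to strings like the one given in the question. If it is not, the pattern needs to be extended.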
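The index pairs can then be used to slice the sentences out of the string. A small sketch, assuming the same sample string and the same pattern as in this answer (the variable name `sentences` is illustrative):

```python
import re

s = "sees mouse . 1980 1 1 sheep erythrocytes mouse 1980 6 5 seen mouse 1980 8 8"
pattern = re.compile(r'[a-zA-Z.\s]+\d{4}\s+?\d{1,2}\s+?\d{1,2}')

# Slice each matched span out of the original string and trim the
# separator space that the character class picks up at the start.
sentences = [s[m.start():m.end()].strip() for m in pattern.finditer(s)]

for sentence in sentences:
    print(sentence)
```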
A simple regex should do:
import re
a = 'sees mouse . 1980 1 1 sheep erythrocytes mouse 1980 6 5 seen mouse 1980 8 8'
count = 0
# \d+ (rather than \d) also covers numbers with more than one digit
for i in re.finditer(r'\d+ \d+ \d+', a):
    print(a[count:i.end()].strip())
    count = i.end()
The code below looks for a sequence of 3 numbers.
Note that this is a beginner-level solution that does not involve regular expressions.
def is_int(value):
    try:
        int(value)
        return True
    except ValueError:
        return False

with open('54928944.txt', 'r') as f:
    numbers_counter = 0
    one_line_words = []
    line = f.read()
    # split() with no argument also discards the trailing newline
    words = line.split()
    for word in words:
        if is_int(word):
            numbers_counter += 1
        else:
            numbers_counter = 0
        one_line_words.append(word)
        # Three integers in a row mark the end of a sentence
        if numbers_counter == 3:
            print(' '.join(one_line_words))
            one_line_words = []
This code works:
import re
# \1 keeps the matched numbers; the space after them becomes a newline
print(re.sub(r'(\d{4} \d{1,2} \d{1,2}) ', r'\1\n', 'sees mouse . 1980 1 1 sheep erythrocytes mouse 1980 6 5 seen mouse 1980 8 8'))
To get the results in a list, you can use re.split() from the built-in re library.
>>> import re
>>> with open('ngram_test') as f:
...     s = f.read()
...
>>> splitted = re.split(r"\d*\s\d\s\d", s)
>>> splitted
['sees mouse . ', ' sheep erythrocytes mouse ', ' seen mouse ', '']
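Note that re.split() discards the text it matches, so the three numbers are missing from the list above. If they should be kept, one possible sketch uses re.findall() instead, assuming the year has 4 digits and the other two numbers have 1 or 2 digits (the function name `split_sentences` is hypothetical):

```python
import re

def split_sentences(text):
    """Return the sentences as a list, each one keeping its trailing numbers."""
    # .*? lazily takes everything up to and including the next numeric triple
    chunks = re.findall(r".*?\d{4}\s\d{1,2}\s\d{1,2}", text)
    return [chunk.strip() for chunk in chunks]

s = "sees mouse . 1980 1 1 sheep erythrocytes mouse 1980 6 5 seen mouse 1980 8 8"
print(split_sentences(s))
```

Since `.` does not match a newline by default, this sketch assumes the whole input sits on one line, as in the question.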