Removing "\n"s when printing sentences from a text file in Python?
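I am trying to print a list of sentences from a text file (one of the Project Gutenberg eBooks). When I print the file as a single string, it looks fine: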
file = open('11.txt','r+')
alice = file.read()
print(alice[:500])
The output is:
ALICE'S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'

So she was considering in her own mind (as well as she could, for the
hot d
Now, when I split it into sentences (the assignment specifically said to do this by "splitting at the periods," so it is a very simplified split), I get this:
>>> print(sentences[:5])
["ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3", '0\n\n\n\n\nCHAPTER I', " Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversations?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her", "\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seemed quite natural); but when the Rabbit actually TOOK A WATCH\nOUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,\nAlice started to her feet, for it flashed across her mind that she had\nnever before seen a rabbit with either a waistcoat-pocket, or a watch\nto take out of it, and burning with curiosity, she ran across the field\nafter it, and fortunately was just in time to see it pop down a large\nrabbit-hole under the hedge", '\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again']
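Where do the extra "\n" characters come from, and how do I remove them?
You may not want to use a regular expression, but this is what I would do: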
import re
new_sentences = []
for s in sentences:
    new_sentences.append(re.sub(r'\n{2,}', '\n', s))
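This should replace every run of two or more '\n' characters with a single newline, so you still have line breaks but no "extra" ones.
If you want to avoid creating a new list and modify the existing one instead (credit to @gavriel and Andrew L.: I didn't think of using enumerate when I first posted my answer):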
import re
for i, s in enumerate(sentences):
    sentences[i] = re.sub(r'\n{2,}', '\n', s)
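As a quick check of what this substitution does, here is a minimal sketch that uses a short hand-written list in place of the real `sentences` (the sample strings are made up for illustration and are not taken from the file):

```python
import re

# Hypothetical stand-in for the real `sentences` list.
sentences = ["Title\n\nAuthor\n\n\n\nCHAPTER I", " First line\nsecond line"]

# Collapse every run of two or more newlines into a single newline.
# Single newlines (ordinary line wraps) are left untouched.
collapsed = [re.sub(r'\n{2,}', '\n', s) for s in sentences]

print(collapsed)
# ['Title\nAuthor\nCHAPTER I', ' First line\nsecond line']
```

Note that the single "\n" inside the second string survives, so this only removes blank lines, not every line break.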
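The extra newlines are not really extra; they are supposed to be there and are visible in the text in your question: the more '\n' characters there are, the more space is visible between lines of text (i.e., one blank line between the chapter heading and the first paragraph, and many between the edition and the chapter heading).
This small example shows where the \n characters come from: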
alice = """ALICE'S ADVENTURES IN WONDERLAND

Lewis Carroll

THE MILLENNIUM FULCRUM EDITION 3.0




CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'

So she was considering in her own mind (as well as she could, for the
hot d"""
print(len(alice.split(".")))
print(len(alice.split("\n")))
It all depends on how you split the text. The example above gives the following output:
3
19
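This means that there are 3 substrings if you split the text on . and 19 substrings if you use \n as the separator. You can read more about str.split.
In your case you split the text on ., so the 3 substrings contain multiple newline characters \n. To get rid of them, you can either split these substrings again or simply use str.replace.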
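Both options can be sketched on a single fragment. This is only an illustration: the string `s` is a hypothetical element similar to the second item of `sentences`, and the split-and-join idiom is one possible way of "splitting again", not necessarily the one intended above.

```python
# Hypothetical fragment similar to one element of `sentences`.
s = "0\n\n\n\n\nCHAPTER I"

# Option 1: str.replace swaps each newline for a space, so a run of
# five newlines leaves five spaces behind.
replaced = s.replace("\n", " ")

# Option 2: split on any whitespace and re-join with single spaces,
# which collapses every run of whitespace into one space.
normalized = " ".join(s.split())

print(repr(replaced))    # '0     CHAPTER I'
print(repr(normalized))  # '0 CHAPTER I'
```

Replacing with an empty string instead of a space would glue words from adjacent lines together, so a space is usually the safer replacement.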
If you want to replace every run of newlines with a single space, do this:
import re
new_sentences = [re.sub(r'\n+', ' ', s) for s in sentences]
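The text uses both newlines and periods to separate sentences. One problem is that replacing newlines with the empty string leaves no space between words. Before splitting alice on '.', I would use something similar to @elethan's solution to replace every run of multiple newlines in alice with '.'. Then you can do alice.split('.'), and all the sentences separated by multiple newlines will be split properly along with the sentences originally separated by '.'.
Your only remaining problem is then the decimal point in the version number.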
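A rough sketch of that idea follows. The short `alice` string is a made-up stand-in for the contents of the file, and the final steps (turning the remaining single newlines into spaces, stripping, and dropping empty pieces) are additions for tidiness rather than part of the suggestion itself.

```python
import re

# Hypothetical short text standing in for the contents of 11.txt.
alice = ("TITLE\n\nAuthor\n\nEDITION 3.0\n\n\nCHAPTER I. Down the hole"
         "\n\nAlice was\ntired. She slept.")

# Turn paragraph breaks (two or more newlines) into periods so that
# headings end up as separate "sentences".
marked = re.sub(r'\n{2,}', '.', alice)

# Split at the periods, turn the remaining line wraps into spaces,
# and drop the empty pieces left by doubled periods.
pieces = [p.replace("\n", " ").strip() for p in marked.split(".")]
sentences = [p for p in pieces if p]

print(sentences)
# ['TITLE', 'Author', 'EDITION 3', '0', 'CHAPTER I', 'Down the hole',
#  'Alice was tired', 'She slept']
```

The 'EDITION 3' and '0' entries show the version-number problem: a plain split on '.' cannot tell a decimal point from the end of a sentence.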
file = open('11.txt','r+')
file.read().split('\n')