Create a regular expression for deleting whitespace after a newline in Python
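I would like to know how to create a regular expression that deletes the whitespace after a newline. For example, if my text looks like this: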
So she refused to ex-
    change the feather and the rock be-
    cause she was afraid.
How can I write something that gives me:
["so","she","refused","to","exchange", "the","feather","and","the","rock","because","she","was","afraid" ]
I tried using replace("-\n", "") to join the pieces back together, but I only get something like:
["be","cause"] and ["ex","change"]
Any suggestions? Thanks!!
import re
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''.lower()
s = re.sub(r'-\n\s*', '', s) # join hyphens
s = re.sub(r'[^\w\s]', '', s) # remove punctuation
print(s.split())
\s* means zero or more whitespace characters.
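To make the role of `\s*` concrete, here is a minimal sketch comparing the plain `str.replace` from the question with the regex above. The helper name `join_hyphenated` and the shortened, indented sample are assumptions made for illustration only.

```python
import re

def join_hyphenated(text):
    """Remove a hyphen, the newline after it, and any whitespace that follows."""
    return re.sub(r'-\n\s*', '', text)

# Assumed sample: the continuation lines start with spaces.
indented = "to ex-\n    change the rock be-\n    cause"

# Plain replace removes '-\n' but leaves the indentation behind,
# so split() still sees two separate tokens.
print(indented.replace('-\n', '').split())
# ['to', 'ex', 'change', 'the', 'rock', 'be', 'cause']

# The regex also consumes the whitespace after the newline.
print(join_hyphenated(indented).split())
# ['to', 'exchange', 'the', 'rock', 'because']
```

Because `\s*` also matches zero characters, the same pattern works when the continuation lines are not indented at all.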
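As far as I can tell, Alex Hall's answer addresses your question more fully (explicitly, because it uses a regex, and implicitly, because it adjusts the capitalization and removes the punctuation), but this problem jumped out at me as a good candidate for a generator.
Here, a generator joins tokens popped from a stack-like list: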
s = '''So she refused to ex-
change the feather and the rock be-
cause she was afraid.'''
def condense(lst):
    while lst:
        tok = lst.pop(0)
        if tok.endswith('-'):
            yield tok[:-1] + lst.pop(0)
        else:
            yield tok
print(list(condense(s.split())))
# Result:
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather',
# 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid.']
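As a possible variation on the generator above, the sketch below walks an iterator instead of popping from the front of the list. The name `condense_iter` is hypothetical. This version leaves the input list untouched and does not raise an error when the last token ends in a hyphen.

```python
def condense_iter(tokens):
    """Yield tokens, gluing a token that ends in '-' to the one after it."""
    it = iter(tokens)
    for tok in it:
        if tok.endswith('-'):
            # next() with a default avoids an error if the text ends in a hyphen.
            yield tok[:-1] + next(it, '')
        else:
            yield tok

text = "So she refused to ex-\nchange the feather and the rock be-\ncause she was afraid."
print(list(condense_iter(text.split())))
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather',
#  'and', 'the', 'rock', 'because', 'she', 'was', 'afraid.']
```

Like the original, this treats every token ending in `-` as a word broken across lines, so a genuine trailing hyphen would be joined as well.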
import re
s = s.replace('-\n', '')  # Remove each hyphen and the newline after it; str.replace returns a new string, so assign it
# If the continuation lines are indented, s now looks like 'So she refused to ex    change the feather and the rock be    cause she was afraid.'
s = re.sub(r'\s\s+', '', s)  # Replace runs of 2 or more whitespace characters with ''
# Now s looks like 'So she refused to exchange the feather and the rock because she was afraid.'
You could use an expression with an optional hyphen and a greedy whitespace match:
-?\n\s+
Each match needs to be replaced with nothing (an empty string); see a demo on regex101.com.
For the second part, I'd suggest nltk,
so that you end up with:
import re
from nltk import word_tokenize
string = """
So she refused to ex-
    change the feather and the rock be-
    cause she was afraid.
"""
rx = re.compile(r'-?\n\s+')
words = word_tokenize(rx.sub('', string))
print(words)
# ['So', 'she', 'refused', 'to', 'exchange', 'the', 'feather', 'and', 'the', 'rock', 'because', 'she', 'was', 'afraid', '.']
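One thing to keep in mind with `-?\n\s+`: because the hyphen is optional, the pattern also removes line breaks that do not follow a hyphen, which runs the neighbouring words together. The sketch below illustrates this on two assumed sample strings and shows one possible alternative, a hypothetical helper `normalize` that is not part of the answer above. It joins only hyphenated breaks and collapses all other whitespace to a single space.

```python
import re

rx = re.compile(r'-?\n\s+')

hyphenated = "the rock be-\n    cause she"
plain_break = "the rock\n    because she"

print(rx.sub('', hyphenated))   # the rock because she
print(rx.sub('', plain_break))  # the rockbecause she

def normalize(text):
    """Join words split by a hyphen at a line end, then collapse other whitespace."""
    text = re.sub(r'-\n\s*', '', text)         # glue hyphenated line breaks
    return re.sub(r'\s+', ' ', text).strip()   # any other whitespace run becomes one space

print(normalize(plain_break))   # the rock because she
```

In the sample text from the question every line break follows a hyphen, so both approaches give the same result there.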