Python - Regex - match characters between certain characters
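I have a text file and I want to match/findall/parse all characters that sit between certain characters ([\n" text to match "\n]). The texts themselves can differ a lot from each other in structure and in the characters they contain (they can contain every possible character).
I posted this question before (sorry for the duplicate), but the problem has not been solved so far, so I am now trying to describe it more precisely.
The text in the file is built up like this: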
test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
The output I want should be a list (for example) with each text between the delimiters as one element, like this:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']
I tried to solve this with regex and came up with two attempts, shown here with their output:
my_list = re.findall(r'(?<=\[\n {8}\").*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.']
Well, this one is close. It lists the first two elements as intended, but unfortunately not the third one, because it contains newlines.
my_list = re.findall(r'(?<=\[\n {8}\")[\s\S]*(?=\"\n {8}\])', test)
print (my_list)
['this is a text and its supposed to contain every possible char."\n ], \n [\n "like *.;#]§< and many "" more."\n ], \n [\n "plus there are even\nnewlines\n \n in it.']
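OK, this time every element is included, but the list contains only one element, and the lookahead does not seem to work the way I thought it would.
So what is the right regex to get my desired output?
Why does the second approach not include the lookahead?
Or is there a cleaner, faster way to get what I want (BeautifulSoup or other methods)?
Many thanks for any help and hints.
I am using Python 3.6.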
You should use the re.DOTALL flag so that the dot also matches newlines (note the non-greedy .*? as well):
print(re.findall(r'\[\n\s+"(.*?)"\n\s+\]', test, re.DOTALL))
Output:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even\nnewlines\n\nin it.']
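The two attempts in the question fail for different reasons, and a small experiment can illustrate both. This is only a sketch on a shortened, made-up string (sample is hypothetical, not the asker's file), and it assumes the bracket and quote lines are indented, as the {8} in the question's patterns and the \s+ in the pattern above imply. Without re.DOTALL the dot stops at a newline. With a greedy * the match runs on to the last possible closing delimiter, so the lookahead is honoured, just only at the very end.

```python
import re

# Hypothetical, shortened sample in the same shape as the question's data.
sample = (
    '    [\n'
    '    "one"\n'
    '    ],\n'
    '    [\n'
    '    "two\n'
    'lines"\n'
    '    ]'
)

# 1) No DOTALL: "." cannot cross "\n", so the multi-line item is skipped.
no_dotall = re.findall(r'\[\n\s+"(.*?)"\n\s+\]', sample)

# 2) DOTALL with a greedy ".*": the match extends to the LAST closing
#    delimiter, so everything ends up in a single element.
greedy = re.findall(r'\[\n\s+"(.*)"\n\s+\]', sample, re.DOTALL)

# 3) DOTALL with a lazy ".*?": each match stops at the FIRST closing delimiter.
lazy = re.findall(r'\[\n\s+"(.*?)"\n\s+\]', sample, re.DOTALL)

print(no_dotall)
print(greedy)
print(lazy)
```

The same reasoning applies to the lookaround version in the question: changing [\s\S]* to [\s\S]*? should make it stop at the first closing delimiter.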
You can use the pattern
(?s)\[[^"]*"(.*?)"[^]"]*\]
to capture each element between the double quotes inside the square brackets:
https://regex101.com/r/SguEAU/1
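Then you can use a list comprehension with re.sub to replace each run of whitespace characters (including newlines) in every captured substring with a single regular space: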
test ="""
[
"this is a text and its supposed to contain every possible char."
],
[
"like *.;#]§< and many "" more."
],
[
"plus there are even
newlines
in it."
]"""
output = [re.sub(r'\s+', ' ', m.group(1)) for m in re.finditer(r'(?s)\[[^"]*"(.*?)"[^]"]*\]', test)]
Result:
['this is a text and its supposed to contain every possible char.', 'like *.;#]§< and many "" more.', 'plus there are even newlines in it.']
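For reuse, the same idea can be wrapped in a small function. The following is a sketch under stated assumptions rather than a definitive implementation: the names ITEM_RE and extract_quoted_items are made up for illustration, and the sample string assumes 8 spaces of indentation in front of the brackets and quotes. The pattern is the one given above, spelled out with re.VERBOSE so each part can carry a comment.

```python
import re

# The pattern from the answer above, written in verbose mode.
# ITEM_RE and extract_quoted_items are hypothetical names.
ITEM_RE = re.compile(
    r'''
    \[          # opening square bracket
    [^"]*       # skip everything before the opening double quote
    "(.*?)"     # capture the item text lazily
    [^]"]*      # skip characters that are neither a bracket nor a quote
    \]          # closing square bracket
    ''',
    re.DOTALL | re.VERBOSE,
)


def extract_quoted_items(text):
    """Return each quoted item with runs of whitespace collapsed to one space."""
    return [re.sub(r'\s+', ' ', m.group(1)) for m in ITEM_RE.finditer(text)]


# Assumed layout: 8 spaces of indentation before brackets and quotes.
sample = '''
        [
        "this is a text and its supposed to contain every possible char."
        ],
        [
        "like *.;#]§< and many "" more."
        ],
        [
        "plus there are even
newlines
in it."
        ]'''

items = extract_quoted_items(sample)
print(items)
```

One caveat: the pattern ends an item at the first double quote that is followed by a closing bracket with no other quote or bracket in between. An item such as "ab" cd] ef" would therefore be cut off after ab, so this approach relies on the data not containing that combination.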