Extract all substrings between two markers
I have a string:
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
What I want is the list of substrings between the markers start="&marker1" and end="/\n". So the expected result is:
whatIwant = ["The String that I want", "Another string that I want"]
I have read the answers here:
- Find string between two substrings [duplicate]
- How to extract the substring between two markers?
and tried this, without success:
>>> import re
>>> mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
>>> whatIwant = re.search("&marker1(.*)/\n", mystr)
>>> whatIwant.group(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
What can I do to fix this? Also, I have a very long string:
>>> len(myactualstring)
7792818
Consider using re.findall:
import re
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
matches = re.findall(r'&marker1\n(.*?)\s*/\n', mystr)
print(matches)
This prints:
['The String that I want', 'Another string that I want']
Here is an explanation of the regex pattern:
&marker1    match the literal marker
\n          newline
(.*?)       match AND capture all content, as little as possible, until reaching
\s*         optional whitespace, followed by
/\n         / and newline
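Note that when the pattern contains a single capture group (...), re.findall returns only the content of that group, which is exactly what you want to extract.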
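Since the real string is several million characters long, it may be worth iterating over matches lazily instead of building the whole list at once. The following is only a sketch: the helper name iter_between_markers is hypothetical, and the pattern is the one from this answer, precompiled with re.compile and scanned with finditer.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
PATTERN = re.compile(r"&marker1\n(.*?)\s*/\n")


def iter_between_markers(text):
    """Yield each captured substring one at a time (hypothetical helper)."""
    for match in PATTERN.finditer(text):
        yield match.group(1)


mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
result = list(iter_between_markers(mystr))
print(result)
```

If you need every match in memory anyway, re.findall is simpler; the generator only helps when the matches can be processed one by one.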
> What can I do to fix this?
I would do it as follows:
import re
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
found = re.findall(r"\&marker1\n(.*?)/\n", mystr)
print(found)
Output:
['The String that I want ', 'Another string that I want ']
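Notes:
- & has no special meaning in re patterns, so escaping it (\&) is optional; both forms match a literal &
- . matches anything except a newline
- findall is a better fit than search if you just want the list of matched substrings
- *? is non-greedy; in this case .* would also work, because . does not match a newline, but in other cases you might end up matching more than you intended
- I used a so-called raw string (r prefix) to make escaping easier
Read the re module documentation for a discussion of raw string usage and the list of characters with special meaning.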
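The points about . and *? can be checked directly. The sketch below first reruns the pattern from the question, which finds nothing because . cannot cross the newline that follows the marker. It then uses the re.DOTALL flag (not part of the answers above, added here only for illustration) to show a case where greedy and non-greedy matching differ.

```python
import re

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

# The question's pattern: "." stops at the newline after the marker, so no match.
no_match = re.search("&marker1(.*)/\n", mystr)

# With re.DOTALL, "." also matches newlines, so greediness starts to matter.
greedy = re.findall(r"&marker1\n(.*)/\n", mystr, flags=re.DOTALL)
lazy = re.findall(r"&marker1\n(.*?)/\n", mystr, flags=re.DOTALL)

print(no_match)     # None
print(len(greedy))  # one match running from the first marker to the last "/\n"
print(lazy)         # two separate matches
```

Without re.DOTALL both quantifiers give the same two matches here, which is why the greedy form also works for this particular input.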