Extract all substrings between two markers for a very long string
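This is a follow-up to an earlier question. The answers provided by @Daweo and @Tim Biegeleisen work for small strings.
For very large strings, however, the regex does not seem to work. This may be because of a limit on the string length, as shown below: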
>>> import re
>>> teststr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
>>> for i in range(0, 23):
...     teststr += teststr  # creating a very long string here
...
>>> len(teststr)
603979776
>>> found = re.findall(r"\&marker1\n(.*?)/\n", newstr)
>>> len(found)
46
>>> found
['The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ', 'The String that I want ', 'Another string that I want ']
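For reference, a scaled-down version of the experiment above can be checked against the count one would expect. This is only a sketch: the helper names `build_test_string` and `count_matches` and the smaller number of doublings are assumptions chosen so that it runs quickly. It also searches the same variable that was built, whereas the transcript above passes `newstr` rather than `teststr` to `re.findall`, which may or may not be intentional.

```python
import re

# Same pattern as in the transcript; "&" needs no escaping in a regex.
PATTERN = re.compile(r"&marker1\n(.*?)/\n")

def build_test_string(doublings):
    """Build the repeated test string; each doubling doubles the record count."""
    s = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
    for _ in range(doublings):
        s += s
    return s

def count_matches(text):
    """Count matches lazily so no large result list is held in memory."""
    return sum(1 for _ in PATTERN.finditer(text))

small = build_test_string(10)   # 72 * 2**10 characters
print(len(small))               # 73728
print(count_matches(small))     # 2048, i.e. 2 * 2**10
```

If the count tracks `2 * 2**doublings` as the string grows, the pattern itself is behaving as intended at that size.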
What can I do to solve this and find all occurrences between the markers start="&marker1" and end="/\n"? What is the maximum string length that re can handle?
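I could not get re.findall to work. For now I still use re, but only to find the positions of the markers, and then I extract the substrings manually.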
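import re

# Start offsets of every start marker and of every end marker.
locs_start = [match.start() for match in re.finditer(r"&marker1", mylongstring)]
locs_end = [match.start() for match in re.finditer(r"/\n", mylongstring)]
substrings = []
for i in range(len(locs_start)):
    # Note: this slice still includes the "&marker1" prefix and the trailing "/".
    substrings.append(mylongstring[locs_start[i]:locs_end[i] + 1])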
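One possible alternative to the manual slicing above, sketched without `re`: walk the string with `str.find` and yield each substring lazily. The function name `iter_between` and its parameters are hypothetical, and the sketch assumes that markers are never nested and that the start marker includes its trailing newline.

```python
def iter_between(text, start="&marker1\n", end="/\n"):
    """Yield each substring found between `start` and the next `end`."""
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            return                  # no further start marker
        begin += len(start)         # skip past the start marker
        stop = text.find(end, begin)
        if stop == -1:
            return                  # unterminated final record: stop
        yield text[begin:stop]
        pos = stop + len(end)       # continue after the end marker

sample = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
print(list(iter_between(sample)))
# ['The String that I want ', 'Another string that I want ']
```

Because it is a generator, the results can be consumed one at a time instead of building a list with millions of entries. Unlike the `(.*?)` pattern used without `re.DOTALL`, this version also captures records that span several lines.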