Splitting strings by a list of separators irrespective of order
I have a string text and a list names. I want to split text every time an element of names occurs.
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
Expected output:
output = [['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]
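FAQ
- text does not always start with a names element. Thanks to VictorLee for pointing that out. I don't care about that leading part, but others might, so thanks to everyone who answered "both cases".
- The order of the separators within names is independent of their order of occurrence in text.
- The separators within names are unique, but each can occur multiple times throughout text. The output will therefore contain more lists than names contains strings.
- text will never have the same names element occurring twice in a row.
- Ultimately I want the output to be a list of lists in which each slice of text is paired with the separator it was split by. The order of the lists doesn't matter.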
re.split() doesn't let me pass a list as the separator argument. Can I re.compile() my list of separators?
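Update: Thomas' code works best for my case, but I noticed a caveat I hadn't been aware of before: some elements of names are preceded by 'Mrs.' or 'Mr.', while only some of the corresponding matches in text are preceded by 'Mrs.' or 'Mr.'.
So far: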
import re
names = ['Mr. Mike, ADS', 'Monika, TFO', 'Peter, WQR']
text1 = ['Mrs. Monika, TFO goes shopping. Then she rides bike. Mike, ADS likes Pizza. Monika, TFO hates me.']
text = str(text1)[1:-1]
def create_regex_string(name: str) -> str:
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    salutation, *name = name_components
    return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
regex_string = "|".join(create_regex_string(name) for name in names)
group_count = regex_string.count("(") + 1
fragments = re.split(f"({regex_string})", text)
if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names:
        fragments = fragments[1:]
    result = [[name, clist.rstrip()] for name, clist in zip(
        fragments[::group_count+1],
        fragments[group_count::group_count+1]
    ) if clist is not None
    ]
    print(result)
[['Monika, TFO', ' goes shopping. Then she rides bike.'], ['Mike, ADS', ' likes Pizza.'], ['Monika, TFO', " hates me.'"]]
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [86], in <module>
111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
115 group_count = regex_string.count("(") + 1
116 fragments = re.split(f"({regex_string})", clist)
Input In [86], in <genexpr>(.0)
111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
115 group_count = regex_string.count("(") + 1
116 fragments = re.split(f"({regex_string})", clist)
Input In [86], in create_regex_string(name)
109 if len(name_components) == 1:
110 return re.escape(name)
--> 111 salutation, *name = name_components
112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
ValueError: not enough values to unpack (expected at least 1, got 0)
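The message "expected at least 1, got 0" indicates that name_components was an empty list, which happens when an entry of the list being iterated (mps in the traceback) is an empty or whitespace-only string. A minimal sketch that reproduces the error and then skips such entries, assuming that skipping them is acceptable; build_pattern is a hypothetical helper name, not part of the code above:

```python
import re

def create_regex_string(name: str) -> str:
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    salutation, *rest = name_components
    return f"({re.escape(salutation)} )?{re.escape(' '.join(rest))}"

def build_pattern(names):
    # hypothetical helper: ignore empty / whitespace-only entries (assumption)
    usable = [name for name in names if name.strip()]
    return "|".join(create_regex_string(name) for name in usable)

# An empty name splits into [], so the starred unpacking has nothing to bind.
try:
    create_regex_string("")
except ValueError as exc:
    error_message = str(exc)

pattern = build_pattern(["Mr. Mike, ADS", "", "Monika, TFO"])
```

Whether empty entries should be skipped or reported depends on where the list of names comes from.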
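Your example doesn't quite match your desired output. It is also unclear whether the sample input always has this structure, for example a period at the end of every sentence.
Having said that, you might want to try this quick-and-dirty approach: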
import re
text = 'Monika will go shopping. Mike likes Pizza. Monika hates me.'
names = ['Ruth', 'Mike', 'Monika']
rsplit = re.compile("|".join(sorted(names))).split
output = []
sentences = text.split(".")
for name in names:
    for sentence in sentences:
        if name in sentence:
            output.append([name, f"{rsplit(sentence)[-1]}."])
print(output)
This outputs:
[['Mike', ' likes Pizza.'], ['Monika', ' will go shopping.'], ['Monika', ' hates me.']]
If you are looking for a way to do it with regular expressions, then:
import re
def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)
    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))
    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))
Prints:
[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]
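Explanation
First we dynamically build a regular expression, regex1, from the passed names argument:
(?=Mike|Monika)
This is a zero-width lookahead, so the split happens just before each name (splitting on a pattern that can match an empty string needs Python 3.7 or later). Because any of the passed names could appear at the very beginning or end of the input, the result may contain empty strings, so we filter them out and get: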
['Monika goes shopping. Then she rides bike. ', 'Mike likes Pizza. ', 'Monika hates me.']
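Then we split each of these strings on:
(Mike|Monika)
Again we filter out any empty strings to get the final result.
The key to all of this is that when the regex we split on contains a capture group, the text matched by that capture group is also returned as part of the resulting list.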
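To make the capture-group behaviour concrete, here is a small sketch comparing re.split with and without parentheses around the alternation. The sample sentence is made up for illustration:

```python
import re

sample = "Monika runs. Mike eats."

# Without a capturing group the separators are dropped from the result.
without_group = re.split("Mike|Monika", sample)
print(without_group)  # ['', ' runs. ', ' eats.']

# With a capturing group each matched separator is kept as its own element.
with_group = re.split("(Mike|Monika)", sample)
print(with_group)  # ['', 'Monika', ' runs. ', 'Mike', ' eats.']
```

The leading empty string appears because the sample starts with a separator.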
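Update
You did not specify what should happen if the input text does not begin with one of the names. On the assumption that you might want to ignore everything up to the first name, have a look at the following version. Likewise, if the text does not contain any of the names, the updated code simply returns an empty list: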
import re
def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)
    # raw string so that \s and \S reach the regex engine unchanged
    regex0 = re.compile('(' + joined_names + r')[\s\S]*')
    m = regex0.search(text)
    if not m:
        return []
    # keep only the part of text starting at the first name
    text = m.group(0)
    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))
    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]
text = 'I think Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))
Prints:
[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]
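One thing an alternation like Mike|Monika is sensitive to is a name that is a prefix of another name, because the regex engine takes the first alternative that matches. The question does not mention such names, so this is only a precaution: a sketch, assuming it is acceptable to reorder the names, with build_alternation as a hypothetical helper.

```python
import re

def build_alternation(names):
    # hypothetical helper: longest names first, so 'Mikey' wins over 'Mike'
    ordered = sorted(names, key=len, reverse=True)
    return "|".join(re.escape(name) for name in ordered)

names = ["Mike", "Mikey"]
text = "Mikey likes Pizza."

# Names in their given order: 'Mike' matches first and 'y' is left behind.
naive = re.split("(" + "|".join(map(re.escape, names)) + ")", text)
print(naive)  # ['', 'Mike', 'y likes Pizza.']

# Longest-first order: the full name is matched.
ordered = re.split("(" + build_alternation(names) + ")", text)
print(ordered)  # ['', 'Mikey', ' likes Pizza.']
```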
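As an alternative to a regex, you could also reformat text into a shape that gives the expected result with the plain split method, plus a bit of string handling.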
# works on Python 2 and Python 3; time complexity is O(n * m) for n words and m names
def do_split(text, names):
    my_sprt = '|'
    tmp_text_arr = text.split()
    # wrap every word that is a name in the custom separator
    for i in range(len(tmp_text_arr)):
        for sprt in names:
            if sprt == tmp_text_arr[i]:
                tmp_text_arr[i] = my_sprt + sprt + my_sprt
    tmp_text = ' '.join(tmp_text_arr)
    if tmp_text.startswith(my_sprt):
        tmp_text = tmp_text[1:]
    tmp_text_arr = tmp_text.split(my_sprt)
    # drop any text in front of the first name
    if tmp_text_arr[0] not in names:
        tmp_text_arr.pop(0)
    out_arr = []
    for i in range(0, len(tmp_text_arr) - 1, 2):
        out_arr.append([tmp_text_arr[i], tmp_text_arr[i + 1].rstrip()])
    return out_arr
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
# overrides the line above: a text that does not start with a name
text = 'today Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))
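This code also copes with a text that does not start with an element of names.
Key point: reformat the text value to |Monika| goes shopping. Then she rides bike. |Mike| likes Pizza. |Monika| hates me. and use a self-defined separator such as |, which must not appear in the original text.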
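Since the approach above breaks if the custom separator already occurs in the input, it may be worth checking for that first. A minimal sketch; pick_separator and its list of candidates are hypothetical and not part of the answer's code:

```python
def pick_separator(text, candidates=("|", "\x00", "\x1f")):
    # hypothetical helper: return the first candidate that is absent from text
    for candidate in candidates:
        if candidate not in text:
            return candidate
    raise ValueError("every candidate separator occurs in the text")

plain = pick_separator("Monika goes shopping.")  # '|' is free, so it is used
piped = pick_separator("a | b")                  # '|' is taken, falls back to '\x00'
```

The returned value could then be used in place of the hard-coded '|'.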
You can use re.split along with zip:
import re
from pprint import pprint
text = "Monika goes shopping. Then she rides bike. Mike likes Pizza." \
       "Monika hates me."
names = ["Henry", "Mike", "Monika"]
regex_string = "|".join(re.escape(name) for name in names)
fragments = re.split(f"({regex_string})", text)
if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names:
        fragments = fragments[1:]
    result = [
        [name, text.rstrip()]
        for name, text in zip(fragments[::2], fragments[1::2])
    ]
    pprint(result)
Output:
[['Monika', ' goes shopping. Then she rides bike.'],
 ['Mike', ' likes Pizza.'],
 ['Monika', ' hates me.']]
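Remarks:
- This is the answer to revision 9 of the question. Considering the changes in revision 11, there is an update at the end of this answer.
- You did not specify whether "text" before the first occurrence of a name should be considered. The script above ignores "text" before the first occurrence.
- You also did not specify what should happen if the text ends with a name. The script above includes that occurrence by pairing it with an empty string. This can easily be changed by removing the last element if its "text" is an empty string.
- zip works because the number of elements in fragments is always even at that point. The first element never matches a name (it is either text or an empty string), so we remove it, and if the text ends with a name the last element is always an empty string.
According to the documentation of re.split: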
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string [...]
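The behaviour quoted above is what makes the pairing with zip line up. A short sketch with two made-up inputs, one starting and one ending with a name:

```python
import re

pattern = "(Mike|Monika)"

starts_with_name = re.split(pattern, "Monika hates me.")
print(starts_with_name)  # ['', 'Monika', ' hates me.']

ends_with_name = re.split(pattern, "I like Mike")
print(ends_with_name)    # ['I like ', 'Mike', '']

# With one capturing group the result always has an odd number of elements
# (text, name, text, ..., name, text), so dropping the first one leaves
# complete (name, text) pairs.
pairs = list(zip(ends_with_name[1::2], ends_with_name[2::2]))
print(pairs)             # [('Mike', '')]
```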
Here is the same example, but without ignoring the "text" before the first occurrence:
import re
text = "Hi. Monika goes shopping. Then she rides bike. Mike likes Pizza." \
       "Monika hates me."
names = ["Henry", "Mike", "Monika"]
regex_string = "|".join(re.escape(name) for name in names)
fragments = re.split(f"({regex_string})", text)
if fragments:
    # not ignoring text before first occurrence; use empty string as name
    if fragments[0].strip() == "":
        fragments = fragments[1:]
    elif not fragments[0] in names:
        fragments = [""] + fragments
    result = [
        [name, text.rstrip()]
        for name, text in zip(fragments[::2], fragments[1::2])
    ]
    # # remove empty text
    # if result and not result[-1][1]:
    #     result = result[:-1]
    print(result)  # [['', 'Hi.'], ['Monika', ...] ..., ['Monika', ' hates me.']]
Update for question revision 11
After an attempt to include id345678's additional requirement:
import re
from pprint import pprint
def create_regex_string(name: str) -> str:
    # expects either a bare name or "<salutation> <name>"
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    salutation, name_part = name_components
    return f"({re.escape(salutation)} )?{re.escape(name_part)}"
text = "Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. " \
       "Mrs. Monika hates me. Henry needs a break."
names = ["Henry", "Dr. Mike", "Mrs. Monika"]
regex_string = "|".join(create_regex_string(name) for name in names)
group_count = regex_string.count("(") + 1
fragments = re.split(f"({regex_string})", text)
if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names:
        fragments = fragments[1:]
    result = [
        [name, text.rstrip()]
        for name, text in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        )
    ]
    pprint(result)
Output:
[['Monika', ' goes shopping. Then she rides bike.'],
 ['Dr. Mike', ' likes Pizza.'],
 ['Mrs. Monika', ' hates me.'],
 ['Henry', ' needs a break.']]
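Remarks:
- For the names above, the final regex string is (Henry|(Dr\. )?Mike|(Mrs\. )?Monika)
  - e.g. create_regex_string("Mrs. Monika") creates (Mrs\. )?Monika
  - it also works with other salutations (as long as a space separates the salutation from the name)
- Because we introduced additional groups in the regex, fragments has more values
  - therefore we need to change the line with zip so that it is dynamic
- If you don't want the salutation in result, you can use name.split()[-1] when creating result: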
result = [
    [name.split()[-1], text.rstrip()]
    for name, text in zip(
        fragments[::group_count+1],
        fragments[group_count::group_count+1]
    )
]
# [['Monika', ' goes shopping. Then she rides bike.'],
#  ['Mike', ' likes Pizza.'],
#  ['Monika', ' hates me.'],
#  ['Henry', ' needs a break.']]
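Please note: I updated the script during a break and did not test all use cases. Let me know if there are issues and I will look into them after work.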
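The group_count bookkeeping is needed only because the optional salutation is a capturing group. A possible variation, not the code from this answer, is to make it non-capturing with (?:...), so that re.split again yields plain name/text pairs with a stride of 2. A minimal sketch, assuming every entry of names is non-empty and that everything before the first space is the salutation:

```python
import re

def create_regex_string(name: str) -> str:
    # split off an optional leading salutation at the first whitespace
    components = name.split(maxsplit=1)
    if len(components) == 1:
        return re.escape(name)
    salutation, name_part = components
    # (?:...) groups without capturing, so re.split adds no extra elements
    return f"(?:{re.escape(salutation)} )?{re.escape(name_part)}"

text = ("Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. "
        "Mrs. Monika hates me. Henry needs a break.")
names = ["Henry", "Dr. Mike", "Mrs. Monika"]

regex_string = "|".join(create_regex_string(name) for name in names)
# [1:] drops whatever precedes the first name (an empty string here)
fragments = re.split(f"({regex_string})", text)[1:]
result = [[name, rest.rstrip()] for name, rest in zip(fragments[::2], fragments[1::2])]
print(result)
```

With only the outer group capturing, fragments never contains None entries, which removes the need for a None check on the text part.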
I took one of the solutions you were given and refactored it slightly.
# The default marker was lost in the source; '\x00' is an assumed placeholder.
# It must be non-empty (str.split rejects '') and must not occur in txt.
def split(txt, seps, actual_sep='\x00'):
    order = [item for item in txt.split() if item in seps]
    for sep in seps:
        txt = txt.replace(sep, actual_sep)
    return list(zip(order, [i.strip() for i in txt.split(actual_sep) if i.strip()]))
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(split(text, names))
Edited
Another solution that addresses some of the corner cases mentioned here.
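# The default marker was lost in the source; '\x00' is an assumed placeholder.
# It must be non-empty and must not occur in txt.
def split(txt, seps, sep_pack='\x00'):
    for sep in seps:
        txt = txt.replace(sep, f"{sep_pack}{sep}{sep_pack}")
    lst = txt.split(sep_pack)
    temp = []
    idx = 0
    for _ in range(len(lst)):
        if idx < len(lst):
            if lst[idx] in seps:
                # a separator followed by its text
                temp.append([lst[idx], lst[idx+1]])
                idx += 2
            else:
                # text without a preceding separator
                temp.append(['', lst[idx]])
                idx += 1
    return temp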
It's a bit ugly, though; I hope to improve it.
This one does not use re, unless you explicitly need it. It works for the given test case.
text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
def sep(text, names):
    foo = []
    new_text = text.split(' ')
    for i in new_text:
        if i in names:
            foo.append(new_text[:new_text.index(i)])
            new_text = new_text[new_text.index(i):]
    foo.append(new_text)
    foo = foo[1:]
    new_foo = []
    for i in foo:
        first, rest = i[0], i[1:]
        rest = " ".join(rest)
        i = [first, rest]
        new_foo.append(i)
    print(new_foo)
sep(text, names)
Gives the output:
[['Monika', 'goes shopping. Then she rides bike.'], ['Mike', 'likes Pizza.'], ['Monika', 'hates me.']]
It should work for other cases too.
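This is similar to some of the answers here, but simpler.
There are three steps:
1. find all occurrences of the separators
2. split out the remaining text
3. combine the results of (1) and (2) into a list of lists, as required
We could combine (1) and (2), but that would make creating the list of lists more complicated.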
import re
def split_on_names(names: list[str], text: str) -> list[list[str]]:
    pattern = re.compile("|".join(map(re.escape, names)))
    # step 1: find the separators (in order)
    separator = pattern.findall(text)
    # step 2: split out the text between separators
    remainder = list(filter(None, pattern.split(text)))
    # at this point, if `remainder` is longer, it's because `text`
    # didn't start with a separator. So, we add a blank separator
    # to account for the prefix.
    if len(remainder) > len(separator):
        separator = ["", *separator]
    # step 3: reshape the results into a list of lists
    return list(map(list, zip(separator, remainder)))
names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."
split_on_names(names, text)
# output:
#
# [
#     ['', 'Hi '],
#     ['Monika', ' goes shopping. Then she rides bike. '],
#     ['Mike', ' likes Pizza. '],
#     ['Monika', ' hates me.']
# ]
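For comparison, steps (1) and (2) can also be done in one pass with re.finditer, slicing the text between one match and the next. This is only a sketch of that alternative, with split_with_finditer as a hypothetical name; unlike the filter(None, ...) step above, it keeps an empty text when two separators are directly adjacent.

```python
import re

def split_with_finditer(names, text):
    pattern = re.compile("|".join(map(re.escape, names)))
    matches = list(pattern.finditer(text))
    result = []
    # text in front of the first separator (if any) gets a blank separator
    first_start = matches[0].start() if matches else len(text)
    if text[:first_start]:
        result.append(["", text[:first_start]])
    # pair every match with the text up to the next match (or the end)
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following is not None else len(text)
        result.append([current.group(0), text[current.end():end]])
    return result

names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."
pairs = split_with_finditer(names, text)
```

For the sample text, pairs holds the same four entries as the output shown above.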