Splitting strings by list of separators irrespective of order

I have a string text and a list names:

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

names = ['Mike', 'Monika']

Desired output:

output = [['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

Question

re.split() does not allow me to use a list as the separator argument. Can I re.compile() my list of separators?


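Update: Thomas's code works best for my case, but I noticed a caveat I had not been aware of before:

Some elements of names are preceded by 'Mrs.' or 'Mr.', while only some of the corresponding matches in text are preceded by 'Mrs.' or 'Mr.'.


So far: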

names = ['Mr. Mike, ADS', 'Monika, TFO', 'Peter, WQR']
text1 = ['Mrs. Monika, TFO goes shopping. Then she rides bike. Mike, ADS likes Pizza. Monika, TFO hates me.']
text = str(text1)[1:-1]

def create_regex_string(name: List[str]) -> str:
    name_components = name.split()
    if len(name_components) == 1:
        return re.escape(name)
    salutation, *name = name_components
    return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
    
regex_string = "|".join(create_regex_string(name) for name in names)
group_count = regex_string.count("(") + 1
fragments = re.split(f"({regex_string})", text)
if fragments:
    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names: 
        fragments = fragments[1:]
        result = [[name, clist.rstrip()] for name, clist in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        ) if clist is not None
    ]

print(result)
[['Monika, TFO', ' goes shopping. Then she rides bike.'], ['Mike, ADS', ' likes Pizza.'], ['Monika, TFO', " hates me.'"]]

Error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [86], in <module>
    111     salutation, *name = name_components
    112     return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
    115 group_count = regex_string.count("(") + 1
    116 fragments = re.split(f"({regex_string})", clist)

Input In [86], in <genexpr>(.0)
    111     salutation, *name = name_components
    112     return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"
--> 114 regex_string = "|".join(create_regex_string(name) for name in mps)
    115 group_count = regex_string.count("(") + 1
    116 fragments = re.split(f"({regex_string})", clist)

Input In [86], in create_regex_string(name)
    109 if len(name_components) == 1:
    110     return re.escape(name)
--> 111 salutation, *name = name_components
    112 return f"({re.escape(salutation)} )?{re.escape(' '.join(name))}"

ValueError: not enough values to unpack (expected at least 1, got 0)

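Your example does not exactly match your desired output. It is also unclear whether the example input always has this structure, e.g. a period at the end of every sentence.

That said, you might want to try this quick-and-dirty approach: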

import re

text = 'Monika will go shopping. Mike likes Pizza. Monika hates me.'

names = ['Ruth', 'Mike', 'Monika']
rsplit = re.compile("|".join(sorted(names))).split

output = []
sentences = text.split(".")
for name in names:
    for sentence in sentences:
        if name in sentence:
            output.append([name, f"{rsplit(sentence)[-1]}."])

print(output)

This outputs:

[['Mike', ' likes Pizza.'], ['Monika', ' will go shopping.'], ['Monika', ' hates me.']]

If you are looking for a regex-based approach, then:

import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)

    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))

    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

Prints:

[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

Explanation

First, we dynamically build a regex regex1 from the names argument that was passed in:

(?=Mike|Monika)

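When you split the input on this, any of the passed names may appear at the very beginning or end of the input, so you can end up with empty strings in the result. We therefore filter those out and get: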

['Monika goes shopping. Then she rides bike. ', 'Mike likes Pizza. ', 'Monika hates me.']

Then we split each of these strings on:

(Mike|Monika)

Again we filter out any empty strings to get the final result.

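The key to all of this is that when the regex we split on contains a capturing group, the text matched by that group is also returned as part of the resulting list.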
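A minimal sketch of that behaviour, using the two names from the question. The sample string and variable names are made up for illustration:

```python
import re

sample = "Monika hates me. Mike likes Pizza."

# Without a capturing group, the matched separators are dropped.
without_group = re.split("Mike|Monika", sample)

# With a capturing group, the matched separators are kept in the result.
with_group = re.split("(Mike|Monika)", sample)

print(without_group)  # ['', ' hates me. ', ' likes Pizza.']
print(with_group)     # ['', 'Monika', ' hates me. ', 'Mike', ' likes Pizza.']
```

The leading empty string appears because the sample starts with a name; that is why the answer filters out empty strings.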

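Update

You did not specify what should happen if the input text does not start with one of the names. Assuming you might want to ignore everything up to the first name that is found, have a look at the following version. Likewise, if the text contains none of the names, the updated code simply returns an empty list: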

import re

def do_split(text, names):
    joined_names = '|'.join(re.escape(name) for name in names)

    regex0 = re.compile('(' + joined_names + r')[\s\S]*')
    m = regex0.search(text)
    if not m:
        return []
    text = m.group(0)

    regex1 = re.compile('(?=' + joined_names + ')')
    strings = filter(lambda s: s != '', regex1.split(text))

    regex2 = re.compile('(' + joined_names + ')')
    return [list(filter(lambda s: s != '', regex2.split(s.rstrip()))) for s in strings]

text = 'I think Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

Prints:

[['Monika', ' goes shopping. Then she rides bike.'], ['Mike', ' likes Pizza.'], ['Monika', ' hates me.']]

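As an alternative to regex, you can also restructure text into a suitable format so that the plain split method yields the expected result, with a little extra string handling.

# works on Python 2 and Python 3; time complexity is O(n*m) for n words and m names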
def do_split(text, names):
    my_sprt = '|'
    tmp_text_arr = text.split()
    for i in range(len(tmp_text_arr)):
        for sprt in names:
            if sprt == tmp_text_arr[i]:
                tmp_text_arr[i] = my_sprt + sprt + my_sprt

    tmp_text = ' '.join(tmp_text_arr)
    if tmp_text.startswith(my_sprt):
        tmp_text = tmp_text[1:]

    tmp_text_arr = tmp_text.split(my_sprt)
    if tmp_text_arr[0] not in names:
        tmp_text_arr.pop(0)

    out_arr = []
    for i in range(0, len(tmp_text_arr) - 1, 2):
        out_arr.append([tmp_text_arr[i], tmp_text_arr[i + 1].rstrip()])
    return out_arr

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
text = 'today Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']
print(do_split(text, names))

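This code also handles a text that does not start with an element of names.

Key point: reformat the value of text into |Monika| goes shopping. Then she rides bike. |Mike| likes Pizza. |Monika| hates me. and use a self-defined separator such as |, which must not appear in the original text.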
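The intermediate string can be inspected with a short sketch like the one below. The shortened sample text, the marker variable, and the guard against a marker that already occurs in the text are assumptions added for illustration:

```python
text = "today Monika goes shopping. Mike likes Pizza."
names = ["Mike", "Monika"]
marker = "|"  # hypothetical separator; assumed not to occur in text

# Guard (an addition, not in the answer): refuse a marker the text already uses.
if marker in text:
    raise ValueError("marker already occurs in text; pick another one")

# Wrap every whitespace-delimited word that is exactly a name in the marker.
words = [marker + word + marker if word in names else word for word in text.split()]
marked = " ".join(words)
pieces = marked.split(marker)

print(marked)  # today |Monika| goes shopping. |Mike| likes Pizza.
print(pieces)  # ['today ', 'Monika', ' goes shopping. ', 'Mike', ' likes Pizza.']
```

Because the comparison is done word by word, a name followed directly by punctuation (for example "Mike,") would not be recognised by this approach.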

You can use re.split along with zip:

import re
from pprint import pprint

text = "Monika goes shopping. Then she rides bike. Mike likes Pizza." \
       "Monika hates me."

names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)

fragments = re.split(f"({regex_string})", text)

if fragments:

    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names: 
        fragments = fragments[1:]

    result = [
        [name, text.rstrip()] 
        for name, text in zip(fragments[::2], fragments[1::2])
    ]

    pprint(result)

Output:

[['Monika', ' goes shopping. Then she rides bike.'],
 ['Mike', ' likes Pizza.'],
 ['Monika', ' hates me.']]

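Notes:

  • This is an answer to revision 9 of the question.

    • There is an update at the end of this answer that takes the changes in revision 11 of the question into account.
  • You did not specify whether the "text" before the first occurrence of a name should be considered.

    • The script above ignores the "text" before the first occurrence.
  • You also did not specify what should happen if the text ends with a name.

    • The script above includes that occurrence by pairing it with an empty string. If an empty "text" is not wanted, this is easily solved by removing the last element.
  • zip works because, once the leading element has been dropped, the number of elements in fragments is always even. If the first element does not match a name (it is text or an empty string), we remove it, and if the text ends with a name, the last element is always an empty string.

According to the re.split documentation: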

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string [...]
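A small sketch of the quoted behaviour, with a made-up input that both starts and ends with a name:

```python
import re

# "Mike" matches at the very start and "Monika" at the very end of the string.
fragments = re.split("(Mike|Monika)", "Mike likes Monika")
print(fragments)  # ['', 'Mike', ' likes ', 'Monika', '']

# After dropping the leading element, names and texts alternate,
# so slicing with a step of 2 pairs them up.
pairs = list(zip(fragments[1::2], fragments[2::2]))
print(pairs)  # [('Mike', ' likes '), ('Monika', '')]
```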


Here is the same example, but without ignoring the "text" before the first occurrence:

import re

text = "Hi. Monika goes shopping. Then she rides bike. Mike likes Pizza." \
       "Monika hates me."

names = ["Henry", "Mike", "Monika"]

regex_string = "|".join(re.escape(name) for name in names)

fragments = re.split(f"({regex_string})", text)

if fragments:

    # not ignoring text before first occurrence; use empty string as name
    if fragments[0].strip() == "":
        fragments = fragments[1:]
    elif not fragments[0] in names:
        fragments = [""] + fragments

    result = [
        [name, text.rstrip()]
        for name, text in zip(fragments[::2], fragments[1::2])
    ]

    # # remove empty text
    # if result and not result[-1][1]:
    #     result = result[:-1]

    print(result)  # [['', 'Hi.'], ['Monika', ...] ..., ['Monika', ' hates me.']]


Update for revision 11 of the question

An attempt to incorporate the additional requirement from id345678:

import re
from pprint import pprint

def create_regex_string(name: str) -> str:

    name_components = name.split()

    if len(name_components) == 1:
        return re.escape(name)

    salutation, name_part = name_components

    return f"({re.escape(salutation)} )?{re.escape(name_part)}"

text = "Monika goes shopping. Then she rides bike. Dr. Mike likes Pizza. " \
       "Mrs. Monika hates me. Henry needs a break."

names = ["Henry", "Dr. Mike", "Mrs. Monika"]

regex_string = "|".join(create_regex_string(name) for name in names)

group_count = regex_string.count("(") + 1

fragments = re.split(f"({regex_string})", text)

if fragments:

    # ignoring text before first occurrence, not specified in requirements
    if not fragments[0] in names: 
        fragments = fragments[1:]

    result = [
        [name, text.rstrip()] 
        for name, text in zip(
            fragments[::group_count+1],
            fragments[group_count::group_count+1]
        )
    ]

    pprint(result)

Output:

[['Monika', ' goes shopping. Then she rides bike.'],
 ['Dr. Mike', ' likes Pizza.'],
 ['Mrs. Monika', ' hates me.'],
 ['Henry', ' needs a break.']]

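Notes:

  • The final regex string is (Henry|(Dr\. )?Mike|(Mrs\. )?Monika)

    • E.g. create_regex_string("Mrs. Monika") creates (Mrs\. )?Monika
    • It also works for other salutations (as long as a space separates the salutation from the name)
  • Because we introduced additional groups in the regex, fragments contains more values

    • Therefore the line with zip had to be changed so that it is dynamic
  • If you do not want the salutation in result, you can use name.split()[-1] when creating result: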

result = [
    [name.split()[-1], text.rstrip()] 
    for name, text in zip(
        fragments[::group_count+1],
        fragments[group_count::group_count+1]
    )
]

# [['Monika', ' goes shopping. Then she rides bike.'],
#  ['Mike', ' likes Pizza.'],
#  ['Monika', ' hates me.'],
#  ['Henry', ' needs a break.']]

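Please note: I updated the script during a break and have not tested all use cases. Let me know if there is an issue and I will look into it after work.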
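The ValueError shown in the question ("expected at least 1, got 0") is what star-unpacking raises when name.split() returns an empty list, i.e. when an entry in the list of names is empty or contains only whitespace. Below is one possible sketch that skips such entries and also copes with multi-word names such as 'Mr. Mike, ADS'. It assumes that a salutation is a first word ending in a period. split_with_salutations is a hypothetical helper, not part of the answer above, and it uses non-capturing groups so that no group counting is needed:

```python
import re


def create_regex_string(name: str) -> str:
    """Build a pattern for one name, making a leading salutation optional.

    Assumption: a salutation is a first word that ends in "." (e.g. "Mr.").
    """
    name = name.strip()
    first, _, rest = name.partition(" ")
    if rest and first.endswith("."):
        # (?:...) is non-capturing, so re.split does not emit an extra item for it.
        return f"(?:{re.escape(first)} )?{re.escape(rest)}"
    return re.escape(name)


def split_with_salutations(text, names):
    # Skip empty entries; an empty pattern would match at every position.
    patterns = [create_regex_string(n) for n in names if n.strip()]
    if not patterns:
        return []
    fragments = re.split("(" + "|".join(patterns) + ")", text)
    # fragments[0] is the text before the first name; it is ignored here.
    return [[name, rest.rstrip()]
            for name, rest in zip(fragments[1::2], fragments[2::2])]


names = ["Mr. Mike, ADS", "Monika, TFO", "Peter, WQR", ""]
text = ("Mrs. Monika, TFO goes shopping. Then she rides bike. "
        "Mike, ADS likes Pizza. Monika, TFO hates me.")
result = split_with_salutations(text, names)
print(result)
```

With this input the leading 'Mrs. ' is treated as text before the first name and dropped, because 'Monika, TFO' is listed without a salutation.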

I took one of the solutions given here and refactored it slightly.

def split(txt, seps, actual_sep='\0'):  # actual_sep must be non-empty and must not occur in txt
    order = [item for item in txt.split() if item in seps ]
    for sep in seps:
        txt = txt.replace(sep, actual_sep)
    return list( zip( order, [i.strip() for i in txt.split(actual_sep) if bool(i.strip())] ) )

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'
names = ['Mike', 'Monika']

print( split(text, names) )

Edit

Another solution that addresses some of the corner cases mentioned here.

def split(txt, seps, sep_pack='\0'):  # sep_pack must be non-empty and must not occur in txt
    for sep in seps:
        txt = txt.replace(sep, f"{sep_pack}{sep}{sep_pack}")
    
    lst = txt.split(sep_pack)
    temp = []
    idx = 0
    for _ in range(len(lst)):
        if idx < len(lst):
            if lst[idx] in seps:
                temp.append( [lst[idx], lst[idx+1]] )
                idx+=2
            else:
                temp.append( ['', lst[idx]] )
                idx+=1

    return temp

It is a bit ugly and could be improved.

This one does not use re, in case you do not explicitly need it. It works for the given test case.

text = 'Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me.'

names = ['Mike', 'Monika']

def sep(text, names):
    foo = []
    new_text = text.split(' ')
    for i in new_text:
        if i in names:
            foo.append(new_text[:new_text.index(i)])
            new_text = new_text[new_text.index(i):]
    foo.append(new_text)
    foo = foo[1:]

    new_foo = []
    for i in foo:
        first, rest = i[0], i[1:]
        rest = " ".join(rest)
        i = [first, rest]
        new_foo.append(i)
    print(new_foo)

sep(text, names)

Gives the output:

[['Monika', 'goes shopping. Then she rides bike.'], ['Mike', 'likes Pizza.'], ['Monika', 'hates me.']]

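It should work for other cases as well.

This is similar to some of the answers here, but simpler.

There are three steps:

  1. Find all occurrences of the separators
  2. Split out the remaining text
  3. Combine the results of (1) and (2) into a list of lists, as required

We could combine (1) and (2), but that would make creating the list of lists more complicated.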

import re

def split_on_names(names: list[str], text: str) -> list[list[str]]:
    pattern = re.compile("|".join(map(re.escape, names)))
    # step 1: find the separators (in order)
    separator = pattern.findall(text)
    # step 2: split out the text between separators
    remainder = list(filter(None, pattern.split(text)))

    # at this point, if `remainder` is longer, it's because `text` 
    # didn't start with a separator. So, we add a blank separator
    # to account for the prefix.
    if len(remainder) > len(separator):
        separator = ["", *separator]

    # step 3: reshape the results into a list of lists
    return list(map(list, zip(separator, remainder)))

names = ["Mike", "Monika"]
text = "Hi Monika goes shopping. Then she rides bike. Mike likes Pizza. Monika hates me."

split_on_names(names, text)

# output:
#
# [
#    ['', 'Hi '],
#    ['Monika', ' goes shopping. Then she rides bike. '],
#    ['Mike', ' likes Pizza. '],
#    ['Monika', ' hates me.']
# ]