Split strings that contain more than one substring

I have a list of strings, names:

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

I want to split the strings that contain more than one of the following substrings:

substrings = ['Vice president', 'Affiliate', 'Acquaintance']

More precisely, I want to split after the last character of the word that follows a substring:

desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']

I don't know how to implement the 'more than one' condition in my code:

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')
splitted = []
for i in names:
    if substrings in i:
        splitted.append([])
    splitted[-1].append(item)

Exception: when the last character is a period (e.g. Prof.), split after the second word following the substring.


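Update: names is more complicated than I thought and follows

  1. a title-like pattern, which has already been answered correctly ('Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose')
  2. until a second string pattern occurs ('Mister Kelly, AWS')
  3. until a third string pattern occurs, which runs to the end ('Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary')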

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']

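Sometimes Secretary is followed by different specifications. I don't care about the characters that sometimes follow Secretary before the next name appears. They can be discarded. Of course, 'Secretary' itself should be stored in updated_output.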

I created a (hopefully exhaustive) list, specifications, of what can follow Secretary. Here is a representation of the list: specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

Updated question: how can I use the specifications list to account for the third pattern?

updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']

Try:

import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    starts = [i.start() for i in r.finditer(n)]

    if not starts:
        out.append(n)
        continue

    if starts[0] != 0:
        starts = [0, *starts]

    starts.append(len(n))
    for a, b in zip(starts, starts[1::]):
        out.append(n[a:b])

print(out)

Prints:

['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
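
The pieces printed above keep the space that preceded the next title. If that is unwanted, each slice can be stripped before it is stored. The following is a minimal sketch of that variation, not a change to the approach itself; the names split_at_titles and stripped are made up for illustration, and the walrus operator requires Python 3.8 or later.

```python
import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

pattern = re.compile("|".join(map(re.escape, substrings)))


def split_at_titles(name):
    """Cut `name` at every position where one of the substrings starts."""
    starts = [m.start() for m in pattern.finditer(name)]
    # Make sure the first slice begins at index 0, even without a leading title.
    if not starts or starts[0] != 0:
        starts = [0, *starts]
    starts.append(len(name))
    # Strip each slice and skip slices that are empty after stripping.
    return [part for a, b in zip(starts, starts[1:]) if (part := name[a:b].strip())]


stripped = [part for name in names for part in split_at_titles(name)]
print(stripped)
# ['acquaintance Muller', 'Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
```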

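You want to split at the word boundary before one of these three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of the titles: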

>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']

You can then trim and discard empty results:

>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']

Given a list of input strings, simply apply this processing to each of them:

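import re


def get_names(s):
    # Split at every word boundary that is followed by one of the titles.
    v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
    # Trim each piece and drop the empty ones.
    return [x for i in v if (x := i.strip())]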


names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

output = []
for n in names:
    output.extend(get_names(n))

This gives:

output = ['Acquaintance Muller',
 'Vice president Johnson',
 'Affiliate Peterson',
 'Acquaintance Dr. Rose']
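
If the titles live in a list rather than in the pattern itself, the lookahead can be assembled from that list. Here is a sketch under that assumption; title_boundary, get_names_from_list and output_from_list are illustrative names only. Note that splitting on a pattern that can match an empty string needs Python 3.7 or later, and the walrus operator needs 3.8 or later.

```python
import re

substrings = ["Vice president", "Affiliate", "Acquaintance"]

# Escape each title so that characters such as "." are matched literally.
alternatives = "|".join(map(re.escape, substrings))
title_boundary = re.compile(rf"\b(?={alternatives})", flags=re.I)


def get_names_from_list(s):
    # Split at each zero-width position in front of a title, then trim
    # the pieces and drop the empty ones.
    return [x for part in title_boundary.split(s) if (x := part.strip())]


names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

output_from_list = [x for n in names for x in get_names_from_list(n)]
print(output_from_list)
# ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']
```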

For the updated input:

names = [
    'Acquaintance Muller',
    'Vice president Johnson ' 'Affiliate Peterson ' 'Acquaintance Dr. Rose',
    'Vice president Dr. John ' 'Mister Schmid, PRT ' 'Miss Robertson, FDU',
    'Mister Kelly, AWS',
    'Dr. Birker, Secretary of State '
    'Dr. Dews, Member '
    'Miss Berg, Secretary for Relations '
    'Dr. Jakob, Secretary',
]
substrings = [
    'Vice president',
    'Affiliate',
    'Acquaintance',
    'Dr.',
    'Miss',
    'Mister',
]

updated_output = [
    'Acquaintance Muller',
    'Vice president Johnson',
    'Affiliate Peterson',
    'Acquaintance Dr. Rose',
    'Vice president Dr. John',
    'Mister Schmid, PRT',
    'Miss Robertson, FDU',
    'Mister Kelly, AWS',
    'Dr. Birker, Secretary of State',
    'Dr. Dews, Member',
    'Miss Berg, Secretary for Relations',
    'Dr. Jakob, Secretary',
]


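def split_by_substrings(string):
    """Return (first entry, remaining text) for `string`."""
    # First occurrence of each substring, searching from index 1 so that
    # a substring at the very start of the string is not counted.
    indexes = [string.find(x, 1) for x in substrings]
    indexes = sorted(x for x in indexes if x > 0)

    if not indexes:
        return string.strip(), ''

    i = indexes[0]
    r = string[:i].strip()
    # If the head is more than a bare title, it is a complete entry.
    if r not in substrings:
        return r, string[i:]

    # The head is only a title (e.g. 'Vice president' before 'Dr.'),
    # so cut at the next substring instead.
    indexes = indexes[1:]
    if indexes:
        i = indexes[0]
        r = string[:i].strip()
        return r, string[i:]

    return string.strip(), ''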


result = []
for name in names:
    while name:
        r, name = split_by_substrings(name)
        result.append(r)


assert updated_output == result
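
The same result can probably also be reached with a regex split, and the specifications list from the question can then be used to drop whatever follows Secretary. The sketch below is only one possible way to do this, not a tested general solution: split_names, drop_specification, boundary and spec_suffix are made-up names, matching is case-sensitive, and it assumes a bare title (such as 'Acquaintance' before 'Dr.') always belongs to the piece that follows it. It needs Python 3.8 or later.

```python
import re

substrings = ['Vice president', 'Affiliate', 'Acquaintance', 'Dr.', 'Miss', 'Mister']
specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

# Zero-width split point in front of every title-like substring.
boundary = re.compile(r"\b(?=" + "|".join(map(re.escape, substrings)) + ")")

# Any non-empty specification that directly follows 'Secretary' at the end.
spec_suffix = re.compile(
    r"(?<=Secretary)(?:" + "|".join(re.escape(s) for s in specifications if s) + r")$"
)


def split_names(string):
    pieces = [p for raw in boundary.split(string) if (p := raw.strip())]
    merged, carry = [], ""
    for piece in pieces:
        if carry:
            piece, carry = f"{carry} {piece}", ""
        if piece in substrings:
            carry = piece  # bare title: glue it to the next piece
        else:
            merged.append(piece)
    if carry:
        merged.append(carry)
    return merged


def drop_specification(entry):
    # 'Dr. Birker, Secretary of State' -> 'Dr. Birker, Secretary'
    return spec_suffix.sub("", entry)


names = [
    'Acquaintance Muller',
    'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose',
    'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU',
    'Mister Kelly, AWS',
    'Dr. Birker, Secretary of State Dr. Dews, Member '
    'Miss Berg, Secretary for Relations Dr. Jakob, Secretary',
]

regex_result = [entry for name in names for entry in split_names(name)]
trimmed_result = [drop_specification(entry) for entry in regex_result]
print(trimmed_result[-4:])
# ['Dr. Birker, Secretary', 'Dr. Dews, Member', 'Miss Berg, Secretary', 'Dr. Jakob, Secretary']
```

Here regex_result keeps the specifications, as updated_output does, while trimmed_result discards them, as the text of the question describes.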