Split by elements of a list of strings, and create a dictionary with {element used to split: that chunk of text}

Consider the following text:

"Mr. McCONNELL. yadda yadda jon stewart is mean to me. The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. Mr. REID. Really dude?" 

And a list of words to split on:

["McCONNELL", "PRESIDING OFFICER", "REID"]

I want the output to be the dictionary

{"McCONNELL": "yadda yadda jon stewart is mean to me. but noooo.", 
"PRESIDING OFFICER": "Suck it up. ",
"REID": "Really dude?"}

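So I need a way to split on the elements of the list (on any of those names), know which name each split happened on, and map that name to the chunk of text in that split. If several chunks share the same speaker ("McCONNELL" in the example), simply concatenate the strings.

EDIT: Here is the function I have been using. It works for this example, but it is not robust when I try it at a larger scale (and it is not clear why it goes wrong).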

import re

def split_by_speaker(txt, seps):
    '''
    Given raw text and a list of separators (generally possible speaker names), splits based 
    on those names and returns a dictionary of text attributable to that name 
    '''
    speakers = []
    default_sep = seps[0]
    rv = {}

    for sep in seps:
        if sep in txt: 
            all_occurences = [m.start() for m in re.finditer(sep, txt)]
            for occ in all_occurences: 
                speakers.append((sep, occ))

            txt = txt.replace(sep, default_sep)
    temp_t = [i.strip() for i in txt.split(default_sep)][1:]
    speakers.sort(key = lambda x: x[1])
    for i in range(len(temp_t)): 
        if speakers[i][0] in rv: 
            rv[speakers[i][0]] = rv[speakers[i][0]] + " " + temp_t[i]
        else: 
            rv[speakers[i][0]] = temp_t[i]
    return rv 

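Use the re module from the standard library to define the split. Hint: the split "separator", which is a regular expression, can take the form (WORD1|WORD2|WORD3). Wrapping the alternatives in a capturing group makes re.split keep the matched names in the result list.

See these examples of what re.split returns.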

import re

text = "Mr. McCONNELL. yadda yadda jon stewart is mean to me. The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. Mr. REID. Really dude?"

speakers = ["McCONNELL", "PRESIDING OFFICER", "REID"]

speakers_re = re.compile('(' + '|'.join([re.escape(s) for s in speakers]) + ')')

print(speakers_re.split(text))

Result:

['Mr. ', 'McCONNELL', 
 '. yadda yadda jon stewart is mean to me. The ', 
 'PRESIDING OFFICER', '. Suck it up. Mr. ', 
 'McCONNELL', '. but noooo. Mr. ', 'REID', '. Really dude?']

The unwanted punctuation can then be removed with a regular expression or with the string's simple .rstrip() and .lstrip() methods.
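
To get from that list to the dictionary the question asks for, one possible approach is to pair each captured name with the chunk that follows it and join the chunks per speaker. The following is only a sketch under a few assumptions that are not in the original post: the honorific pattern (Mr., Mrs., Ms., The) is guessed from the sample text, the names are sorted longest first in case one name is a prefix of another, and split_by_speaker here is a rewritten helper, not the asker's function.

```python
import re

def split_by_speaker(text, speakers):
    # Longest names first, so a short name cannot shadow a longer one
    # that starts with it (alternation takes the first matching branch).
    ordered = sorted(speakers, key=len, reverse=True)
    # The capturing group makes re.split keep the matched names.
    pattern = re.compile("(" + "|".join(re.escape(s) for s in ordered) + ")")
    parts = pattern.split(text)

    # Assumed honorifics that lead into the next speaker's name.
    trailing_title = re.compile(r"\s*\b(?:Mr\.|Mrs\.|Ms\.|The)\s*$")

    chunks_by_speaker = {}
    # parts looks like [preamble, name, chunk, name, chunk, ...]
    for name, chunk in zip(parts[1::2], parts[2::2]):
        chunk = chunk.lstrip(". ")            # drop the ". " left after the name
        chunk = trailing_title.sub("", chunk)  # drop a dangling "Mr. " / "The "
        chunks_by_speaker.setdefault(name, []).append(chunk)

    # Concatenate all chunks that belong to the same speaker.
    return {name: " ".join(chunks) for name, chunks in chunks_by_speaker.items()}

text = ("Mr. McCONNELL. yadda yadda jon stewart is mean to me. "
        "The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. "
        "Mr. REID. Really dude?")
speakers = ["McCONNELL", "PRESIDING OFFICER", "REID"]
result = split_by_speaker(text, speakers)
print(result)
```

On the sample text this gives "yadda yadda jon stewart is mean to me. but noooo." for McCONNELL, "Suck it up." for PRESIDING OFFICER, and "Really dude?" for REID. On real transcripts the honorific pattern would likely need extending, and a name that also appears inside a speech would still be treated as a separator.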