Split by elements of a list, and create a dictionary with {element used to split: that chunk of text}

Question

Consider the following text:

"Mr. McCONNELL. yadda yadda jon stewart is mean to me. The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. Mr. REID. Really dude?"

and a list of words to split on:

["McCONNELL", "PRESIDING OFFICER", "REID"]

I want the output to be a dictionary:

{"McCONNELL": "yadda yadda jon stewart is mean to me. but noooo.",
"PRESIDING OFFICER": "Suck it up.",
"REID": "Really dude?"}

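So I need a way to split on the elements of a list (on any of these names), know which name each split happened on, and map that name to the chunk of text that follows it. If several chunks have the same speaker ("McCONNELL" in the example), just concatenate the strings.

Edit: here is the function I have been using. It works on the example, but it is not robust when I try it at a larger scale (and it is not clear why it goes wrong).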

import re


def split_by_speaker(txt, seps):
    '''
    Given raw text and a list of separators (generally possible speaker names), splits based 
    on those names and returns a dictionary of text attributable to that name 
    '''
    speakers = []
    default_sep = seps[0]
    rv = {}

    for sep in seps:
        if sep in txt: 
            all_occurences = [m.start() for m in re.finditer(re.escape(sep), txt)]
            for occ in all_occurences: 
                speakers.append((sep, occ))

            txt = txt.replace(sep, default_sep)
    temp_t = [i.strip() for i in txt.split(default_sep)][1:]
    speakers.sort(key = lambda x: x[1])
    for i in range(len(temp_t)): 
        if speakers[i][0] in rv: 
            rv[speakers[i][0]] = rv[speakers[i][0]] + " " + temp_t[i]
        else: 
            rv[speakers[i][0]] = temp_t[i]
    return rv

Answer 1

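Use the re module from the standard library to define the split. Hint: the split "separator" is a regular expression, and it can take the form (WORD1|WORD2|WORD3). Because the alternation is wrapped in a capturing group, re.split keeps the matched names in the result list.

See these examples of re.split results.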

import re

text = "Mr. McCONNELL. yadda yadda jon stewart is mean to me. The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. Mr. REID. Really dude?"

speakers = ["McCONNELL", "PRESIDING OFFICER", "REID"]

speakers_re = re.compile('(' + '|'.join([re.escape(s) for s in speakers]) + ')')

print(speakers_re.split(text))

Result:

['Mr. ', 'McCONNELL', 
 '. yadda yadda jon stewart is mean to me. The ', 
 'PRESIDING OFFICER', '. Suck it up. Mr. ', 
 'McCONNELL', '. but noooo. Mr. ', 'REID', '. Really dude?']

The leftover punctuation can then be removed with a regular expression or with the plain string methods .rstrip() and .lstrip().
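To get from that list to the dictionary asked for in the question, one option is to pair each matched name with the chunk that follows it and join the chunks per speaker. The following is only a sketch under some assumptions that are not in the original answer: the function name `split_by_speaker` is reused from the question, the cleanup strips a leading ". " and a trailing title ("Mr.", "Mrs.", "Ms." or "The"), and names are tried longest first so that a name that is a prefix of another name does not match too early. Adjust the title list to whatever your transcripts actually use.

```python
import re


def split_by_speaker(text, speakers):
    """Map each speaker name to all of the text that follows that name."""
    # Longest names first, so a short name cannot shadow a longer one
    # that starts with it (assumption: names are literal strings).
    ordered = sorted(speakers, key=len, reverse=True)
    # The capturing group makes re.split keep the matched names.
    pattern = re.compile('(' + '|'.join(re.escape(s) for s in ordered) + ')')
    parts = pattern.split(text)

    # Assumed cleanup rule: drop a title left at the end of a chunk.
    trailing_title = re.compile(r'\s*(?:Mr\.|Mrs\.|Ms\.|The)\s*$')

    chunks = {}
    # parts[0] is whatever precedes the first name. After that, names sit
    # at odd indices and the text following each name at even indices.
    for name, chunk in zip(parts[1::2], parts[2::2]):
        chunk = chunk.lstrip('. ')             # the ". " right after the name
        chunk = trailing_title.sub('', chunk)  # the next speaker's title
        if chunk:
            chunks.setdefault(name, []).append(chunk)

    # Concatenate the chunks of a speaker who appears more than once.
    return {name: ' '.join(pieces) for name, pieces in chunks.items()}


text = ("Mr. McCONNELL. yadda yadda jon stewart is mean to me. "
        "The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. "
        "Mr. REID. Really dude?")
result = split_by_speaker(text, ["McCONNELL", "PRESIDING OFFICER", "REID"])
print(result)
```

Because the names and their chunks come out of a single re.split call, they stay in document order and there are no character offsets to keep in sync, which is the part of the question's approach that becomes fragile once separators are replaced in the text.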

Tags: python, split, text-analysis, python-2.7