Split by elements of a string, and create a dictionary with {element used to split: that chunk of text}
Consider the following text:
"Mr. McCONNELL. yadda yadda jon stewart is mean to me. The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. Mr. REID. Really dude?"
and a list of words to split on:
["McCONNELL", "PRESIDING OFFICER", "REID"]
I want the output to be a dictionary:
{"McCONNELL": "yadda yadda jon stewart is mean to me. but noooo.",
 "PRESIDING OFFICER": "Suck it up.",
 "REID": "Really dude?"}
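So I need a way to split on any of the names in the list, know which name each split occurred on, and map that name to the chunk of text that follows it. If several chunks have the same speaker ("McCONNELL" in the example), just concatenate the strings.
Edit: this is the function I have been using. It works on the example, but it is not robust when I try it on larger inputs (and it is not clear why it breaks).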
def split_by_speaker(txt, seps):
    '''
    Given raw text and a list of separators (generally possible speaker names), splits based
    on those names and returns a dictionary of text attributable to that name
    '''
    speakers = []
    default_sep = seps[0]
    rv = {}
    for sep in seps:
        if sep in txt:
            all_occurences = [m.start() for m in re.finditer(sep, txt)]
            for occ in all_occurences:
                speakers.append((sep, occ))
            txt = txt.replace(sep, default_sep)
    temp_t = [i.strip() for i in txt.split(default_sep)][1:]
    speakers.sort(key = lambda x: x[1])
    for i in range(len(temp_t)):
        if speakers[i][0] in rv:
            rv[speakers[i][0]] = rv[speakers[i][0]] + " " + temp_t[i]
        else:
            rv[speakers[i][0]] = temp_t[i]
    return rv
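Use the re module from the standard library to define the split. Hint: the split "separator" is a regular expression, and it can take the form (WORD1|WORD2|WORD3). Because the alternatives are wrapped in a capturing group, re.split keeps the matched names in the result list.
See these examples of what re.split returns.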
import re

text = "Mr. McCONNELL. yadda yadda jon stewart is mean to me. The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. Mr. REID. Really dude?"
speakers = ["McCONNELL", "PRESIDING OFFICER", "REID"]
# Escape each name so it is matched literally, then join the names into one capturing group.
speakers_re = re.compile('(' + '|'.join(re.escape(s) for s in speakers) + ')')
print(speakers_re.split(text))
Result:
['Mr. ', 'McCONNELL',
'. yadda yadda jon stewart is mean to me. The ',
'PRESIDING OFFICER', '. Suck it up. Mr. ',
'McCONNELL', '. but noooo. Mr. ', 'REID', '. Really dude?']
The unwanted punctuation can then be removed with a regular expression or with the simple .rstrip() and .lstrip() string methods.
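To go from that list to the dictionary the question asks for, something along these lines might work. It is only a sketch: the function name speeches_by_speaker and the TRAILING_TITLE pattern are made up for illustration, and the list of titles ("Mr.", "Mrs.", "Ms.", "The") is an assumption based on the sample text, not something the real data is known to follow. It relies on the fact that, after the first element, the re.split result alternates between a name and the text that follows it.

```python
import re
from collections import defaultdict

# Hypothetical helper pattern: a title left dangling at the end of a chunk
# because it belongs to the next speaker (assumed set of titles).
TRAILING_TITLE = re.compile(r"\s*\b(?:Mr|Mrs|Ms|The)\.?\s*$")


def speeches_by_speaker(text, speakers):
    """Map each speaker name to the concatenated text that follows it."""
    # Longest names first, so a name that is a prefix of another cannot shadow it.
    ordered = sorted(speakers, key=len, reverse=True)
    pattern = re.compile("(" + "|".join(re.escape(s) for s in ordered) + ")")
    parts = pattern.split(text)

    chunks = defaultdict(list)
    # parts[0] is whatever precedes the first name; after that the list
    # alternates: parts[1] is a name, parts[2] its text, and so on.
    for name, chunk in zip(parts[1::2], parts[2::2]):
        chunk = TRAILING_TITLE.sub("", chunk)  # drop the next speaker's title
        chunk = chunk.lstrip(". ").strip()     # drop the ". " left after the name
        if chunk:
            chunks[name].append(chunk)

    # Chunks from the same speaker are joined with a single space.
    return {name: " ".join(texts) for name, texts in chunks.items()}


text = ("Mr. McCONNELL. yadda yadda jon stewart is mean to me. "
        "The PRESIDING OFFICER. Suck it up. Mr. McCONNELL. but noooo. "
        "Mr. REID. Really dude?")
speakers = ["McCONNELL", "PRESIDING OFFICER", "REID"]
print(speeches_by_speaker(text, speakers))
```

Unlike the function in the question, this never rewrites the text before recording where each name occurs, so the name and its chunk cannot drift out of step. A chunk that genuinely ends in one of the assumed titles would lose that word, so the pattern may need adjusting for real transcripts.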