Extract individual speech acts from call transcript

I have call transcript data like this:

'[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. 
[0:00:10] spk1 : sure, let me know the issue'

I want the text spoken by spk1 separated from the text spoken by spk2.

I tried:

import re

text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"

m = re.search('\](.+?)\[', text)
if m:
    found = m.group
found

But I did not get the result I was after.
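
A note on why the attempt above yields nothing usable: `m.group` is referenced without being called, so `found` ends up as a bound method rather than a string, and `re.search` stops at the first match in any case. A minimal sketch of one possible fix, assuming every utterance is introduced by a `[h:mm:ss] spkN :` prefix, uses `re.findall` with a lookahead so that the last utterance (which has no following `[`) is captured too. The names `pattern`, `pairs`, `spk1_text` and `spk2_text` are introduced here for illustration only:

```python
import re

text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"

# Capture the speaker label and the text up to the next timestamp or the end of the string.
# Assumption: speakers are always labelled "spk" followed by digits.
pattern = r"\]\s*(spk\d+)\s*:\s*(.*?)\s*(?=\[\d+:\d\d:\d\d\]|$)"
pairs = re.findall(pattern, text)  # list of (speaker, utterance) tuples

spk1_text = [utterance for speaker, utterance in pairs if speaker == "spk1"]
spk2_text = [utterance for speaker, utterance in pairs if speaker == "spk2"]

print(spk1_text)
print(spk2_text)
```

This drops the timestamps; if they matter, the token-based approach keeps them.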

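Assuming you want to keep the order, the timestamps and the speaker information, and to allow a fairly flexible sequence (any number of speakers, and the same speaker talking across two or more consecutive timestamps):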

import re

text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"

conversation_dict_list = []
# iterate over tokens split by whitespaces
for token in text.split(): 
    # timestamp: add new dict to list, add time and empty speaker and empty text 
    if re.fullmatch(r"\[\d+:\d\d:\d\d\]", token):
        conversation_dict_list.append({"time": token[1:-1], "speaker": None, "text": ""})
    # speaker: fill speaker field
    elif re.fullmatch(r"spk\d+", token):
        conversation_dict_list[-1]["speaker"] = token
    # text: keep concatenating to text field (plus whitespace)
    else:  
        conversation_dict_list[-1]["text"] += " " + token

# remove leading " : " from each text
conversation_dict_list = [{k_:(v_ if k_ != "text" else v_[3:]) for k_,v_ in d.items()} for d in conversation_dict_list]

print(conversation_dict_list)

which returns:

> [{'time': '0:00:00', 'speaker': 'spk1', 'text': 'Hi how are you'}, {'time': '0:00:02', 'speaker': 'spk2', 'text': 'I am good, need help on my phone.'}, {'time': '0:00:10', 'speaker': 'spk1', 'text': 'sure, let me know the issue'}]

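Obviously, this only works if the input always follows the exact pattern [h:mm:ss] spkX. If, for example, several speakers appear under the same timestamp, the speaker field is overwritten by the last one and all of their text ends up in a single entry.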
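
To get from this list to what the question asks for (each speaker's text on its own), the entries can be grouped by their `speaker` field. A minimal sketch, assuming the list has exactly the shape shown in the output above; `text_by_speaker` is a name introduced here for illustration:

```python
from collections import defaultdict

# Same structure as the list printed above.
conversation_dict_list = [
    {"time": "0:00:00", "speaker": "spk1", "text": "Hi how are you"},
    {"time": "0:00:02", "speaker": "spk2", "text": "I am good, need help on my phone."},
    {"time": "0:00:10", "speaker": "spk1", "text": "sure, let me know the issue"},
]

# Collect each speaker's utterances in the order they were spoken.
text_by_speaker = defaultdict(list)
for turn in conversation_dict_list:
    text_by_speaker[turn["speaker"]].append(turn["text"])

print(dict(text_by_speaker))
```

Grouping after parsing keeps the timestamped list intact, so the original order of the conversation is still available if needed.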