Extract individual speech acts from call transcript
I have call transcript data like this:
'[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone.
[0:00:10] spk1 : sure, let me know the issue'
I want the text for spk1 separated from the text for spk2.
I tried:
import re
text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"
m = re.search('\](.+?)\[', text)
if m:
    found = m.group
found
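But I am not getting the answer.
Assuming you want to keep the order, time, and speaker information, and allow a relatively dynamic sequence (a flexible number of speakers, and the same speaker talking across two or more consecutive timestamps):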
import re

text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"
conversation_dict_list = []
# iterate over tokens split by whitespace
for token in text.split():
    # timestamp: add new dict to list, add time and empty speaker and empty text
    if re.fullmatch(r"\[\d+:\d\d:\d\d\]", token):
        conversation_dict_list.append({"time": token[1:-1], "speaker": None, "text": ""})
    # speaker: fill speaker field
    elif re.fullmatch(r"spk\d+", token):
        conversation_dict_list[-1]["speaker"] = token
    # text: keep concatenating to text field (plus whitespace)
    else:
        conversation_dict_list[-1]["text"] += " " + token
# remove leading " : " from each text
conversation_dict_list = [{k_: (v_ if k_ != "text" else v_[3:]) for k_, v_ in d.items()} for d in conversation_dict_list]
print(conversation_dict_list)
This returns:
> [{'time': '0:00:00', 'speaker': 'spk1', 'text': 'Hi how are you'}, {'time': '0:00:02', 'speaker': 'spk2', 'text': 'I am good, need help on my phone.'}, {'time': '0:00:10', 'speaker': 'spk1', 'text': 'sure, let me know the issue'}]
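Obviously, this only works if you always have the exact pattern [h:mm:ss] spkX, because if, for example, you have multiple speakers within the same timestamp, the speaker will be overwritten by the last one.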
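If the goal is just the text of each speaker kept apart, a regex-only variant may be enough. Note that re.search in the original attempt stops at the first match, and m.group is a method that has to be called (for example m.group(1)), so it could not return every turn. The sketch below assumes every turn follows the "[h:mm:ss] spkN : text" layout; the names turn_pattern and texts_by_speaker are made up for illustration.

```python
import re
from collections import defaultdict

text = "[0:00:00] spk1 : Hi how are you [0:00:02] spk2 : I am good, need help on my phone. [0:00:10] spk1 : sure, let me know the issue"

# Assumed turn layout: "[h:mm:ss] spkN : utterance".
# The lazy group stops right before the next timestamp or the end of the text.
turn_pattern = re.compile(
    r"\[(\d+:\d\d:\d\d)\]\s*(spk\d+)\s*:\s*(.*?)\s*(?=\[\d+:\d\d:\d\d\]|$)",
    re.DOTALL,
)

# Collect the utterances of each speaker in the order they occur.
texts_by_speaker = defaultdict(list)
for time, speaker, utterance in turn_pattern.findall(text):
    texts_by_speaker[speaker].append(utterance)

texts_by_speaker = dict(texts_by_speaker)
print(texts_by_speaker)
```

texts_by_speaker["spk1"] then holds the spk1 utterances in order, and " ".join(texts_by_speaker["spk1"]) gives them as a single string. Text that does not follow the assumed layout is silently skipped by this pattern.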