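String Segmentation
Solved
I have a string that contains a conversation between two people, along with their speaker tags.
I want to split that string into two substrings, one containing only speaker 1's speech and the other only speaker 2's.
This is the code I am using to get the transcript.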
operation = client.long_running_recognize(config, audio)
response = operation.result(timeout=10000)
result = response.results[-1]
words_info = result.alternatives[0].words

transcript = ''
tag = 1
speaker = ""
for word_info in words_info:
    if word_info.speaker_tag == tag:
        speaker = speaker + " " + word_info.word
    else:
        transcript += "speaker {}: {}".format(tag, speaker) + '\n'
        tag = word_info.speaker_tag
        speaker = "" + word_info.word
transcript += "speaker {}: {}".format(tag, speaker)
This transcribes both speaker 1 and speaker 2 into the same file.
Solved: the solution was much simpler. Thanks for your help.
transcript_1 = ''
transcript_2 = ''
for word_info in words_info:
    if word_info.speaker_tag == 1:
        #speaker += " "+word_info.word
        transcript_1 += " " + word_info.word
    elif word_info.speaker_tag == 2:
        #speaker += " "+word_info.word
        transcript_2 += " " + word_info.word
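If the recording might contain more than two speakers, the same idea can be sketched with one list per speaker tag, joined once at the end. This is only an illustration: WordInfo is a hypothetical stand-in for the API's word objects (only the word and speaker_tag attributes used above are modeled), and split_by_speaker is a name made up for this example.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for the API's word objects; only the two
# attributes used in the question (word, speaker_tag) are modeled.
WordInfo = namedtuple("WordInfo", ["word", "speaker_tag"])


def split_by_speaker(words_info):
    """Return {speaker_tag: transcript} for any number of speakers."""
    words_by_speaker = defaultdict(list)
    for word_info in words_info:
        words_by_speaker[word_info.speaker_tag].append(word_info.word)
    # Joining once at the end also avoids the leading space that
    # repeated `+= " " + word` produces.
    return {tag: " ".join(words) for tag, words in words_by_speaker.items()}


sample = [
    WordInfo("hello", 1), WordInfo("there", 1),
    WordInfo("hi", 2),
    WordInfo("how", 1), WordInfo("are", 1), WordInfo("you", 1),
    WordInfo("fine", 2),
]
transcripts = split_by_speaker(sample)
print(transcripts[1])  # hello there how are you
print(transcripts[2])  # hi fine
```

With the real response, words_info from the question could be passed in place of sample, assuming its items expose the same two attributes.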
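It depends on how you get the data: whether you receive a single raw string containing all the messages from both speakers, or you receive each speaker's messages separately.
A basic approach is to establish the string "speaker N:" (where N is the speaker number) as the speaker tag, for example "speaker 1:" for the first speaker. You can then use NLTK and/or built-in functions such as find().
Note: when I talk about tags, I mean some expression that lets us determine whether a message comes from a given speaker.
Example:
You are given the full text containing all the speakers' interventions.
- Steps to follow:
1) Define all the speaker tags, so that their interventions can be told apart in the full text.
Example: the speaker tag for the first speaker could be "speaker 1:"
2) Use str.find("speaker_tag") to find all of a speaker's interventions.
3) Add each speaker's interventions to a separate data structure.
I think a list of interventions per speaker could be useful; then, if you want all of those interventions in a single text again, you can use a built-in function such as str.join() to join them back into one string.
Another option for solving this is to use a tool like NLTK (I think this tool is well suited to classifying text).
It has very useful features, such as tokenization, which I think would help with your problem.
In the example below I use find() and slicing as a basic example of text tokenization: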
Text data:
text = "speaker 1: hello everyone, I am Thomas speaker 2: Hello friends, I am John speaker 1: How are you? I am great being here speaker 2: It's the same for me"
Code example:
from itertools import islice, tee

FIRST_SPEAKER_TAG = "speaker 1:"
SECOND_SPEAKER_TAG = "speaker 2:"

def get_speaker_positions(text, speaker_tag):
    # collect the start index of every occurrence of speaker_tag in text
    positions = []
    position = text.find(speaker_tag)
    while position != -1:
        positions.append(position)
        # resume the search right after the tag that was just found
        position = text.find(speaker_tag, position + len(speaker_tag))
    return positions

def slices(iterable, n):
    # sliding window of n consecutive items: (a, b), (b, c), ...
    return zip(*(islice(it, i, None) for i, it in enumerate(tee(iterable, n))))

def get_text_interventions(text, speaker_tags):
    # speakers' interventions of the text
    interventions = {speaker_tag: "" for speaker_tag in speaker_tags}
    # positions where each intervention starts in the text
    # (the last one is used to get the rest of the text, because it's the
    # last intervention)
    # (we need to sort the positions to get the interventions in the correct
    # order)
    speaker_positions = [
        get_speaker_positions(text, speaker) for speaker in speaker_tags
    ]
    all_positions = [
        position for sublist in speaker_positions for position in sublist
    ]
    all_positions.append(len(text))
    all_positions.sort()
    # generate the list of pairs that match a certain intervention
    # the pairs are formed by the initial and the end position of the
    # intervention
    text_chunks = list(slices(all_positions, 2))
    for chunk in text_chunks:
        # we assign the intervention according to which
        # list of speaker interventions the position exists in
        # when slicing we add the speaker tag's length to exclude
        # the speaker tag from the intervention itself
        if chunk[0] in speaker_positions[0]:
            intervention = text[chunk[0]+len(speaker_tags[0]):chunk[1]]
            interventions[speaker_tags[0]] += intervention
        elif chunk[0] in speaker_positions[1]:
            intervention = text[chunk[0]+len(speaker_tags[1]):chunk[1]]
            interventions[speaker_tags[1]] += intervention
    return interventions

text_interventions = get_text_interventions(text, [FIRST_SPEAKER_TAG, SECOND_SPEAKER_TAG])
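For comparison, the same split can be sketched more compactly with the standard library's re.split instead of find() and slicing. This is a different technique from the one above, not a replacement for it; it assumes every tag has the exact form "speaker N:" with a numeric N, and split_transcript is a name made up for this example. It also keeps a list of interventions per speaker and only joins them at the end with str.join(), as suggested in step 3.

```python
import re


def split_transcript(text):
    """Return {speaker_number: [utterance, ...]} in order of appearance."""
    # The capturing group makes re.split keep the speaker number, so the
    # result alternates: [preamble, number, utterance, number, utterance, ...]
    parts = re.split(r"speaker (\d+):", text)
    utterances = {}
    for number, utterance in zip(parts[1::2], parts[2::2]):
        utterances.setdefault(int(number), []).append(utterance.strip())
    return utterances


text = ("speaker 1: hello everyone, I am Thomas "
        "speaker 2: Hello friends, I am John "
        "speaker 1: How are you? I am great being here "
        "speaker 2: It's the same for me")

by_speaker = split_transcript(text)
speaker_1 = " ".join(by_speaker[1])
speaker_2 = " ".join(by_speaker[2])
print(speaker_1)
print(speaker_2)
```

One limitation of this sketch: if a speaker happens to say something that itself matches "speaker N:", it would be treated as a tag, which is also true of the find()-based version.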
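Notes:
- Credit for the slices() function goes to the author of the following answer:
Find consecutive combinations
If in doubt, you can read more details in the itertools documentation.
If there is anything in the example you don't understand, feel free to ask me.
Hope this helps! =)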