Custom phrases/words are ignored by Google Speech-To-Text
I am using python3 to transcribe audio files with Google Speech-to-Text via the provided python package (google-speech).
There is an option to define custom phrases for the transcription, as described in the docs: https://cloud.google.com/speech-to-text/docs/speech-adaptation
For testing purposes I am using a small audio file containing the text:
[..] in this lecture we'll talk about the Burrows wheeler transform and the FM index [..]
I supplied the following phrases to see the effect, e.g. for the case where I want a specific name to be recognized with the correct spelling. In this example I want burrows to be changed to barrows:
config = speech.RecognitionConfig(dict(
    encoding=speech.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED,
    sample_rate_hertz=24000,
    language_code="en-US",
    enable_word_time_offsets=True,
    speech_contexts=[
        speech.SpeechContext(dict(
            phrases=["barrows", "barrows wheeler", "barrows wheeler transform"]
        ))
    ]
))
Unfortunately, this does not seem to have any effect, as the output is identical to the one produced without the context phrases.
Am I using the phrases wrong, or is the model so confident that the word it hears is indeed burrows that it ignores my phrases?
PS: I also tried using speech_v1p1beta1.AdaptationClient and speech_v1p1beta1.SpeechAdaptation instead of putting the phrases into the config, but this only gives me an internal server error with no additional information about what went wrong. https://cloud.google.com/speech-to-text/docs/adaptation
I have created an audio file to reproduce your scenario, and I was able to improve the recognition using model adaptation. To achieve this with this feature, I would suggest taking a look at this example and this post to better understand adaptation models.
Now, to improve the recognition of your phrase, I did the following:
- I created a new audio file using the following page with the phrase mentioned:
in this lecture we'll talk about the Burrows wheeler transform and the FM index
- My test is based on this code sample. This code creates a PhraseSet and a CustomClass that includes the word you would like to improve, in this case the word "barrows". You can also create/update/delete the phrase set and custom class using the Speech-To-Text GUI. Below is the code I used for the improvement.
from google.cloud import speech_v1p1beta1 as speech
import argparse
import io


def transcribe_with_model_adaptation(
    project_id="[PROJECT-ID]", location="global", speech_file=None,
    custom_class_id="[CUSTOM-CLASS-ID]", phrase_set_id="[PHRASE-SET-ID]"
):
    """
    Create `PhraseSet` and `CustomClass` resources to build custom lists of
    similar items that are likely to occur in your input data.
    """
    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()

    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"

    # Create the custom class resource
    adaptation_client.create_custom_class(
        {
            "parent": parent,
            "custom_class_id": custom_class_id,
            "custom_class": {
                "items": [
                    {"value": "barrows"}
                ]
            },
        }
    )
    custom_class_name = (
        f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"
    )

    # Create the phrase set resource
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 0,
                "phrases": [
                    {"value": f"${{{custom_class_name}}}", "boost": 10},
                    {"value": f"talk about the ${{{custom_class_name}}} wheeler transform", "boost": 15}
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name
    # print(u"Phrase set name: {}".format(phrase_set_name))

    # The next section shows how to use the newly created custom
    # class and phrase set to send a transcription request with speech adaptation

    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(
        phrase_set_references=[phrase_set_name])

    # Speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
        enable_word_time_offsets=True,
        model="phone_call",
        use_enhanced=True
    )

    # The audio file to transcribe; for a file in Cloud Storage, use
    # speech.RecognitionAudio(uri="gs://[BUCKET]/[FILE]") instead.
    with io.open(speech_file, "rb") as audio_file:
        content = audio_file.read()
    audio = speech.RecognitionAudio(content=content)

    # Create the speech client and send the transcription request
    speech_client = speech.SpeechClient()
    response = speech_client.recognize(config=config, audio=audio)

    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print("Transcript: {}".format(result.alternatives[0].transcript))
# [END speech_transcribe_with_model_adaptation]


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument("path", help="Path for audio file to be recognized")
    args = parser.parse_args()
    transcribe_with_model_adaptation(speech_file=args.path)
- After running it, you will get an improved recognition like the one below; however, note that the code tries to create a new custom class and a new phrase set each time it runs, so it will return an error if you try to re-create a custom class and phrase set that already exist.
- Recognition without adaptation:
(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the Burrows wheeler transform and the FM index
- Recognition with adaptation:
(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the barrows wheeler transform and the FM index
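Because the script fails once the custom class and phrase set already exist, a small cleanup step before re-running can help. The sketch below is my own addition, not part of the answer's code: the helper names (`resource_names`, `delete_adaptation_resources`) are hypothetical, and the delete calls require the google-cloud-speech package plus valid credentials when actually invoked.

```python
# Sketch of a cleanup step so the adaptation script can be re-run.
# Helper names here are hypothetical, not part of the Google client library.

def resource_names(project_id, location, custom_class_id, phrase_set_id):
    """Build the fully qualified resource names used by the Adaptation API."""
    base = f"projects/{project_id}/locations/{location}"
    return (
        f"{base}/customClasses/{custom_class_id}",
        f"{base}/phraseSets/{phrase_set_id}",
    )

def delete_adaptation_resources(project_id, location, custom_class_id, phrase_set_id):
    """Delete an existing custom class and phrase set (requires credentials)."""
    from google.cloud import speech_v1p1beta1 as speech

    custom_class_name, phrase_set_name = resource_names(
        project_id, location, custom_class_id, phrase_set_id
    )
    client = speech.AdaptationClient()
    # Delete the phrase set first, since its phrases reference the custom class.
    client.delete_phrase_set(name=phrase_set_name)
    client.delete_custom_class(name=custom_class_name)
```

Calling `delete_adaptation_resources(...)` with the same IDs used by the script above should let it run again cleanly; alternatively, the same cleanup can be done in the Speech-to-Text UI.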
Finally, I would like to add some notes about the improvement and the code I executed:
I used a flac audio file, as it is recommended for optimal results.
I used model="phone_call" and use_enhanced=True, as this was the model recognized by Cloud Speech-To-Text for my own audio file. Also, the enhanced models can deliver better results; you can see the documentation for more details. Note that this configuration may differ for your audio file.
Consider enabling data logging to Google to collect data from your audio transcription requests. Google then uses this data to improve its machine learning models used for recognizing speech audio.
Once the custom class and phrase set are created, you can use the Speech-to-Text UI to update them and run tests quickly.
I used the parameter boost in the phrase set. When you use boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data; the higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives.
I hope this information helps you improve your recognition results.