Custom phrases/words are ignored by Google Speech-To-Text

I am using python3 to transcribe audio files with Google Speech-To-Text via the provided python package (google-speech).

There is an option to define custom phrases to be used for the transcription, as described in the docs: https://cloud.google.com/speech-to-text/docs/speech-adaptation

For testing purposes I am using a small audio file containing the text:

[..] in this lecture we'll talk about the Burrows wheeler transform and the FM index [..]

I pass the following phrases to see the effect, e.g. whether a specific name would be recognized with the correct spelling. In this example I want to change burrows into barrows:

config = speech.RecognitionConfig(dict(
    encoding=speech.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED,
    sample_rate_hertz=24000,
    language_code="en-US",
    enable_word_time_offsets=True,
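    # Hint phrases meant to bias recognition toward the desired spelling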
    speech_contexts=[
        speech.SpeechContext(dict(
            phrases=["barrows", "barrows wheeler", "barrows wheeler transform"]
        ))
    ]
))

Unfortunately, this seems to have no effect: the output is still identical to the output without the context phrases.

Am I using the phrases wrong, or is the model so confident that the word it hears really is burrows that it ignores my phrases?

PS: I also tried using speech_v1p1beta1.AdaptationClient and speech_v1p1beta1.SpeechAdaptation instead of putting the phrases into the config, but this only gave me an internal server error with no additional information about what went wrong: https://cloud.google.com/speech-to-text/docs/adaptation

I created an audio file to reproduce your scenario, and I was able to improve the recognition using model adaptation. To achieve this, I suggest taking a look at this example and this post to better understand the adaptation model.

Now, to improve the recognition of your phrase, I did the following:

  1. I created a new audio file using the following page with the mentioned phrase (a programmatic alternative is sketched right after this list):

in this lecture we'll talk about the Burrows wheeler transform and the FM index

  2. My test was based on this code sample. This code creates a PhraseSet and a CustomClass that includes the word you would like to improve, in this case the word "barrows". You can also create/update/delete the phrase set and the custom class using the Speech-To-Text GUI. Below is the code I used for the improvement.
from google.cloud import speech_v1p1beta1 as speech
import argparse


def transcribe_with_model_adaptation(
    project_id="[PROJECT-ID]", location="global", speech_file=None, custom_class_id="[CUSTOM-CLASS-ID]", phrase_set_id="[PHRASE-SET-ID]"
):
    """
    Create `PhraseSet` and `CustomClasses` to create custom lists of similar
    items that are likely to occur in your input data.
    """
    import io

    # Create the adaptation client
    adaptation_client = speech.AdaptationClient()

    # The parent resource where the custom class and phrase set will be created.
    parent = f"projects/{project_id}/locations/{location}"

    # Create the custom class resource
    adaptation_client.create_custom_class(
        {
            "parent": parent,
            "custom_class_id": custom_class_id,
            "custom_class": {
                "items": [
                    {"value": "barrows"}
                ]
            },
        }
    )
    custom_class_name = (
        f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"
    )
    # Create the phrase set resource
    phrase_set_response = adaptation_client.create_phrase_set(
        {
            "parent": parent,
            "phrase_set_id": phrase_set_id,
            "phrase_set": {
                "boost": 0,
                "phrases": [
                    {"value": f"${{{custom_class_name}}}", "boost": 10},
                    {"value": f"talk about the ${{{custom_class_name}}} wheeler transform", "boost": 15}
                ],
            },
        }
    )
    phrase_set_name = phrase_set_response.name
    # print(u"Phrase set name: {}".format(phrase_set_name))
 
    # The next section shows how to use the newly created custom
    # class and phrase set to send a transcription request with speech adaptation

    # Speech adaptation configuration
    speech_adaptation = speech.SpeechAdaptation(
        phrase_set_references=[phrase_set_name])

    # speech configuration object
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=24000,
        language_code="en-US",
        adaptation=speech_adaptation,
        enable_word_time_offsets=True,
        model="phone_call",
        use_enhanced=True
    )

    # The name of the audio file to transcribe
    # storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
    with io.open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
    # audio = speech.RecognitionAudio(uri="gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav")

    # Create the speech client
    speech_client = speech.SpeechClient()

    response = speech_client.recognize(config=config, audio=audio)

    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))



if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument("path", help="Path for audio file to be recognized")
    args = parser.parse_args()

    transcribe_with_model_adaptation(speech_file=args.path)


  3. After running it, you will get an improved recognition like the one below; however, consider that the code tries to create a new custom class and a new phrase set on every run, so it will fail if it tries to re-create a custom class and phrase set that already exist (a minimal guard is sketched after the transcript outputs below).
  • Recognition without adaptation
(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the Burrows wheeler transform and the FM index

  • Recognition with adaptation
(python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
Transcript: in this lecture will talk about the barrows wheeler transform and the FM index
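
Regarding the re-run failure mentioned in step 3, here is a minimal sketch of one way to guard the creation calls, assuming the same adaptation_client, parent, and custom_class_id as in the code above (this idempotent wrapper is my own addition, not part of the original sample):

from google.api_core.exceptions import AlreadyExists


def create_custom_class_if_missing(adaptation_client, parent, custom_class_id):
    """Create the custom class, reusing it if a previous run already created it."""
    try:
        adaptation_client.create_custom_class(
            {
                "parent": parent,
                "custom_class_id": custom_class_id,
                "custom_class": {"items": [{"value": "barrows"}]},
            }
        )
    except AlreadyExists:
        # Created by a previous run; keep using the existing resource.
        pass

The create_phrase_set call can be wrapped in the same way.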
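
And for step 1: I generated my test clip from a text-to-speech web page, but if you prefer to create one programmatically, a minimal sketch with the google-cloud-texttospeech package could look like the following (this package choice and configuration are my assumption, not what the original test used; the output is LINEAR16/WAV rather than flac, so adjust the encoding in RecognitionConfig accordingly):

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# Synthesize the test sentence used in this reproduction.
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(
        text="in this lecture we'll talk about the Burrows wheeler transform and the FM index"
    ),
    voice=texttospeech.VoiceSelectionParams(language_code="en-US"),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.LINEAR16,
        sample_rate_hertz=24000,
    ),
)

# Write the synthesized speech to a local file for the recognition test.
with open("audio.wav", "wb") as out:
    out.write(response.audio_content)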


Finally, I would like to add some notes about the improvement and the code I executed:

  • I used a flac audio file, as it is recommended for optimal results.

  • I used model="phone_call" and use_enhanced=True because that was the model Cloud Speech-To-Text detected for my own audio file. Also, the enhanced model can provide better results; see the documentation for more details. Note that this configuration might not suit your audio file.

  • Consider enabling data logging to Google to collect data from your audio transcription requests; Google then uses this data to improve its machine learning models used for recognizing speech audio.

  • Once the custom class and the phrase set are created, you can update them and run quick tests using the Speech-to-Text UI.

  • I used the boost parameter in the phrase set. When you use boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data; the higher the value, the more likely Speech-to-Text is to choose that word or phrase from the possible alternatives. The sketch below shows boost applied to your original speech_contexts approach.
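
For reference, boost can also be applied directly to your original speech_contexts approach; a minimal sketch, assuming the v1p1beta1 client (where SpeechContext carries a boost field) and an illustrative boost value:

from google.cloud import speech_v1p1beta1 as speech

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED,
    sample_rate_hertz=24000,
    language_code="en-US",
    enable_word_time_offsets=True,
    speech_contexts=[
        # A higher boost makes these phrases more likely to win over "burrows".
        speech.SpeechContext(
            phrases=["barrows", "barrows wheeler", "barrows wheeler transform"],
            boost=15.0,
        )
    ],
)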

I hope this information helps you improve your recognition results.