How to identify the speaker with the Python SDK when using the Azure Cognitive Services speech translation API?

I am trying to do speech-to-speech translation using a modified version of the event-based synthesis code sample provided in the Azure documentation. However, I also want to identify the speakers (speaker1, speaker2) in the process, and I don't see a function in the Python SDK that would let me identify the speaker as part of the speech-to-text translation. Can anyone suggest a way to identify speakers during speech-to-text translation? The code snippet is below:

import time
import azure.cognitiveservices.speech as speechsdk

# speech_key, service_region, from_language, to_language and filename are
# assumed to be defined earlier in the script.
def translate_speech_to_text():

    translation_config = speechsdk.translation.SpeechTranslationConfig(subscription=speech_key, region=service_region)
    translation_config.speech_recognition_language = from_language
    translation_config.add_target_language(to_language)
    translation_config.voice_name = "en-GB-Susan"

    translation_config.request_word_level_timestamps()
    translation_config.output_format = speechsdk.OutputFormat(0)

    audio_input = speechsdk.AudioConfig(filename=filename)
    recognizer = speechsdk.translation.TranslationRecognizer(translation_config = translation_config, audio_config = audio_input)

    done = False

    def stop_cb(evt):
        """callback that stops continuous recognition upon receiving an event `evt`"""
        #print('CLOSING on {}'.format(evt))
        recognizer.stop_continuous_recognition()
        nonlocal done
        done = True

    all_results = []
    def handle_final_result(evt):
        #all_results.append(evt.result.text)
        #all_results.append(evt.result.translations['en'])
        all_results.append(evt.result.json)
    
    recognizer.recognized.connect(handle_final_result)
    # Connect callbacks to the events fired by the speech recognizer
    recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
    recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
    recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    #recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))
    recognizer.session_stopped.connect(stop_cb)
    recognizer.canceled.connect(stop_cb)
    
    def synthesis_callback(evt):
        print('Audio: {}'.format(len(evt.result.audio)))
        print('Reason: {}'.format(evt.result.reason))
        # Append each synthesized audio chunk; opening with 'wb' here would
        # overwrite the file on every callback and keep only the last chunk.
        with open('out.wav', 'ab') as wavfile:
            wavfile.write(evt.result.audio)
   
    recognizer.synthesizing.connect(synthesis_callback)
    recognizer.start_continuous_recognition()    

    while not done:
        time.sleep(.5)
    
    print("Printing all results:")
    print(all_results)

translate_speech_to_text()

If you want to identify speakers, you should use the Speaker Recognition capability of the Speech service.

The REST API is recommended:

Text Independent - Identify Single Speaker

The Speech service has full SDK support for Speaker Recognition in C#, C++, and JavaScript, as well as through REST. (I searched the Python SDK and did not find a method that can be used directly for speaker identification.)
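
As a rough sketch of the REST approach, the following Python code calls the text-independent Speaker Recognition endpoints with the requests library: it creates a profile per speaker, enrolls each one with a sample recording, and then identifies which enrolled speaker is talking in an unknown recording. The endpoint paths, query parameters and response fields below follow the v2.0 "Identify Single Speaker" reference and may change between API versions, so verify them against the current documentation; the key, region and file names are placeholders.

import requests

# Placeholders: substitute your own key, region and WAV files (16 kHz, 16-bit, mono).
speech_key = "<your-speech-key>"
service_region = "<your-region>"  # e.g. "westus"
base_url = "https://{}.api.cognitive.microsoft.com/speaker/identification/v2.0/text-independent".format(service_region)
headers = {"Ocp-Apim-Subscription-Key": speech_key}

def create_profile(locale="en-us"):
    # Create a speaker profile and return its profileId.
    resp = requests.post(base_url + "/profiles",
                         headers={**headers, "Content-Type": "application/json"},
                         json={"locale": locale})
    resp.raise_for_status()
    return resp.json()["profileId"]

def enroll_profile(profile_id, wav_path):
    # Send enrollment audio for the profile so the service can learn its voice.
    with open(wav_path, "rb") as f:
        resp = requests.post(base_url + "/profiles/{}/enrollments".format(profile_id),
                             headers={**headers, "Content-Type": "audio/wav"},
                             data=f.read())
    resp.raise_for_status()
    return resp.json()

def identify_speaker(profile_ids, wav_path):
    # Ask the service which of the enrolled profiles is speaking in the audio.
    with open(wav_path, "rb") as f:
        resp = requests.post(base_url + "/profiles/identifySingleSpeaker",
                             headers={**headers, "Content-Type": "audio/wav"},
                             params={"profileIds": ",".join(profile_ids)},
                             data=f.read())
    resp.raise_for_status()
    return resp.json()

speaker1 = create_profile()
speaker2 = create_profile()
enroll_profile(speaker1, "speaker1_enroll.wav")
enroll_profile(speaker2, "speaker2_enroll.wav")
print(identify_speaker([speaker1, speaker2], "unknown_speaker.wav"))

One way to combine this with the translation code above is to run identification on the same audio segments you feed to the TranslationRecognizer and attach the identified profile to each translated result.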

Suggestion

1. It is recommended to read the Speech documentation carefully to learn how to use this service.