Translate audio from speaker output in Python with the Azure SDK

I'm building an application that will let me translate, live, any audio coming out of the speakers. That way I could translate any video conference from any streaming application (YouTube, Teams, Zoom, etc.). I'm not far from a solution, but not quite there yet.

The source language will be fr-CA or en-US. The destination language will be fr-CA or en-US.

I'm able to get the audio stream from the speakers using a custom version of pyaudio that allows loopback through Windows' WASAPI (https://github.com/intxcc/pyaudio_portaudio).
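A side note on the code further down: input_device_index=5 is machine specific. A minimal sketch to list the available devices and pick the right loopback index, using the standard pyaudio enumeration API:

import pyaudio

p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    print(i, info['name'], info['maxInputChannels'])
p.terminate()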

The next step is to pipe that stream in real time into the Azure translation API via the speechsdk.

So far, the part that grabs the stream from the speakers works fine, but when I plug it into Azure I don't get any error, yet it doesn't return any result either. In fact, every 30 seconds I receive a reason=ResultReason.NoMatch or some meaningless text.

My first thought is that the byte stream coming from the speakers is 48 kHz with 2 channels, and that the Azure stream doesn't support this. (I think I read somewhere online that it only supports 16 kHz and 1 channel, but I'm not sure.) If that's the case, I have found a way to reduce the two channels to one, but I don't know how to downsample a chunk of bytes from 48 kHz to 16 kHz in real time.
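For reference, a minimal sketch of that channel split, assuming 16-bit interleaved stereo frames (the keep_left_channel helper is my naming, not from the original code; it uses the numpy import already present below):

import numpy as np

def keep_left_channel(frame: bytes) -> bytes:
    """Keep only the left channel of 16-bit interleaved stereo PCM."""
    samples = np.frombuffer(frame, dtype=np.int16)
    return samples[0::2].tobytes()  # even indices are the left channel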

Any help would be greatly appreciated! Thanks. Here is my code:

import time
import azure.cognitiveservices.speech as speechsdk
import pyaudio
import numpy as np
speech_key, service_region = "", "westus"
finalResultSRC = ""
finalResultDst = ""

RATE = 48000
CHUNK = RATE  # read one second of audio at a time


def translation_continuous():
    """performs continuous speech translation from input from an audio file"""
    # <TranslationContinuous>
    # set up translation parameters: source language and target languages
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region,
        speech_recognition_language='fr-CA')

    # setup the audio stream
    audioFormat = speechsdk.audio.AudioStreamFormat(
        samples_per_second=KHz_RATE, bits_per_sample=16, channels=2)
    stream = speechsdk.audio.PushAudioInputStream(audioFormat)

    translation_config.add_target_language("en-US")
    stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=stream)

    # Creates a translation recognizer using the push stream as input.
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_config)

    def result_callback(event_type, evt):
        """callback to display a translation result"""
        # print("{}: {}\n\tTranslations: {}\n\tResult Json: {}".format(
        # event_type, evt, evt.result.translations.items(), evt.result.json))
        print(evt)
        if event_type == "RECOGNIZING":
            # Translate
            print(list(evt.result.translations.items())[0][1])
            # Original
            # print(type(evt.result.json))

    done = False

    def stop_cb(evt):
        """callback that signals to stop continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        nonlocal done
        done = True

    # connect callback functions to the events fired by the recognizer
    recognizer.session_started.connect(
        lambda evt: print('SESSION STARTED: {}'.format(evt)))
    recognizer.session_stopped.connect(
        lambda evt: print('SESSION STOPPED {}'.format(evt)))
    # event for intermediate results
    recognizer.recognizing.connect(
        lambda evt: result_callback('RECOGNIZING', evt))
    # event for final result
    recognizer.recognized.connect(
        lambda evt: result_callback('RECOGNIZED', evt))
    # cancellation event
    recognizer.canceled.connect(lambda evt: print(
        'CANCELED: {} ({})'.format(evt, evt.reason)))

    # stop continuous recognition on either session stopped or canceled events
    recognizer.session_stopped.connect(stop_cb)
    recognizer.canceled.connect(stop_cb)

    def synthesis_callback(evt):
        """
        callback for the synthesis event
        """
        print('SYNTHESIZING {}\n\treceived {} bytes of audio. Reason: {}'.format(
            evt, len(evt.result.audio), evt.result.reason))

    # connect callback to the synthesis event
    recognizer.synthesizing.connect(synthesis_callback)

    # start translation
    recognizer.start_continuous_recognition()
    # start pushing data until all data has been read from the file
    try:
        p = pyaudio.PyAudio()
        pstream = p.open(
            format=pyaudio.paInt16,
            channels=2, rate=RATE,
            input=True, frames_per_buffer=CHUNK,
            input_device_index=5,
            as_loopback=True
        )
        while not done:
            frame = pstream.read(CHUNK)
            if frame:
                print('got frame from speakers')
                stream.write(frame)
            time.sleep(1)

    finally:
        # stop recognition and clean up
        stream.close()
        recognizer.stop_continuous_recognition()

    print(finalResultSRC)
    # recognizer.stop_continuous_recognition()
    # </TranslationContinuous>


translation_continuous()

I found a working solution. I did have to downsample the rate to 16000 Hz and use a single channel. I based my code on this solution, but working with stream chunks instead of reading from a file.

My function is:

import audioop

def downsampleFrames(data, inrate=48000, outrate=16000, inchannels=2, outchannels=1):
    try:
        # ratecv returns (fragment, state); keep only the converted fragment here
        converted, _ = audioop.ratecv(data, 2, inchannels, inrate, outrate, None)
        if outchannels == 1:
            converted = audioop.tomono(converted, 2, 1, 0)
    except Exception:
        print('Failed to downsample')
        return False

    return converted
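One caveat when converting chunk by chunk: audioop.ratecv also returns a state object that is meant to be passed back into the next call; handing it None every time, as above, resets the converter at each chunk boundary, which can cause small glitches at the seams. A minimal stateful variant, as a sketch (the ChunkDownsampler name is mine, not part of the original answer):

import audioop

class ChunkDownsampler:
    """Stateful 48 kHz stereo -> 16 kHz mono converter for streamed chunks."""
    def __init__(self, inrate=48000, outrate=16000):
        self.inrate = inrate
        self.outrate = outrate
        self.state = None  # ratecv state carried across chunks

    def convert(self, data: bytes) -> bytes:
        mono = audioop.tomono(data, 2, 1, 0)  # keep the left channel only
        converted, self.state = audioop.ratecv(
            mono, 2, 1, self.inrate, self.outrate, self.state)
        return converted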

And I send the chunks of data from pyaudio like this:

p = pyaudio.PyAudio()
pstream = p.open(
    format=pyaudio.paInt16,
    channels=2, rate=RATE,
    input=True, frames_per_buffer=CHUNK,
    input_device_index=5,
    as_loopback=True
)
while True:
    frame = pstream.read(CHUNK)
    if frame:
        downFrame = downsampleFrames(frame)
        if downFrame:  # downsampleFrames returns False on failure
            stream.write(downFrame)
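For completeness: the no-argument PushAudioInputStream() falls back to the SDK's default stream format, which as far as I know is 16 kHz, 16-bit, mono PCM, so the downsampled frames match it. The format can also be declared explicitly; a minimal sketch:

import azure.cognitiveservices.speech as speechsdk

# 16 kHz / 16-bit / mono, matching the output of downsampleFrames
audio_format = speechsdk.audio.AudioStreamFormat(
    samples_per_second=16000, bits_per_sample=16, channels=1)
stream = speechsdk.audio.PushAudioInputStream(stream_format=audio_format)
audio_config = speechsdk.audio.AudioConfig(stream=stream)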