Translate audio from speaker output in Python with the Azure Speech SDK
I'm working on an application that would let me translate, live, any audio coming out of the speakers. That way I could translate any video conference from any streaming app (YouTube, Teams, Zoom, etc.). I'm not far from a solution, but not quite there yet.

Src language will be: fr-CA or en-US
Dst language will be: fr-CA or en-US
I'm able to get the audio stream from the speakers using a custom build of pyaudio that allows loopback through Windows WASAPI (https://github.com/intxcc/pyaudio_portaudio).
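To find the right input_device_index for the loopback endpoint (hardcoded to 5 in the code below), the available devices can be listed with standard PyAudio calls; a minimal sketch:

import pyaudio

# enumerate audio devices to locate the WASAPI loopback endpoint;
# its index is what goes into input_device_index below
p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    print(i, info['name'], info['maxInputChannels'])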
The next step is to feed that stream, in real time, to the Azure translation API through the Speech SDK (speechsdk).
So far, the part that grabs the stream from the speakers works fine, but once I plug it into Azure I get no errors, yet it doesn't return any results either. In fact, about every 30 seconds I receive a reason=ResultReason.NoMatch or a snippet of meaningless text.
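When one of those results comes in, the reason can at least be inspected. A minimal debugging sketch, assuming the SDK's standard no_match_details property on the result:

import azure.cognitiveservices.speech as speechsdk

def inspect_result(evt):
    """Debug helper: report why a recognition produced no text."""
    result = evt.result
    if result.reason == speechsdk.ResultReason.NoMatch:
        # no_match_details carries a NoMatchReason such as InitialSilenceTimeout
        print('NOMATCH: {}'.format(result.no_match_details))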
My first thought is that the byte stream coming from the speakers is 48 kHz with 2 channels, and that the Azure stream doesn't support that. (I think I read somewhere online that it only supports 16 kHz, 1 channel, but I'm not sure.) If that's the case, I've found a way to split the two channels down to one, but I don't know how to downsample a chunk of bytes from 48 kHz to 16 kHz in real time.
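For reference, a rough sketch of the channel splitting mentioned above (this would be the cutChannelFromStream helper referenced in a comment in the code below; it slices interleaved 16-bit PCM with numpy):

import numpy as np

def cutChannelFromStream(frame, channel, channels):
    """Keep a single channel from interleaved 16-bit PCM bytes."""
    samples = np.frombuffer(frame, dtype=np.int16)
    return samples[channel::channels].tobytes()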
Any help would be greatly appreciated! Thanks. Here is my code:
import time
import azure.cognitiveservices.speech as speechsdk
import pyaudio
import numpy as np

speech_key, service_region = "", "westus"

finalResultSRC = ""
finalResultDst = ""

RATE = 48000
KHz_RATE = int(RATE / 1000)
CHUNK = int(RATE)


def translation_continuous():
    """performs continuous speech translation from the speaker audio stream"""
    # <TranslationContinuous>
    # set up translation parameters: source language and target languages
    translation_config = speechsdk.translation.SpeechTranslationConfig(
        subscription=speech_key, region=service_region,
        speech_recognition_language='fr-CA')

    # set up the audio stream
    audioFormat = speechsdk.audio.AudioStreamFormat(
        samples_per_second=KHz_RATE, bits_per_sample=16, channels=2)
    stream = speechsdk.audio.PushAudioInputStream(audioFormat)

    translation_config.add_target_language("en-US")

    stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=stream)

    # Creates a translation recognizer using the audio stream as input.
    recognizer = speechsdk.translation.TranslationRecognizer(
        translation_config=translation_config, audio_config=audio_config)

    def result_callback(event_type, evt):
        """callback to display a translation result"""
        # print("{}: {}\n\tTranslations: {}\n\tResult Json: {}".format(
        #     event_type, evt, evt.result.translations.items(), evt.result.json))
        print(evt)
        if event_type == "RECOGNIZING":
            # Translate
            print(evt.result.translations.items()[0][1])
            # Original
            # print(type(evt.result.json))

    done = False

    def stop_cb(evt):
        """callback that signals to stop continuous recognition upon receiving an event `evt`"""
        print('CLOSING on {}'.format(evt))
        nonlocal done
        done = True

    # connect callback functions to the events fired by the recognizer
    recognizer.session_started.connect(
        lambda evt: print('SESSION STARTED: {}'.format(evt)))
    recognizer.session_stopped.connect(
        lambda evt: print('SESSION STOPPED {}'.format(evt)))
    # event for intermediate results
    recognizer.recognizing.connect(
        lambda evt: result_callback('RECOGNIZING', evt))
    # event for final result
    recognizer.recognized.connect(
        lambda evt: result_callback('RECOGNIZED', evt))
    # cancellation event
    recognizer.canceled.connect(lambda evt: print(
        'CANCELED: {} ({})'.format(evt, evt.reason)))
    # stop continuous recognition on either session stopped or canceled events
    recognizer.session_stopped.connect(stop_cb)
    recognizer.canceled.connect(stop_cb)

    def synthesis_callback(evt):
        """callback for the synthesis event"""
        print('SYNTHESIZING {}\n\treceived {} bytes of audio. Reason: {}'.format(
            evt, len(evt.result.audio), evt.result.reason))

    # connect callback to the synthesis event
    recognizer.synthesizing.connect(synthesis_callback)

    # start translation
    recognizer.start_continuous_recognition()

    # push loopback audio into the SDK until interrupted
    try:
        p = pyaudio.PyAudio()
        pstream = p.open(
            format=pyaudio.paInt16,
            channels=2, rate=RATE,
            input=True, frames_per_buffer=CHUNK,
            input_device_index=5,
            as_loopback=True
        )
        while True:
            frame = pstream.read(CHUNK)
            if frame:
                # ch1 = cutChannelFromStream(frame, 1, 2)
                print('got frame from speakers')
                stream.write(frame)
            time.sleep(1)
    finally:
        # stop recognition and clean up
        stream.close()
        recognizer.stop_continuous_recognition()
        print(finalResultSRC)
    # </TranslationContinuous>


translation_continuous()
I found a working solution. I did have to downsample to 16000 Hz and use a single channel. I based my code on this solution, but feeding it stream chunks instead of reading from a file.

My function is:
import audioop

def downsampleFrames(data, inrate=48000, outrate=16000, inchannels=2, outchannels=1):
    try:
        # ratecv returns a (fragment, state) tuple; the fragment is taken below
        converted = audioop.ratecv(data, 2, inchannels, inrate, outrate, None)
        if outchannels == 1:
            converted = audioop.tomono(converted[0], 2, 1, 0)
    except:
        print('Failed to downsample')
        return False
    return converted
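For the downsampled bytes to be interpreted correctly, the push stream presumably also needs to be created with a matching 16 kHz, mono, 16-bit format. A sketch of that setup, using the SDK's standard parameter names:

import azure.cognitiveservices.speech as speechsdk

# declare the post-downsampling format so the SDK reads the bytes correctly
audio_format = speechsdk.audio.AudioStreamFormat(
    samples_per_second=16000, bits_per_sample=16, channels=1)
stream = speechsdk.audio.PushAudioInputStream(stream_format=audio_format)
audio_config = speechsdk.audio.AudioConfig(stream=stream)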
And I send the chunk of data from pyaudio like this:
p = pyaudio.PyAudio()
pstream = p.open(
    format=pyaudio.paInt16,
    channels=2, rate=RATE,
    input=True, frames_per_buffer=CHUNK,
    input_device_index=5,
    as_loopback=True
)
while True:
    frame = pstream.read(CHUNK)
    if frame:
        downFrame = downsampleFrames(frame)
        stream.write(downFrame)
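One refinement worth noting: audioop.ratecv returns a state object that can be passed back in on the next call, so passing None on every chunk resets the converter at each chunk boundary. A stateful variation of the loop above (my own tweak, reusing RATE, CHUNK, pstream and stream from the surrounding code):

import audioop

state = None
while True:
    frame = pstream.read(CHUNK)
    if frame:
        # stereo -> mono first, then resample 48 kHz -> 16 kHz,
        # carrying ratecv's state across chunk boundaries
        mono = audioop.tomono(frame, 2, 1, 0)
        converted, state = audioop.ratecv(mono, 2, 1, RATE, 16000, state)
        stream.write(converted)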