Sending audio data generated by python-sounddevice.RawInputStream to Google Cloud Speech-to-Text for asynchronous recognition

Question

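I'm writing a script that sends data from a microphone to the Google Cloud Speech-to-Text API. I need to use the gRPC API to produce a live readout while recording. Once the recording is finished, I need to use the REST API for more accurate asynchronous recognition.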

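The live streaming part works. It's based on the quickstart sample, but uses python-sounddevice instead of pyAudio. The stream below puts cffi_backend_buffer objects into a queue; a separate thread collects these objects, converts them to bytes, and feeds them to the API.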

import queue

import sounddevice

class MicrophoneStream:
    def __init__(self, rate, blocksize, queue_live, queue):
        self.queue = queue
        self.queue_live = queue_live
        self._audio_stream = sounddevice.RawInputStream(
            samplerate = rate,
            dtype='int16',
            callback = self.callback,
            blocksize = blocksize,
            channels = 1,
            )

    def __enter__(self):
        self._audio_stream.start()
        return self

    def stop(self):
        self._audio_stream.stop()

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop()
        self._audio_stream.close()

    def callback(self, indata, frames, time, status):
        self.queue.put(indata)
        self.queue_live.put(indata)
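
For context, the consuming side described above might look roughly like this. It's only a minimal sketch, not the quickstart's actual code: `drain_chunks` and the `None` stop sentinel are hypothetical names introduced here for illustration.

```python
import queue


def drain_chunks(q):
    """Yield queued audio chunks as bytes until a None sentinel arrives.

    Hypothetical helper: the producer is assumed to put None on the
    queue when recording stops.
    """
    while True:
        chunk = q.get()      # blocks until the callback delivers a chunk
        if chunk is None:    # stop sentinel (an assumption of this sketch)
            return
        yield bytes(chunk)   # copy the buffer contents into a bytes object


# Usage with stand-in data instead of a real microphone stream
live = queue.Queue()
for item in (b"ab", bytearray(b"cd"), None):
    live.put(item)
chunks = list(drain_chunks(live))
```

In the streaming case, each yielded chunk would then be wrapped in a request and passed to the API by the consumer thread.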

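I plan to use the second queue for asynchronous recognition after the recording is finished. However, just sending the byte string, as I do for live recognition, doesn't seem to work: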

from google.cloud import speech
    

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    max_alternatives=1)

audio_data = []
while not queue.empty():
    audio_data.append(queue.get(False))
audio_data = b"".join(audio_data)

audio = speech.RecognitionAudio(content=audio_data)

response = client.recognize(config=config, audio=audio)

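Since sending a byte string of raw audio data works for streaming recognition, I assume there's nothing wrong with the raw data or the recognition config. Maybe there's something more to it? I know that if I read the binary data from a *.wav file and send that instead of audio_data, recognition works. How do I convert raw audio data to PCM WAV so that I can send it to the API?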

Answer 1

It turns out there were two things wrong with this code.

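First, it looks like the cffi_backend_buffer objects I was putting into my queues behave like pointers to a certain area of memory. If I access them right away, as I do in streaming recognition, it works fine. However, if I collect them in a queue for later use, the buffers they point to get overwritten. The solution is to put byte strings into the queues instead: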

    def callback(self, indata, frames, time, status):
        self.queue.put(bytes(indata))
        self.queue_live.put(bytes(indata))
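
The effect can be reproduced without any audio hardware. The following is a minimal sketch, assuming a reused `bytearray` behaves enough like the backend's input buffer for illustration (the real object is a cffi buffer, and `shared`, `views` and `copies` are names made up for this example):

```python
import queue

# Stand-in for the backend's reusable input buffer (an assumption for
# illustration; the real buffer is owned by the audio backend).
shared = bytearray(4)

views = queue.Queue()   # stores references to the shared memory
copies = queue.Queue()  # stores independent bytes copies

for block in (b"\x01\x01\x01\x01", b"\x02\x02\x02\x02"):
    shared[:] = block            # the next block overwrites the same memory
    view = memoryview(shared)
    views.put(view)              # like queue.put(indata)
    copies.put(bytes(view))      # like queue.put(bytes(indata))

# Read everything back only after "recording" has finished
stale = [bytes(views.get()) for _ in range(2)]
kept = [copies.get() for _ in range(2)]
```

Here every entry in `stale` reflects only the last block written, while `kept` preserves both blocks in order.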

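Second, asynchronous recognition requires a PCM WAV file with headers. Obviously, my raw audio data doesn't have any. The solution is to write the data to a *.wav file (here an in-memory one), which I did like this: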

import io
import wave

from google.cloud import speech
    

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    max_alternatives=1)

# Collect raw audio data
audio_data = []
while not queue.empty():
    audio_data.append(queue.get(False))
audio_data = b"".join(audio_data)

# Convert to a PCM WAV file with headers
file = io.BytesIO()
with wave.open(file, mode='wb') as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(audio_data)
file.seek(0)

audio = speech.RecognitionAudio(content=file.read())

response = client.recognize(config=config, audio=audio)
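
The conversion step can also be pulled out into a small helper and checked on its own, without calling the API. This is a minimal sketch using only the standard library; `pcm_to_wav` is a hypothetical name, and the defaults assume the 16 kHz, mono, 16-bit settings used above.

```python
import io
import wave


def pcm_to_wav(pcm, rate=16000, channels=1, sampwidth=2):
    """Wrap raw PCM bytes in a WAV container and return the file bytes.

    Hypothetical helper; the defaults mirror the recording settings
    used in this post (16 kHz, mono, int16).
    """
    buf = io.BytesIO()
    with wave.open(buf, mode="wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sampwidth)   # 2 bytes per sample for int16
        w.setframerate(rate)
        w.writeframes(pcm)
    return buf.getvalue()


# One second of silence as stand-in audio
wav_bytes = pcm_to_wav(b"\x00\x00" * 16000)

# Read the result back to confirm the header matches the settings
with wave.open(io.BytesIO(wav_bytes), mode="rb") as r:
    params = (r.getnchannels(), r.getsampwidth(),
              r.getframerate(), r.getnframes())
```

The returned bytes could then be passed as `content` in place of `file.read()` in the code above.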


python

python-sounddevice

google-cloud-speech