Bing Speech Recognition Service - SpeechClient issue "Audio format could not be parsed."

Question

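We are currently evaluating the Bing Speech Recognition Service for a live-streaming scenario. We receive a real-time stream of PCM-encoded audio (16 kHz sample rate, 16-bit, 1 channel, i.e. mono) and are trying to send it to the Bing Speech Recognition Service.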

We successfully used the DataRecognitionClient from https://www.nuget.org/packages/Microsoft.ProjectOxford.SpeechRecognition-x64/ for our scenario by sending the audio format before streaming the audio itself, like this:

_dataRecognitionClient.SendAudioFormat(SpeechAudioFormat.create16BitPCMFormat(16000));

We then loop over the audio stream and send each buffer, like this:

_dataRecognitionClient.SendAudio(buffer, bytesRead);

This works well. However, we assume the ProjectOxford library may be deprecated, because the official Bing Speech Recognition site (https://www.microsoft.com/cognitive-services/en-us/Speech-api/documentation/GetStarted/GetStartedCSharpServiceLibrary) points to a different NuGet package, see: https://www.nuget.org/packages/Microsoft.Bing.Speech/

When we use the SpeechClient from this package, we get the "Audio format could not be parsed" error mentioned in the title when calling RecognizeAsync on the SpeechClient.

var speechInput = new SpeechInput(
    producerConsumerStream,
    new RequestMetadata(
        Guid.NewGuid(),
        new DeviceMetadata(DeviceType.Near, DeviceFamily.Desktop, NetworkType.Ethernet,
            OsName.Windows, "Azure", "Microsoft", "Current"),
        new ApplicationMetadata("App", "1.0"),
        "Speech"));
await _speechClient.RecognizeAsync(speechInput, new CancellationToken());

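The last line throws the error. We assume this is because our PCM stream has no WAVE/RIFF header, since it is a live stream rather than a file. For streaming scenarios, DataRecognitionClient has the "SendAudioFormat" method.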
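
To check that assumption, one could inspect the first bytes that reach the stream. The following is only a minimal sketch: `WavSniffer` and `HasRiffWaveHeader` are hypothetical names, not part of either NuGet package, and the check only looks for the "RIFF" and "WAVE" markers of a canonical WAV file.

```csharp
using System;
using System.Text;

// Hypothetical diagnostic helper; not part of Microsoft.Bing.Speech.
public static class WavSniffer
{
    // Returns true when the buffer starts with "RIFF", four size bytes, then "WAVE".
    public static bool HasRiffWaveHeader(byte[] buffer, int count)
    {
        if (buffer == null || count < 12 || buffer.Length < 12)
        {
            return false;
        }

        return Encoding.ASCII.GetString(buffer, 0, 4) == "RIFF"
            && Encoding.ASCII.GetString(buffer, 8, 4) == "WAVE";
    }
}
```

For raw PCM like ours, this is expected to return false for the first buffer, which would be consistent with the service being unable to parse the format.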

Does SpeechClient not support streaming scenarios?

Answer 1

Answering my own question. We solved this by prepending a WAVE header to the stream, with a fake total sample count (i.e. length) in it, see: Create valid wav file header for streams in memory
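
A minimal sketch of what such a header could look like for our format (16 kHz, 16-bit, mono). This is not the exact code from the linked answer: `WavHeader`, `Create` and `PlaceholderDataLength` are hypothetical names, and the placeholder value 0x7FFFFFFF is an assumption, since the real length of a live stream is not known up front.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper; not part of Microsoft.Bing.Speech.
public static class WavHeader
{
    // Assumed placeholder for the unknown length of a live stream.
    public const uint PlaceholderDataLength = 0x7FFFFFFF;

    // Builds the 44-byte canonical RIFF/WAVE header for uncompressed PCM.
    public static byte[] Create(int sampleRate, short bitsPerSample, short channels,
                                uint dataLength = PlaceholderDataLength)
    {
        short blockAlign = (short)(channels * bitsPerSample / 8); // bytes per sample frame
        int byteRate = sampleRate * blockAlign;                   // bytes per second

        using (var ms = new MemoryStream(44))
        using (var writer = new BinaryWriter(ms, Encoding.ASCII)) // writes little-endian
        {
            writer.Write(Encoding.ASCII.GetBytes("RIFF"));
            writer.Write(36u + dataLength);      // RIFF chunk size: rest of header + data
            writer.Write(Encoding.ASCII.GetBytes("WAVE"));
            writer.Write(Encoding.ASCII.GetBytes("fmt "));
            writer.Write(16);                    // size of the fmt chunk for PCM
            writer.Write((short)1);              // audio format 1 = PCM
            writer.Write(channels);
            writer.Write(sampleRate);
            writer.Write(byteRate);
            writer.Write(blockAlign);
            writer.Write(bitsPerSample);
            writer.Write(Encoding.ASCII.GetBytes("data"));
            writer.Write(dataLength);            // fake data length
            writer.Flush();
            return ms.ToArray();
        }
    }
}
```

Assuming the stream passed to SpeechInput is writable, the bytes returned by `WavHeader.Create(16000, 16, 1)` would be written to it once, before the first PCM buffer, so that the service sees a header at the very start of the audio.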

c#

audio

speech-recognition

bing

pcm