How to perform continuous speech to text on a WebRTC communication audio stream in a mobile app

I am trying to add continuous speech-to-text recognition to a mobile application during a WebRTC audio-only call.

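I use React Native on the mobile side, with the react-native-webrtc module and a custom web API for the signaling part. I control the web API, so I can add the feature on its side if that is the only solution, but I would prefer to perform it on the client side to avoid consuming bandwidth when there is no need.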

First, I worked on and tested some ideas with my laptop browser. My first idea was to use the SpeechRecognition interface from the Web Speech API: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition

I merged the audio-only WebRTC demo with the audio visualizer demonstration in one page, but there I did not find how to connect a mediaElementSourceNode (created via AudioContext.createMediaElementSource(remoteStream) at line 44 of streamvisualizer.js) to the Web Speech API SpeechRecognition class. In the Mozilla documentation, the audio stream seems to come with the constructor of the class, which may call the getUserMedia() API.

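Second, during my research I found two open source speech-to-text engines: cmusphinx and Mozilla's deep-speech. The first one has a JS binding and seems great with the audioRecorder, which I can feed with my own mediaElementSourceNode from the first try. However, how do I embed this in my React Native application?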

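There are also native Android and iOS WebRTC modules, which I may be able to connect to the cmusphinx platform-specific bindings (iOS, Android), but I do not know about interoperability between native classes. Can you help me with that?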

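I have not yet created any "grammar" or defined "hot-words" because I am not sure about the technologies involved, but I can do that later if I am able to connect a speech recognition engine to my audio stream.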

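You need to stream the audio to an ASR server, either by adding another WebRTC party to the call or through some other protocol (TCP, WebSocket, etc.). You perform the recognition on the server and send the results back.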
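As a rough illustration of the second option, here is a minimal TypeScript sketch of the client-side step that prepares audio for the ASR server. It assumes you can already get the call audio as floating-point samples in the range -1..1 (how you obtain them from react-native-webrtc is not covered here), and that the server accepts raw 16-bit PCM frames. The names `floatTo16BitPCM`, `streamToAsr`, and the `send` callback are hypothetical; in practice `send` would wrap whatever transport you choose, for example a WebSocket connection.

```typescript
// Hypothetical callback that forwards one frame to the ASR server
// (for example, a thin wrapper around a WebSocket's send method).
type SendFn = (frame: Int16Array) => void;

// Convert float samples (range -1..1) to 16-bit signed PCM.
// Out-of-range values are clamped before scaling.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  return Int16Array.from(samples, (value) => {
    const s = Math.max(-1, Math.min(1, value));
    return s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  });
}

// Split the converted audio into fixed-size frames and hand each one to `send`.
// Returns the number of frames sent. The last frame may be shorter.
function streamToAsr(samples: Float32Array, frameSize: number, send: SendFn): number {
  const pcm = floatTo16BitPCM(samples);
  let sent = 0;
  for (let start = 0; start < pcm.length; start += frameSize) {
    // slice() copies, so each frame owns its own buffer.
    send(pcm.slice(start, start + frameSize));
    sent++;
  }
  return sent;
}
```

The sample rate and frame size your server expects depend on the engine and model it runs, so treat those as configuration rather than fixed values.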

First, I worked on and tested some ideas with my laptop browser. My first idea was to use the SpeechRecognition interface from the Web Speech API: https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition

This is experimental and does not really work in Firefox. In Chrome it only takes microphone input directly, not the dual stream from caller and callee.
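For reference, this is roughly how continuous recognition is configured with that interface in a browser. It is only a sketch: the `RecognitionLike` type and the `createContinuousRecognizer` helper are hypothetical names introduced here so the example compiles without DOM typings, and the language code is an assumption. Note that nothing in this setup accepts the remote WebRTC stream as input, which is the limitation described above.

```typescript
// Minimal structural types so the sketch compiles without the DOM typings.
interface RecognitionLike {
  continuous: boolean;
  interimResults: boolean;
  lang: string;
  onresult: ((event: any) => void) | null;
  start(): void;
}
type RecognitionCtor = new () => RecognitionLike;
interface RecognitionScope {
  SpeechRecognition?: RecognitionCtor;
  webkitSpeechRecognition?: RecognitionCtor;
}

// Returns a configured recognizer, or null when the interface is unavailable.
// `scope` would be `window` in a browser; Chrome exposes the prefixed name.
function createContinuousRecognizer(
  scope: RecognitionScope,
  onText: (text: string, isFinal: boolean) => void
): RecognitionLike | null {
  const Ctor = scope.SpeechRecognition || scope.webkitSpeechRecognition;
  if (!Ctor) return null;
  const rec = new Ctor();
  rec.continuous = true;      // keep listening after the first result
  rec.interimResults = true;  // also report partial hypotheses
  rec.lang = "en-US";         // assumed language, adjust as needed
  rec.onresult = (event) => {
    // Only walk the results that changed since the previous event.
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const result = event.results[i];
      onText(result[0].transcript, result.isFinal);
    }
  };
  return rec;
}
```

In a browser you would call `createContinuousRecognizer(window as any, handler)` and then `start()` on the returned object if it is not null.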

The first one has a JS binding and seems great with the audioRecorder, which I can feed with my own mediaElementSourceNode from the first try.

You will not be able to run this as local recognition inside your React Native app.

speech-to-text

cmusphinx

webrtc

react-native

web-audio-api