Real-time speech recognition from a phone call recorded with Twilio

Question

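I'm currently using Twilio for phone calls, and I want to add a speech recognition element so that my backend can take a specific action when the user says a certain phrase. If you're familiar with Twilio, it's similar to the Gather verb. It needs to be real-time, because if recognition goes wrong, the user will be prompted for clarification.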

Answer 1

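I don't believe there's anything that does this in real time. However, you could use a recording, pass the recording to another service (IBM's Watson Speech to Text comes to mind), and process it from there. With the right workflow it should be able to do this relatively quickly. I have never used Watson myself, only seen it used, so I'm not sure how long processing a recording takes. I would expect one- or two-word commands to be handled quickly.

Sorry I can't provide more guidance. Others in the community may have another approach.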
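
A minimal sketch of the record-then-transcribe workflow described above, not a tested integration. It assumes Twilio's Record verb posts a `RecordingUrl` parameter to the action URL; the command phrases and the `transcribe` callable (a stand-in for whichever speech-to-text service you pick) are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Optional

# Hypothetical command phrases; replace with the ones your backend acts on.
COMMANDS = ("cancel order", "check balance")


def record_twiml(action_url: str) -> str:
    """Build TwiML that records a short utterance and posts it to action_url."""
    response = ET.Element("Response")
    ET.SubElement(response, "Say").text = "Please say a command after the beep."
    ET.SubElement(
        response,
        "Record",
        action=action_url,
        maxLength="5",    # short commands only, to keep the round trip fast
        timeout="2",      # seconds of silence that end the recording
        playBeep="true",
    )
    return ET.tostring(response, encoding="unicode")


def match_command(params: dict, transcribe: Callable[[str], str]) -> Optional[str]:
    """Transcribe the posted recording and return the first matching command.

    `transcribe` is an assumption: it takes the recording URL and returns
    the transcript text from whatever speech-to-text service is used.
    """
    recording_url = params.get("RecordingUrl")
    if not recording_url:
        return None
    transcript = transcribe(recording_url).lower()
    for command in COMMANDS:
        if command in transcript:
            return command
    return None
```

Keeping `maxLength` and `timeout` small is what makes this feel close to real time; the actual latency depends on the transcription service.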

Answer 2

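The IBM Watson Speech to Text service (STT) has this feature, called keyword spotting (https://www.ibm.com/watson/developercloud/doc/speech-to-text/output.shtml). Watson STT lets you push a real-time stream of the phone audio, and it not only produces recognition hypotheses but also detects whether the user said a sentence or command that you specified in advance. There is actually a demo that shows this feature, give it a try: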

https://speech-to-text-demo.mybluemix.net/
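
As a rough sketch of how keyword spotting might be wired up, the helpers below build the JSON start message for a streaming session and pull spotted keywords out of a result message. The field names (`keywords`, `keywords_threshold`, `keywords_result`) follow my understanding of the Watson STT interface and should be checked against the linked documentation; the audio format and the threshold values are assumptions.

```python
import json
from typing import Dict, List


def start_message(keywords: List[str], threshold: float = 0.5) -> str:
    """Build the JSON text that opens a recognition request with keyword spotting."""
    return json.dumps({
        "action": "start",
        "content-type": "audio/mulaw;rate=8000",  # assumed telephony audio format
        "interim_results": True,
        "keywords": keywords,
        "keywords_threshold": threshold,
    })


def spotted_keywords(message: dict, min_confidence: float = 0.5) -> Dict[str, float]:
    """Return each spotted keyword with its highest confidence in this message."""
    best: Dict[str, float] = {}
    for result in message.get("results", []):
        for keyword, hits in result.get("keywords_result", {}).items():
            for hit in hits:
                confidence = hit.get("confidence", 0.0)
                if confidence >= min_confidence and confidence > best.get(keyword, -1.0):
                    best[keyword] = confidence
    return best
```

The backend would call `spotted_keywords` on every message received from the service and trigger its action as soon as an expected phrase appears.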

Answer 3

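This is covered briefly here: https://whosebug.com/a/30224103/6189694

It looks like you would have to set the call up as a conference first, then join as a muted participant in order to listen in on it.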
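
A small sketch of the TwiML this approach implies, assuming the Conference noun's `muted` attribute is used for the listening leg. The room name is hypothetical, and how the muted leg's audio reaches a recognizer is not covered here.

```python
import xml.etree.ElementTree as ET


def conference_twiml(room: str, muted: bool = False) -> str:
    """Build TwiML that dials the caller into a named conference room."""
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    # muted="true" lets this participant hear the call without being heard.
    conference = ET.SubElement(dial, "Conference", muted="true" if muted else "false")
    conference.text = room
    return ET.tostring(response, encoding="unicode")


# The caller joins normally; the listening leg joins the same room muted.
caller_twiml = conference_twiml("call-1234")
listener_twiml = conference_twiml("call-1234", muted=True)
```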

Answer 4

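To add speech recognition to the Twilio Gather verb, add "speech" to the Gather input value, for example: input="dtmf speech". After the caller says something and goes quiet, the Twilio servers convert the speech to text, send the text to the action URL, and then wait for response instructions. Your program can use the text to respond however you choose. One option is to have your program respond with corrective instructions (the Say verb) and let the caller say more, which will again be processed by your action URL.

Twilio Gather documentation, including the speech recognition implementation: https://www.twilio.com/docs/api/twiml/gather

Sample TwiML using the Gather verb with the speech recognition input.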

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Gather input="dtmf speech" language="en-us"
          numDigits="1"
          timeout="6"
          action="http://hostname/processUserResponse.py">
        <Say voice="alice" language="en-CA">
            Okay, speech recognition test. Enter any digit or say something.
        </Say>
    </Gather>
    <Say voice="alice" language="en-CA">
        Waited too long to say something. Response canceled ....
    </Say>
</Response>
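
A hedged sketch of what the program behind the action URL might do with the posted text. It assumes Twilio sends the transcription as `SpeechResult` with a `Confidence` score; the phrase table and the confidence cutoff are hypothetical, and keypad input (`Digits`) is ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical phrases mapped to hypothetical backend action names.
PHRASES = {"check balance": "balance", "talk to an agent": "agent"}
ACTION_URL = "http://hostname/processUserResponse.py"  # same action URL as the TwiML above
MIN_CONFIDENCE = 0.5  # assumed cutoff, tune for your callers


def handle_gather(params: dict) -> str:
    """Return TwiML that confirms a recognized phrase or asks the caller to repeat."""
    speech = params.get("SpeechResult", "").strip().lower()
    try:
        confidence = float(params.get("Confidence") or 0)
    except ValueError:
        confidence = 0.0

    matched = next((name for phrase, name in PHRASES.items() if phrase in speech), None)
    response = ET.Element("Response")
    if matched and confidence >= MIN_CONFIDENCE:
        ET.SubElement(response, "Say").text = f"Okay, starting {matched}."
    else:
        # Recognition failed or was unsure: prompt again and gather once more.
        gather = ET.SubElement(
            response, "Gather",
            input="dtmf speech", numDigits="1", timeout="6", action=ACTION_URL,
        )
        ET.SubElement(gather, "Say").text = "Sorry, I did not catch that. Please say it again."
    return ET.tostring(response, encoding="unicode")
```

Because the retry response contains another Gather pointing at the same action URL, the caller stays in a clarification loop until a phrase is recognized.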

Answer 5

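C# .NET Core IVR Gather sample that uses a list of enums instead of the combined enum available in the official older C# sample, as per my comment above (I also had to convert Url.Action into this monstrosity):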

List<Gather.InputEnum> bothDtmfAndSpeech =
    new List<Gather.InputEnum>(2){
        Gather.InputEnum.Dtmf, Gather.InputEnum.Speech
    };
var gather = new Gather(
     action: new Uri(Url.Action("Show", "Menu")),
     numDigits: 1, input:bothDtmfAndSpeech, bargeIn: true);

twilio

speech-recognition

speech-to-text