Is a speech-to-text Java web app for live captions possible?

This question is about the Google Speech-to-Text API.

I want to develop a Spring Boot Java web application:

  1. The application starts on localhost
  2. I open a browser and go to http://localhost:8080
  3. The application shows a simple UI whose main window displays live captions for any English audio coming out of the laptop speakers - for example a Zoom video call in which participants are speaking, so that I hear their voices and at the same time see live captions in my local web app
  4. The live captions stay on screen in that window
  5. The live captions are also saved to a text file, with each new caption appended to it (see the sketch after this list)
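
For point 5, a minimal sketch of the appending I have in mind (the file name and helper class are just illustrative, nothing is decided yet):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    
    public class CaptionLog {
        private static final Path FILE = Paths.get("captions.txt"); // illustrative file name
    
        // Appends one finalized caption as a new line, creating the file on first use.
        public static void append(String caption) throws IOException {
            Files.write(
                FILE,
                (caption + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND);
        }
    }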

It is critical that the captions are as accurate as possible and that they appear quickly while the speaker is talking.

Is this achievable? If the Google API is not feasible, what are the alternatives?

One of the fastest and most efficient ways to convert speech to text is the Java Speech API (documentation at https://www.oracle.com/java/technologies/speech-api-frequently-asked-questions.html).

During transcription you will need to break the text into chunks. The meaning may shift slightly, because some expressions mean something different than their individual words, but chunking helps reduce the time to the final translation. You then send the fragments you have already received (words, phrases) through the API for translation.
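
A minimal sketch of that chunking step, assuming sentence boundaries are a good enough split point; the `translate` call mentioned in the usage comment is a placeholder for whichever API client you end up using, not a real library call:

    import java.util.ArrayList;
    import java.util.List;
    
    public class TranscriptChunker {
        // Splits a transcript into sentence-sized chunks so each one can be
        // sent to the translation API as soon as it is available.
        public static List<String> chunks(String transcript) {
            List<String> out = new ArrayList<>();
            for (String part : transcript.split("(?<=[.!?])\\s+")) {
                if (!part.trim().isEmpty()) {
                    out.add(part.trim());
                }
            }
            return out;
        }
    }
    
    // Hypothetical usage once you have a client:
    // for (String chunk : TranscriptChunker.chunks(transcript)) {
    //     captions.add(translate(chunk)); // 'translate' stands in for the chosen API
    // }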

You can pick a few options you like (for example from https://rapidapi.com/blog/best-translation-api/) and check which one works faster. In my experience, "Microsoft Translator Text" and "Google Translate" are the fastest. I also don't think you will get truly instant translation, but if you test a few API options, and consider whether to process whole sentences, phrases, or single words at a time, you can keep the translation time to a minimum.
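
To check which option works faster, a timing harness along these lines can help; `Translator` here is a hypothetical adapter interface you would implement once per candidate API:

    import java.util.Map;
    
    public class LatencyCheck {
        // Hypothetical adapter: wrap each candidate translation API behind this.
        public interface Translator {
            String translate(String text);
        }
    
        // Times one call per candidate on the same input. Run it many times
        // and average the results before drawing any conclusion.
        public static void compare(Map<String, Translator> candidates, String sample) {
            for (Map.Entry<String, Translator> entry : candidates.entrySet()) {
                long start = System.nanoTime();
                entry.getValue().translate(sample);
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("%s: %d ms%n", entry.getKey(), elapsedMs);
            }
        }
    }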

If I understand you correctly, IMHO, I would split this into two parts:

  1. Transcribe the speech to text with the Google API, as shown below

  2. Then overlay the captions as a stream

    import com.google.api.gax.rpc.ApiStreamObserver;
    import com.google.api.gax.rpc.BidiStreamingCallable;
    import com.google.cloud.speech.v1.RecognitionConfig;
    import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
    import com.google.cloud.speech.v1.SpeechClient;
    import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
    import com.google.cloud.speech.v1.StreamingRecognitionConfig;
    import com.google.cloud.speech.v1.StreamingRecognitionResult;
    import com.google.cloud.speech.v1.StreamingRecognizeRequest;
    import com.google.cloud.speech.v1.StreamingRecognizeResponse;
    import com.google.common.util.concurrent.SettableFuture;
    import com.google.protobuf.ByteString;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    
    /**
     * Performs streaming speech recognition on raw PCM audio data.
     *
     * @param fileName the path to a PCM audio file to transcribe.
     */
    public static void streamingRecognizeFile(String fileName) throws Exception {
      Path path = Paths.get(fileName);
      byte[] data = Files.readAllBytes(path);
    
      // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
      try (SpeechClient speech = SpeechClient.create()) {
    
        // Configure request with local raw PCM audio
        RecognitionConfig recConfig =
            RecognitionConfig.newBuilder()
                .setEncoding(AudioEncoding.LINEAR16)
                .setLanguageCode("en-US")
                .setSampleRateHertz(16000)
                .setModel("default")
                .build();
        StreamingRecognitionConfig config =
            StreamingRecognitionConfig.newBuilder().setConfig(recConfig).build();
    
        // Collects all responses and exposes them through a future once the
        // stream completes.
        class ResponseApiStreamingObserver<T> implements ApiStreamObserver<T> {
          private final SettableFuture<List<T>> future = SettableFuture.create();
          private final List<T> messages = new ArrayList<>();
    
          @Override
          public void onNext(T message) {
            messages.add(message);
          }
    
          @Override
          public void onError(Throwable t) {
            future.setException(t);
          }
    
          @Override
          public void onCompleted() {
            future.set(messages);
          }
    
          // Returns the SettableFuture object to get received messages / exceptions.
          public SettableFuture<List<T>> future() {
            return future;
          }
        }
    
        ResponseApiStreamingObserver<StreamingRecognizeResponse> responseObserver =
            new ResponseApiStreamingObserver<>();
    
        BidiStreamingCallable<StreamingRecognizeRequest, StreamingRecognizeResponse> callable =
            speech.streamingRecognizeCallable();
    
        ApiStreamObserver<StreamingRecognizeRequest> requestObserver =
            callable.bidiStreamingCall(responseObserver);
    
        // The first request must **only** contain the audio configuration:
        requestObserver.onNext(
            StreamingRecognizeRequest.newBuilder().setStreamingConfig(config).build());
    
        // Subsequent requests must **only** contain the audio data.
        requestObserver.onNext(
            StreamingRecognizeRequest.newBuilder()
                .setAudioContent(ByteString.copyFrom(data))
                .build());
    
        // Mark transmission as completed after sending the data.
        requestObserver.onCompleted();
    
        List<StreamingRecognizeResponse> responses = responseObserver.future().get();
    
        for (StreamingRecognizeResponse response : responses) {
          // For streaming recognize, the results list has one is_final result (if available)
          // followed by a number of in-progress results (if interim_results is true) for
          // subsequent utterances. Just print the first result here.
          StreamingRecognitionResult result = response.getResultsList().get(0);
          // There can be several alternative transcripts for a given chunk of speech. Just
          // use the first (most likely) one here.
          SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
          System.out.printf("Transcript : %s%n", alternative.getTranscript());
        }
      }
    }
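
The sample above transcribes a finished file, while live captioning needs audio fed in as it is captured. A sketch of that variant, assuming 16 kHz 16-bit mono PCM from the default capture device; note that grabbing what the laptop speakers play (e.g. a Zoom call) usually requires an OS-level loopback or virtual audio device, and that a single Google streaming session is time-limited, so a long-running app has to restart the stream periodically:

    import com.google.api.gax.rpc.ApiStreamObserver;
    import com.google.cloud.speech.v1.StreamingRecognizeRequest;
    import com.google.protobuf.ByteString;
    import javax.sound.sampled.AudioFormat;
    import javax.sound.sampled.AudioSystem;
    import javax.sound.sampled.TargetDataLine;
    
    // Captures audio from the default input line and streams it to the
    // recognizer in ~100 ms chunks; requestObserver is set up as above.
    public static void streamMicrophone(
        ApiStreamObserver<StreamingRecognizeRequest> requestObserver) throws Exception {
      // 16 kHz, 16-bit, mono, signed, little-endian: must match RecognitionConfig.
      AudioFormat format = new AudioFormat(16000.0f, 16, 1, true, false);
      TargetDataLine line = AudioSystem.getTargetDataLine(format);
      line.open(format);
      line.start();
    
      byte[] buffer = new byte[3200]; // 100 ms at 16 kHz * 2 bytes per sample
      try {
        while (!Thread.currentThread().isInterrupted()) {
          int n = line.read(buffer, 0, buffer.length);
          if (n > 0) {
            requestObserver.onNext(
                StreamingRecognizeRequest.newBuilder()
                    .setAudioContent(ByteString.copyFrom(buffer, 0, n))
                    .build());
          }
        }
      } finally {
        line.stop();
        line.close();
        requestObserver.onCompleted();
      }
    }

With interim results enabled (`setInterimResults(true)` on the `StreamingRecognitionConfig` builder), partial transcripts arrive while the speaker is still talking, which is what makes the captions feel live.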
    

For a voice overlay on mobile:

https://github.com/algolia/voice-overlay-android

For the web, an HTML5 overlay:


    <video id="video" controls preload="metadata">
       <source src="video/sintel-short.mp4" type="video/mp4">
       <source src="video/sintel-short.webm" type="video/webm">
       <track label="English" kind="subtitles" srclang="en" src="captions/vtt/sintel-en.vtt" default>
       <track label="Deutsch" kind="subtitles" srclang="de" src="captions/vtt/sintel-de.vtt">
       <track label="Español" kind="subtitles" srclang="es" src="captions/vtt/sintel-es.vtt">
    </video>

    // Per the sample this markup comes from, you can feed / append the captions
    // (createMenuItem and videoContainer are defined in that sample).
    var subtitlesMenu;
    if (video.textTracks) {
        var df = document.createDocumentFragment();
        subtitlesMenu = df.appendChild(document.createElement('ul'));
        subtitlesMenu.className = 'subtitles-menu';
        subtitlesMenu.appendChild(createMenuItem('subtitles-off', '', 'Off'));
        for (var i = 0; i < video.textTracks.length; i++) {
            subtitlesMenu.appendChild(createMenuItem(
                'subtitles-' + video.textTracks[i].language,
                video.textTracks[i].language,
                video.textTracks[i].label));
        }
        videoContainer.appendChild(subtitlesMenu);
    }