Is a speech-to-text Java web app for live captions possible?
This is about the Google Speech-to-Text API.
I want to develop a Spring Boot Java web application:
- The application starts on localhost
- I open a browser at http://localhost:8080
- The application shows a simple UI whose main window displays live captions
- Any English audio coming from the laptop speakers is captioned, for example a Zoom video call where participants are speaking: I hear their voices and I also see live captions in my local web app
- The live captions stay on screen in that window
- The live captions are also saved to a text file, with each new caption appended to the file
It is essential that the captions are as accurate as possible and appear quickly while the speaker is talking.
Is this achievable? If the Google API is not viable, what alternative APIs are there?
One of the fastest and most efficient ways to convert speech to text is the Java Speech API (documentation at https://www.oracle.com/java/technologies/speech-api-frequently-asked-questions.html).
During conversion you need to break the text into parts. The meaning may shift slightly, since some expressions mean something different from their individual words, but this helps reduce the time to the final translation. You then send the pieces you have already received (words, phrases) through an API for translation.
You can pick a few options you like (for example from https://rapidapi.com/blog/best-translation-api/) and check which one works faster. In my experience, "Microsoft Translator Text" and "Google Translate" are the fastest. I also think you will not get truly instant translation, but if you test several API options and consider whether to process whole sentences, phrases, or single words at a time, you can cut the translation time to a minimum.
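The chunking step described above can be sketched as follows. This is only an illustration (the class and method names are hypothetical, and the actual translation API call is omitted): split the incoming transcript into sentence-sized pieces so each piece can be sent off for translation as soon as it is available.

```java
import java.util.ArrayList;
import java.util.List;

public class TranscriptChunker {

    // Splits a running transcript into sentence-sized chunks. Each chunk can
    // then be submitted to a translation API independently, instead of
    // waiting for the full transcript.
    public static List<String> chunk(String transcript) {
        List<String> chunks = new ArrayList<>();
        // Split after sentence-ending punctuation followed by whitespace.
        for (String part : transcript.split("(?<=[.!?])\\s+")) {
            if (!part.trim().isEmpty()) {
                chunks.add(part.trim());
            }
        }
        return chunks;
    }
}
```

Sentence boundaries are a crude heuristic here; a real pipeline would likely tune the split granularity (sentence vs. phrase vs. word) based on the latency measurements mentioned above.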
If I understand correctly, IMHO I would split this into two parts:
1. transcribe the speech to text with the Google API, as below (the sample transcribes a PCM file; for live captions you would feed captured audio through the same streaming call);
2. then overlay the captions as a stream.
import com.google.api.gax.rpc.ApiStreamObserver;
import com.google.api.gax.rpc.BidiStreamingCallable;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.SpeechRecognitionAlternative;
import com.google.cloud.speech.v1.StreamingRecognitionConfig;
import com.google.cloud.speech.v1.StreamingRecognitionResult;
import com.google.cloud.speech.v1.StreamingRecognizeRequest;
import com.google.cloud.speech.v1.StreamingRecognizeResponse;
import com.google.common.util.concurrent.SettableFuture;
import com.google.protobuf.ByteString;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/**
 * Performs streaming speech recognition on raw PCM audio data.
 *
 * @param fileName the path to a PCM audio file to transcribe.
 */
public static void streamingRecognizeFile(String fileName) throws Exception {
  Path path = Paths.get(fileName);
  byte[] data = Files.readAllBytes(path);

  // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
  try (SpeechClient speech = SpeechClient.create()) {
    // Configure request with local raw PCM audio
    RecognitionConfig recConfig =
        RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)
            .setLanguageCode("en-US")
            .setSampleRateHertz(16000)
            .setModel("default")
            .build();
    StreamingRecognitionConfig config =
        StreamingRecognitionConfig.newBuilder().setConfig(recConfig).build();

    class ResponseApiStreamingObserver<T> implements ApiStreamObserver<T> {
      private final SettableFuture<List<T>> future = SettableFuture.create();
      private final List<T> messages = new ArrayList<>();

      @Override
      public void onNext(T message) {
        messages.add(message);
      }

      @Override
      public void onError(Throwable t) {
        future.setException(t);
      }

      @Override
      public void onCompleted() {
        future.set(messages);
      }

      // Returns the SettableFuture object to get received messages / exceptions.
      public SettableFuture<List<T>> future() {
        return future;
      }
    }

    ResponseApiStreamingObserver<StreamingRecognizeResponse> responseObserver =
        new ResponseApiStreamingObserver<>();
    BidiStreamingCallable<StreamingRecognizeRequest, StreamingRecognizeResponse> callable =
        speech.streamingRecognizeCallable();
    ApiStreamObserver<StreamingRecognizeRequest> requestObserver =
        callable.bidiStreamingCall(responseObserver);

    // The first request must **only** contain the audio configuration:
    requestObserver.onNext(
        StreamingRecognizeRequest.newBuilder().setStreamingConfig(config).build());

    // Subsequent requests must **only** contain the audio data.
    requestObserver.onNext(
        StreamingRecognizeRequest.newBuilder()
            .setAudioContent(ByteString.copyFrom(data))
            .build());

    // Mark transmission as completed after sending the data.
    requestObserver.onCompleted();

    List<StreamingRecognizeResponse> responses = responseObserver.future().get();

    for (StreamingRecognizeResponse response : responses) {
      // For streaming recognize, the results list has one is_final result (if available)
      // followed by a number of in-progress results (if interim_results is true) for
      // subsequent utterances. Just print the first result here.
      StreamingRecognitionResult result = response.getResultsList().get(0);
      // There can be several alternative transcripts for a given chunk of speech. Just
      // use the first (most likely) one here.
      SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
      System.out.printf("Transcript : %s\n", alternative.getTranscript());
    }
  }
}
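The question also asks for every new caption to be appended to a text file. Inside the response loop above, each transcript could be passed to a small helper like this (the class and method names are just for illustration):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class CaptionFile {

    // Appends one caption line to the given file, creating the file on
    // first use. Each call adds the caption plus a line separator, so the
    // file keeps accumulating captions as they arrive.
    public static void append(Path file, String caption) throws IOException {
        Files.write(
            file,
            (caption + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE,
            StandardOpenOption.APPEND);
    }
}
```

For example, `CaptionFile.append(Paths.get("captions.txt"), alternative.getTranscript())` in the loop would satisfy the append-to-file requirement.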
For a voice overlay on mobile:
https://github.com/algolia/voice-overlay-android
For the web, an HTML5 overlay:
<video id="video" controls preload="metadata">
  <source src="video/sintel-short.mp4" type="video/mp4">
  <source src="video/sintel-short.webm" type="video/webm">
  <track label="English" kind="subtitles" srclang="en" src="captions/vtt/sintel-en.vtt" default>
  <track label="Deutsch" kind="subtitles" srclang="de" src="captions/vtt/sintel-de.vtt">
  <track label="Español" kind="subtitles" srclang="es" src="captions/vtt/sintel-es.vtt">
</video>
// Per the sample this markup comes from, you can build a subtitles menu and
// append the caption tracks (createMenuItem is defined in that sample).
var subtitlesMenu;
if (video.textTracks) {
  var df = document.createDocumentFragment();
  subtitlesMenu = df.appendChild(document.createElement('ul'));
  subtitlesMenu.className = 'subtitles-menu';
  subtitlesMenu.appendChild(createMenuItem('subtitles-off', '', 'Off'));
  for (var i = 0; i < video.textTracks.length; i++) {
    subtitlesMenu.appendChild(createMenuItem('subtitles-' + video.textTracks[i].language,
        video.textTracks[i].language, video.textTracks[i].label));
  }
  videoContainer.appendChild(subtitlesMenu);
}
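To get transcripts from the backend into the browser window as they arrive, one option is Server-Sent Events. The sketch below uses hypothetical names and the plain JDK HttpServer so it is self-contained; in a real Spring Boot app you would more likely expose the same stream through Spring's SseEmitter. It emits each caption line as one SSE "data:" event:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class CaptionSse {

    // Starts an HTTP server whose /captions endpoint streams one SSE
    // "data:" event per caption line, then closes the stream.
    public static HttpServer start(int port, Iterable<String> captions) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/captions", exchange -> {
            exchange.getResponseHeaders().add("Content-Type", "text/event-stream");
            exchange.sendResponseHeaders(200, 0); // 0 = streaming (chunked) body
            try (OutputStream out = exchange.getResponseBody()) {
                for (String caption : captions) {
                    // SSE framing: "data: <text>" followed by a blank line.
                    out.write(("data: " + caption + "\n\n").getBytes(StandardCharsets.UTF_8));
                    out.flush(); // push each caption to the client immediately
                }
            }
        });
        server.start();
        return server;
    }
}
```

On the page, `new EventSource('/captions')` with an `onmessage` handler can then append each event's data to the live-caption window and keep the latest lines on screen.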