How to synchronize a TargetDataLine and SourceDataLine in Java (Synchronize audio recording and playback)

Question

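I am trying to build a Java application that plays back an audio track, records the user's voice, and determines whether the user is singing along at the right time.

For now I am focusing only on recording and playing audio (pitch recognition is out of scope).

To do this I use TargetDataLine and SourceDataLine from the Java audio API. I start the recording first, then start the playback. Because I need to check that the user sings at the right time, the recorded audio and the played audio must stay synchronized.

For example, if playback starts 1 second after recording, I know I have to ignore the first second of data in the record buffer.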
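As a rough illustration of the arithmetic behind "ignore the first second", the sketch below converts a known start delay into a number of frames and bytes to skip in the recording. The class and method names (`RecordingOffset`, `framesFor`, `bytesToSkip`) are hypothetical, and the sketch assumes the delay is already known; measuring it reliably is the actual problem discussed in this thread.

```java
import javax.sound.sampled.AudioFormat;

// Hypothetical helper, not part of the code in the question.
final class RecordingOffset {

    /** Number of frames captured during delayMicros at the format's frame rate (rounded). */
    static long framesFor(AudioFormat format, long delayMicros) {
        return Math.round(delayMicros / 1_000_000.0 * format.getFrameRate());
    }

    /** Byte offset to skip in the recording; always a whole number of frames. */
    static long bytesToSkip(AudioFormat format, long delayMicros) {
        return framesFor(format, delayMicros) * format.getFrameSize();
    }

    public static void main(String[] args) {
        // Same recording format as in the question: 44.1 kHz, 16-bit, mono, signed, little-endian.
        AudioFormat recordFormat = new AudioFormat(44100f, 16, 1, true, false);
        // One second of delay corresponds to 44100 frames of 2 bytes each.
        System.out.println(bytesToSkip(recordFormat, 1_000_000L)); // prints 88200
    }
}
```

Working in whole frames keeps the skipped region aligned, so the remaining bytes still start on a sample boundary.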

I use the following code for testing (it is far from perfect, but it is only meant for testing).

import javax.sound.sampled.*;
import java.io.File;
import java.io.IOException;

class AudioSynchro {

private TargetDataLine targetDataLine;
private SourceDataLine sourceDataLine;
private AudioInputStream ais;
private AudioFormat recordAudioFormat;
private AudioFormat playAudioFormat;

public AudioSynchro(String sourceFile) throws IOException, UnsupportedAudioFileException {
    ais = AudioSystem.getAudioInputStream(new File(sourceFile));

    recordAudioFormat = new AudioFormat(44100f, 16, 1, true, false);
    playAudioFormat = ais.getFormat();
}

//Enumerate the mixers
public void enumerate() {
    try {
        Mixer.Info[] mixerInfo =
                AudioSystem.getMixerInfo();
        System.out.println("Available mixers:");
        for(int cnt = 0; cnt < mixerInfo.length;
            cnt++){
            System.out.println(mixerInfo[cnt].
                    getName());
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

//Init datalines
public void initDataLines() throws LineUnavailableException {
    Mixer.Info[] mixerInfo =
            AudioSystem.getMixerInfo();

    DataLine.Info targetDataLineInfo = new DataLine.Info(TargetDataLine.class, recordAudioFormat);

    Mixer targetMixer = AudioSystem.getMixer(mixerInfo[5]);

    targetDataLine = (TargetDataLine)targetMixer.getLine(targetDataLineInfo);

    DataLine.Info sourceDataLineInfo = new DataLine.Info(SourceDataLine.class, playAudioFormat);

    Mixer sourceMixer = AudioSystem.getMixer(mixerInfo[3]);

    sourceDataLine = (SourceDataLine)sourceMixer.getLine(sourceDataLineInfo);
}

public void startRecord() throws LineUnavailableException {
    AudioInputStream stream = new AudioInputStream(targetDataLine);

    targetDataLine.open(recordAudioFormat);

    byte currentByteBuffer[] = new byte[512];

    Runnable readAudioStream = new Runnable() {
        @Override
        public void run() {
            int count = 0;
            try {
                targetDataLine.start();
                while ((count = stream.read(currentByteBuffer)) != -1) {
                    //Do something
                }
            }
            catch(Exception e) {
                e.printStackTrace();
            }
        }
    };
    Thread thread = new Thread(readAudioStream);
    thread.start();
}

public void startPlay() throws LineUnavailableException {
    sourceDataLine.open(playAudioFormat);
    sourceDataLine.start();

    Runnable playAudio = new Runnable() {
        @Override
        public void run() {
            try {
                int nBytesRead = 0;
                byte[] abData = new byte[8192];
                while (nBytesRead != -1) {
                    nBytesRead = ais.read(abData, 0, abData.length);
                    if (nBytesRead >= 0) {
                        int nBytesWritten = sourceDataLine.write(abData, 0, nBytesRead);
                    }
                }

                sourceDataLine.drain();
                sourceDataLine.close();
            }
            catch(Exception e) {
                e.printStackTrace();
            }
        }
    };
    Thread thread = new Thread(playAudio);
    thread.start();
}

public void printStats() {
    Runnable stats = new Runnable() {

        @Override
        public void run() {
            while(true) {
                long targetDataLinePosition = targetDataLine.getMicrosecondPosition();
                long sourceDataLinePosition = sourceDataLine.getMicrosecondPosition();
                long delay = targetDataLinePosition - sourceDataLinePosition;
                System.out.println(targetDataLinePosition+"\t"+sourceDataLinePosition+"\t"+delay);

                try {
                    Thread.sleep(20);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }
    };

    Thread thread = new Thread(stats);
    thread.start();
}

public static void main(String[] args) {
    try {
        AudioSynchro audio = new AudioSynchro("C:\\dev\\intellij-ws\\guitar-challenge\\src\\main\\resources\\com\\ouestdev\\guitarchallenge\\al_adagi.mid");
        audio.enumerate();
        audio.initDataLines();
        audio.startRecord();
        audio.startPlay();
        audio.printStats();
    } catch (IOException | LineUnavailableException | UnsupportedAudioFileException e) {
        e.printStackTrace();
    }
}

}

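The code initializes the two data lines, starts recording, starts audio playback, and prints statistics. The enumerate() method lists the mixers available on the system. To run your own tests you must change the mixers used in initDataLines() to match your system. The printStats() method starts a thread that polls the position of both data lines in microseconds. This is the data I am trying to use to track synchronization. What I observe is that the two data lines do not always stay in sync. Here is a short excerpt of my console output: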

Columns: TargetDataLine position, SourceDataLine position, delay (all in microseconds).

130000 0 130000
150000 748 149252
170000 20748 149252
190000 40748 149252
210000 60748 149252
230000 80748 149252
250000 100748 149252
270000 120748 149252
290000 140748 149252
310000 160748 149252
330000 180748 149252
350000 190748 159252
370000 210748 159252
390000 240748 149252
410000 260748 149252
430000 280748 149252
450000 300748 149252
470000 310748 159252
490000 340748 149252
510000 350748 159252
530000 370748 159252

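As we can see, the delay regularly jumps by 10 ms, so I cannot tell exactly which position in the record buffer corresponds to the start of the playback buffer. In the example above, I do not know whether I should start from position 149252 or 159252. In audio processing 10 ms matters, and I would like something more accurate (1 or 2 ms would be acceptable). It also seems strange that whenever the two measurements differ, the gap is always exactly 10 ms.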
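To put a number on that uncertainty, here is a small sketch that computes the range of the measured delay from a few of the position pairs in the excerpt and converts the spread into frames. The class name `DelayJitter` is hypothetical, and the 44.1 kHz rate is taken from the recording format in the question.

```java
// Hypothetical helper for inspecting logged positions; not part of the original code.
final class DelayJitter {

    /** Returns {min, max} of (target - source) over all sampled position pairs. */
    static long[] delayRange(long[] targetMicros, long[] sourceMicros) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (int i = 0; i < targetMicros.length; i++) {
            long delay = targetMicros[i] - sourceMicros[i];
            min = Math.min(min, delay);
            max = Math.max(max, delay);
        }
        return new long[] {min, max};
    }

    public static void main(String[] args) {
        // Four consecutive rows taken from the console excerpt (microseconds).
        long[] target = {330000, 350000, 370000, 390000};
        long[] source = {180748, 190748, 210748, 240748};
        long[] range = delayRange(target, source);
        long spreadMicros = range[1] - range[0];
        // Convert the spread to frames at the 44.1 kHz recording rate.
        long spreadFrames = Math.round(spreadMicros / 1_000_000.0 * 44100);
        System.out.println("delay range: " + range[0] + " .. " + range[1] + " us");
        System.out.println("spread: " + spreadMicros + " us, about " + spreadFrames + " frames");
    }
}
```

A 10 ms spread corresponds to 441 frames at 44.1 kHz, which is the size of the ambiguity when choosing the starting position in the record buffer.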

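I then pushed my tests further, without getting better results:
- Tried larger and smaller buffers.
- Tried a playback buffer twice as large as the record buffer, since the audio file is stereo and therefore consumes more bytes (2 bytes/frame for recording, 4 bytes/frame for playback).
- Tried recording and playing on the same audio device.

As I see it, there are two strategies for synchronizing the two buffers:
- What I am trying to do: determine precisely the position in the record buffer at which playback starts.
- Start recording and playback at the same time.

With either strategy, I need a guarantee that synchronization is maintained.

Has anyone here run into this kind of problem?

At the moment I use Java 12 and JavaFX for my application, but I am open to using another framework. I have not tried them, but frameworks such as LWJGL (https://www.lwjgl.org/, based on OpenAL) or Beads (http://www.beadsproject.net/) might give better results and more control. If any of you know these frameworks and can share feedback, I am interested.

As a last resort, changing programming language would also be an acceptable solution.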

Answer 1

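I have not done much with TargetDataLines, but I think I can offer a useful observation and a suggestion.

First, the test you wrote is probably measuring the variance of the multithreading rather than slippage in the file timing. The way the JVM switches back and forth between threads can be quite unpredictable. There is a good article on real-time, low-latency coding in Java that you can read for background.

Second, the way Java uses blocking queues with audio IO provides a great deal of stability. Otherwise we would hear all sorts of audio artifacts during playback or recording.

Here is an idea to try: create a single Runnable with a while loop that processes the same number of frames from the TargetDataLine and the SourceDataLine in the same iteration. This Runnable can be loosely coupled (using booleans to turn the lines on and off).

The main benefit is that you know every loop iteration produces coordinated data.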
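One way such a loop could look is sketched below. It is only an illustration: the interfaces `BlockReader`, `BlockWriter` and `MicBlockHandler` and the class `LockstepLoop` are hypothetical names introduced here so the loop can be exercised without audio hardware. The assumption is that `AudioInputStream.read`, `TargetDataLine.read` and `SourceDataLine.write` are passed in as method references, since all three take `(byte[], int, int)` and return an `int`, and that both lines run at the same sample rate.

```java
import java.io.IOException;

// Hypothetical sketch of a loop that moves the same number of frames on both sides per iteration.
final class LockstepLoop {

    @FunctionalInterface
    interface BlockReader { int read(byte[] buf, int off, int len) throws IOException; }

    @FunctionalInterface
    interface BlockWriter { int write(byte[] buf, int off, int len) throws IOException; }

    @FunctionalInterface
    interface MicBlockHandler { void handle(long firstFrame, byte[] buf, int len); }

    /** Reads until len bytes are available or the reader reports end of data. */
    static int readFully(BlockReader in, byte[] buf, int len) throws IOException {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, total, len - total);
            if (n <= 0) break;
            total += n;
        }
        return total;
    }

    /** Runs until the song is exhausted and returns the number of frames processed. */
    static long run(BlockReader song, int songFrameSize, BlockWriter speaker,
                    BlockReader mic, int micFrameSize, MicBlockHandler handler,
                    int framesPerBlock) throws IOException {
        byte[] songBuf = new byte[framesPerBlock * songFrameSize];
        byte[] micBuf = new byte[framesPerBlock * micFrameSize];
        long framesDone = 0;
        while (true) {
            int frames = readFully(song, songBuf, songBuf.length) / songFrameSize;
            if (frames == 0) break;                       // end of the playback data
            speaker.write(songBuf, 0, frames * songFrameSize);
            int micBytes = readFully(mic, micBuf, frames * micFrameSize);
            handler.handle(framesDone, micBuf, micBytes); // mic block tagged with its frame index
            framesDone += frames;
        }
        return framesDone;
    }
}
```

With the fields from the question, a call might look like `LockstepLoop.run(ais::read, playAudioFormat.getFrameSize(), sourceDataLine::write, targetDataLine::read, recordAudioFormat.getFrameSize(), handler, 256)`. The frame index passed to the handler is a count of processed frames, not a hardware position, so it does not by itself remove the fixed latency between the two lines.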

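EDIT: Here are a couple of examples of what I have done with frame counting. (1) I have an audio loop that counts frames as it processes them. All timing is determined strictly by the number of frames processed; I never bother to read the position from the SDL. I wrote a metronome that starts a synthesized click every N frames (where N is derived from the tempo). At the Nth frame, the data for the synthesized click is mixed into the audio data being sent out through the SDL. The timing accuracy I get with this method is excellent.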
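The metronome idea can be sketched roughly as follows. This is not the answerer's actual code: `FrameMetronome` and its methods are hypothetical, and the sketch only computes where in each block a click should begin, leaving the mixing of the click samples out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical frame-counting metronome: timing comes only from the number of frames processed.
final class FrameMetronome {
    private final long framesPerClick;
    private long framesProcessed = 0;

    FrameMetronome(float sampleRate, double beatsPerMinute) {
        // N = frames between two clicks, derived from the tempo.
        this.framesPerClick = Math.round(sampleRate * 60.0 / beatsPerMinute);
    }

    /**
     * Advances the frame counter by one block and returns the offsets (in frames,
     * relative to the start of this block) at which a click should begin.
     */
    List<Integer> advance(int blockFrames) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < blockFrames; i++) {
            if ((framesProcessed + i) % framesPerClick == 0) {
                offsets.add(i);
            }
        }
        framesProcessed += blockFrames;
        return offsets;
    }

    public static void main(String[] args) {
        // 120 bpm at 44.1 kHz gives one click every 22050 frames.
        FrameMetronome metronome = new FrameMetronome(44100f, 120.0);
        for (int block = 0; block < 100; block++) {
            List<Integer> clicks = metronome.advance(256);
            if (!clicks.isEmpty()) {
                System.out.println("block " + block + " click offsets " + clicks);
            }
        }
    }
}
```

Because the click position is expressed as an offset inside the block being written, its timing does not depend on when the thread happens to run, only on how many frames have been sent.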

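(2) In another application, at the Nth frame, I trigger a visual/graphical event. The graphics loop is typically set to 60 fps and the audio runs at 44100 fps. The trigger is handled through loose coupling: a boolean for the event is flipped by the audio thread (and nothing more; cluttering the audio thread with unrelated activity is dangerous and can lead to stutters and dropouts). The graphics loop (the "game loop") picks up the change in the boolean and handles it in its own time (60 fps). I have gotten some nice visual and audio synchronization this way, including having the brightness of an object track the volume of the sound being played. This is similar to the digital VU meters that many people have written in Java.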
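A minimal sketch of that kind of loose coupling is shown below. The class name `FrameTrigger` is hypothetical, and using `AtomicBoolean` rather than a plain `volatile boolean` is an assumption made here so the graphics side can read and clear the flag in one step.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: the audio thread only flips a flag, the graphics loop reacts later.
final class FrameTrigger {
    private final AtomicBoolean eventPending = new AtomicBoolean(false);
    private final long triggerFrame;
    private long framesProcessed = 0; // touched only by the audio thread

    FrameTrigger(long triggerFrame) {
        this.triggerFrame = triggerFrame;
    }

    /** Called by the audio thread after each block; does nothing beyond setting the flag. */
    void onAudioBlock(int blockFrames) {
        long blockStart = framesProcessed;
        framesProcessed += blockFrames;
        if (blockStart <= triggerFrame && triggerFrame < framesProcessed) {
            eventPending.set(true);
        }
    }

    /** Called by the graphics loop (for example at 60 fps); returns true once per event. */
    boolean pollEvent() {
        return eventPending.getAndSet(false);
    }
}
```

The graphics loop would call `pollEvent()` once per rendered frame, so the visual reaction can lag the audio by up to one graphics frame (about 16.7 ms at 60 fps) while the audio thread itself is never blocked.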

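Depending on the level of precision you need, I think frame counting will be sufficient. I do not know of any other method in Java that is as accurate.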

Answer 2

I ran new tests with the code below (Phil, tell me whether this is what you had in mind).

public void startAll() throws LineUnavailableException, IOException {
    AudioInputStream stream = new AudioInputStream(targetDataLine);

    targetDataLine.open(recordAudioFormat);

    // 512 bytes of mono input (2 bytes/frame) and 1024 bytes of stereo output
    // (4 bytes/frame) both hold 256 frames, so each iteration handles the
    // same number of frames on both lines.
    byte[] recordByteBuffer = new byte[512];
    byte[] playByteBuffer = new byte[1024];

    sourceDataLine.open(playAudioFormat);
    targetDataLine.start();
    sourceDataLine.start();

    Runnable audio = new Runnable() {
        @Override
        public void run() {
            int totalRecordCount = 0;
            int totalPlayCount = 0;
            int totalWritePlayCount = 0;
            try {
                while (true) {
                    // Record one block, then read and play one block of the file.
                    int recordCount = stream.read(recordByteBuffer);
                    totalRecordCount += recordCount;
                    long targetDataLinePosition = targetDataLine.getLongFramePosition();
                    int playCount = ais.read(playByteBuffer, 0, playByteBuffer.length);
                    if (playCount == -1) {
                        break; // end of the file: never pass -1 to write()
                    }
                    int playWriteCount = sourceDataLine.write(playByteBuffer, 0, playCount);
                    totalPlayCount += playCount;
                    totalWritePlayCount += playWriteCount;
                    long sourceDataLinePosition = sourceDataLine.getLongFramePosition();

                    long delay = targetDataLinePosition - sourceDataLinePosition;
                    System.out.println(targetDataLinePosition + "\t" + sourceDataLinePosition + "\t" + delay + "\t" + totalRecordCount + "\t" + totalPlayCount + "\t" + totalWritePlayCount + "\t" + System.nanoTime());
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    };

    Thread thread = new Thread(audio);
    thread.start();
}

Here is the result (I only include part of it, because the full trace is very long).

Columns: TargetDataLine position, SourceDataLine position, delay, total bytes recorded, total bytes read from the file, total bytes written to the SourceDataLine, System.nanoTime(). The first line is a separate measurement.

1439300 <-- offset between the start of the TargetDataLine and the start of the SourceDataLine, in ns
119297 0 119297 512 1024 1024 565993368423500
179297 0 179297 1024 2048 2048 565993388887000
189297 0 189297 1536 3072 3072 565993390006000
189297 0 189297 2048 4096 4096 565993390998900
189297 0 189297 2560 5120 5120 565993391737300
189297 0 189297 3072 6144 6144 565993392430700
189297 0 189297 3584 7168 7168 565993392608000
189297 0 189297 4096 8192 8192 565993393295200
189297 0 189297 4608 9216 9216 565993393971900
189297 0 189297 5120 10240 10240 565993394690200
189297 0 189297 5632 11264 11264 565993395476900
189297 0 189297 6144 12288 12288 565993396160600
189297 0 189297 6656 13312 13312 565993396864500
189297 0 189297 7168 14336 14336 565993397032000
189297 0 189297 7680 15360 15360 565993397736000
189297 0 189297 8192 16384 16384 565993398467800
199297 0 199297 8704 17408 17408 565993399156300
199297 0 199297 15360 30720 30720 565993406362500
199297 0 199297 15872 31744 31744 565993407001900
199297 0 199297 16384 32768 32768 565993407585200
329297 115804 213493 16896 33792 33792 565993532785500 <-- playback starts here
329297 115804 213493 17408 34816 34816 565993533320600
329297 115804 213493 17920 35840 35840 565993533486300
329297 115804 213493 22016 44032 44032 565993536512600
329297 115804 213493 22528 45056 45056 565993536941700
329297 125804 203493 23040 46080 46080 565993537363100 <-- SourceDataLine advances by 10 ms but TargetDataLine does not
329297 125804 203493 23552 47104 47104 565993537746900
329297 125804 203493 24064 48128 48128 565993538158600
339297 125804 213493 24576 49152 49152 565993538306400 <-- TargetDataLine advances by 10 ms but SourceDataLine does not. The situation recovers.
339297 125804 213493 25088 50176 50176 565993538762200
469297 255804 213493 39424 78848 78848 565993674194900
469297 255804 213493 39936 79872 79872 565993674513700
469297 255804 213493 40448 80896 80896 565993674872000
469297 255804 213493 40960 81920 81920 565993675177000
599297 385804 213493 41472 82944 82944 565993800684100 <-- TargetDataLine and SourceDataLine both advance in steps of 10 ms. The delay is unchanged.
599297 385804 213493 41984 83968 83968 565993800871800
599297 385804 213493 42496 84992 84992 565993801189300
599297 385804 213493 43008 86016 86016 565993801486800
599297 385804 213493 43520 87040 87040 565993801814500

My observations are as follows:

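- Even when the two data lines are started at the same moment, there is a large gap between them. The measurement made with System.out.println(System.nanoTime() - targetStart); shows an offset of 1.4393 ms between the two starts, whereas the delay variable ranges between 203.493 ms and 213.493 ms. So starting the two data lines at the same time cannot be relied on to obtain perfect synchronization.
- The SourceDataLine received 33792 bytes before playback actually started.
- The precision of getMicrosecondPosition() is not very good (getLongFramePosition() is no better, since getMicrosecondPosition() is computed from it). For the TargetDataLine (recording), the value 189297 is displayed 14 times, and the time elapsed between the first and the last of those 14 lines, as estimated with System.nanoTime(), is 8.4618 ms! This seems to confirm that a precision better than 10 ms cannot be obtained with this method.
- In my case, the Java implementations used are DirectAudioDevice$DirectTDL and DirectAudioDevice$DirectSDL (other implementations exist, depending in particular on the OS). The low-level method called is static native long nGetBytePosition(long id, boolean isSource, long javaPos). This method is native, so it is implemented in another language (and it most likely calls into the driver). The lack of precision comes from this method, not directly from the Java code.
- We can see that a shift appears when one of the data lines advances by 10 ms while the other keeps its old value. The shift is absorbed when the other line advances by 10 ms in turn. This phenomenon is less visible in the printStats() method because it uses Thread.sleep(20).
- Running everything in a single thread does not change much. So I think the Java audio API is not accurate enough for what I want to accomplish.
- The article Phil mentioned in his comment indicates that results with the Java Sound API are inconclusive, and that they went through RtAudio and Java.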
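The 8.4618 ms figure can be reproduced from the trace with a few lines of code. The sketch below (the class `PositionGranularity` is a hypothetical name) looks for the longest stretch of log rows during which the reported position did not change and returns how much wall-clock time that stretch covers.

```java
// Hypothetical helper for analysing a position log; not part of the original test code.
final class PositionGranularity {

    /** Longest time span in ns during which the reported position stayed the same. */
    static long longestStallNanos(long[] positions, long[] nanoTimes) {
        long longest = 0;
        int runStart = 0;
        for (int i = 1; i <= positions.length; i++) {
            // A run ends at the end of the log or when the position changes.
            if (i == positions.length || positions[i] != positions[runStart]) {
                longest = Math.max(longest, nanoTimes[i - 1] - nanoTimes[runStart]);
                runStart = i;
            }
        }
        return longest;
    }

    public static void main(String[] args) {
        // The 14 rows of the trace in which the TargetDataLine position reads 189297.
        long[] positions = new long[14];
        java.util.Arrays.fill(positions, 189297L);
        long[] nanoTimes = {
            565993390006000L, 565993390998900L, 565993391737300L, 565993392430700L,
            565993392608000L, 565993393295200L, 565993393971900L, 565993394690200L,
            565993395476900L, 565993396160600L, 565993396864500L, 565993397032000L,
            565993397736000L, 565993398467800L
        };
        System.out.println(longestStallNanos(positions, nanoTimes)); // prints 8461800
    }
}
```

A stall of about 8.46 ms between position updates is consistent with the 10 ms steps seen in the delay column.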
