How to get accurate timing using microphone in python

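I am trying to do beat detection with a PC microphone and then use the beat timestamps to calculate the distance between successive beats. I chose python because plenty of material is available and it is quick to develop in. By searching the internet I came up with this simple code (no advanced peak detection or anything yet; that comes later if needed):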

import pyaudio
import struct
import math
import time


SHORT_NORMALIZE = (1.0/32768.0)


def get_rms(block):
    # RMS amplitude is defined as the square root of the
    # mean over time of the square of the amplitude.
    # so we need to convert this string of bytes into
    # a string of 16-bit samples...

    # we will get one short out for each
    # two chars in the string.
    count = len(block) // 2  # integer number of 16-bit samples
    fmt = "%dh" % count
    shorts = struct.unpack(fmt, block)

    # iterate over the block.
    sum_squares = 0.0
    for sample in shorts:
        # sample is a signed short in +/- 32768.
        # normalize it to 1.0
        n = sample * SHORT_NORMALIZE
        sum_squares += n*n

    return math.sqrt(sum_squares / count)


CHUNK = 32
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100

p = pyaudio.PyAudio()

stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

elapsed_time = 0
prev_detect_time = 0

while True:
    data = stream.read(CHUNK)
    amplitude = get_rms(data)
    if amplitude > 0.05:  # value set by observing graphed data captured from mic
        elapsed_time = time.perf_counter() - prev_detect_time
        if elapsed_time > 0.1:  # guard against multiple spikes at beat point
            print(elapsed_time)
            prev_detect_time = time.perf_counter()

def close_stream():
  stream.stop_stream()
  stream.close()
  p.terminate()

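The code works quite well in a silent environment, and I was pretty satisfied the first couple of times I ran it, but then I checked how accurate it was and was less satisfied. To test this I used two methods: a phone metronome set to 60bpm (emitting tic toc sounds into the microphone) and an Arduino hooked to a beeper, triggered at 1Hz by an accurate Chronodot RTC. The beeper beeps into the microphone, triggering a detection. With both methods the results look similar (numbers are the distance between two beat detections, in seconds):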

0.9956681643835616
1.0056331689497717
0.9956100091324198
1.0058207853881278
0.9953449497716891
1.0052103013698623
1.0049350136986295
0.9859074337899543
1.004996383561644
0.9954095342465745
1.0061518904109583
0.9953025753424658
1.0051235068493156
1.0057199634703196
0.984839305936072
1.00610396347032
0.9951862648401821
1.0053146301369864
0.9960100821917806
1.0053391780821919
0.9947373881278523
1.0058608219178105
1.0056580091324214
0.9852110319634697
1.0054473059360731
0.9950465753424638
1.0058237077625556
0.995704694063928
1.0054566575342463
0.9851026118721435
1.0059882374429243
1.0052523835616398
0.9956161461187207
1.0050863926940607
0.9955758173515932
1.0058052968036577
0.9953960913242028
1.0048014611872205
1.006336876712325
0.9847434520547935
1.0059712876712297

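Now I am pretty confident that at least the Arduino is accurate to 1 msec (the targeted accuracy). The results tend to be off by +- 5 msec, but sometimes even by 15 msec, which is unacceptable. Is there a way to achieve greater accuracy, or is this a limitation of python / the sound card / something else? Thank you!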

EDIT: After incorporating tom10's and barny's suggestions, the code looks like this:

import pyaudio
import struct
import math
import psutil
import os


def set_high_priority():
    p = psutil.Process(os.getpid())
    p.nice(psutil.HIGH_PRIORITY_CLASS)


SHORT_NORMALIZE = (1.0/32768.0)


def get_rms(block):
    # RMS amplitude is defined as the square root of the
    # mean over time of the square of the amplitude.
    # so we need to convert this string of bytes into
    # a string of 16-bit samples...

    # we will get one short out for each
    # two chars in the string.
    count = len(block) // 2  # integer number of 16-bit samples
    fmt = "%dh" % count
    shorts = struct.unpack(fmt, block)

    # iterate over the block.
    sum_squares = 0.0
    for sample in shorts:
        # sample is a signed short in +/- 32768.
        # normalize it to 1.0
        n = sample * SHORT_NORMALIZE
        sum_squares += n*n

    return math.sqrt(sum_squares / count)


CHUNK = 4096
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 44100
RUNTIME_SECONDS = 10

set_high_priority()

p = pyaudio.PyAudio()

stream = p.open(format=FORMAT,
                channels=CHANNELS,
                rate=RATE,
                input=True,
                frames_per_buffer=CHUNK)

elapsed_time = 0
prev_detect_time = 0
amplitudes = []  # RMS of every group, kept for later inspection (e.g. graphing)
TIME_PER_CHUNK = 1000 / RATE * CHUNK
SAMPLE_GROUP_SIZE = 32  # 1 sample = 2 bytes, group is closest to 1 msec elapsing
TIME_PER_GROUP = 1000 / RATE * SAMPLE_GROUP_SIZE

for i in range(0, int(RATE / CHUNK * RUNTIME_SECONDS)):
    data = stream.read(CHUNK)
    time_in_chunk = 0
    group_index = 0
    for j in range(0, len(data), (SAMPLE_GROUP_SIZE * 2)):
        group = data[j:(j + (SAMPLE_GROUP_SIZE * 2))]
        amplitude = get_rms(group)
        amplitudes.append(amplitude)
        if amplitude > 0.02:
            current_time = (elapsed_time + time_in_chunk)
            time_since_last_beat = current_time - prev_detect_time
            if time_since_last_beat > 500:
                print(time_since_last_beat)
                prev_detect_time = current_time
        time_in_chunk = (group_index+1) * TIME_PER_GROUP
        group_index += 1
    elapsed_time = (i+1) * TIME_PER_CHUNK

stream.stop_stream()
stream.close()
p.terminate()

With this code I got the following results (the unit is msec this time instead of seconds):

999.909297052154
999.9092970521542
999.9092970521542
999.9092970521542
999.9092970521542
1000.6349206349205
999.9092970521551
999.9092970521524
999.9092970521542
999.909297052156
999.9092970521542
999.9092970521542
999.9092970521524
999.9092970521542

Unless I have made a mistake, this looks a lot better than before and achieves sub-millisecond accuracy. Thanks to tom10 and barny for their help.
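The repeated values in these results appear to follow from how time is derived: every timestamp is a whole number of 32-sample groups, so an interval can only be a multiple of 32/44100 s (about 0.73 msec). A small check of that arithmetic, using the RATE and SAMPLE_GROUP_SIZE constants from the code above (the helper name `interval_ms` is made up for illustration):

```python
RATE = 44100             # samples per second, as in the code above
SAMPLE_GROUP_SIZE = 32   # samples per RMS group, as in the code above


def interval_ms(groups):
    """Duration in milliseconds of a whole number of sample groups."""
    return groups * SAMPLE_GROUP_SIZE / RATE * 1000


resolution_ms = interval_ms(1)
print("resolution (ms):", resolution_ms)

# 1378 and 1379 groups are the two counts closest to one second.
for groups in (1378, 1379):
    print(groups, "groups ->", interval_ms(groups), "ms")
```

The two group counts give roughly 999.909 and 1000.635 msec, which are the two values seen in the output. So the readings are consistent with sub-millisecond precision, but they cannot be finer than one group.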

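The reason you are not getting the right beat times is that you are missing chunks of audio data. That is, the sound card is reading the chunks, but you are not collecting each one before it is overwritten by the next.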

First, though, for this problem you need to distinguish between timing accuracy and real-time response.

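The timing accuracy of a sound card should be very good, much better than a ms, and you should be able to capture all of that accuracy in the data you read from the sound card. The real-time response of your computer's OS should be very bad, much worse than a ms. That is, you should easily be able to locate audio events (such as beats) to within a ms, but not identify them at the moment they happen (rather 30-200 ms later, depending on your system). This arrangement usually works for computers because general human perception of event timing is much coarser than a ms (except for rare specialized perceptual systems, such as comparing auditory events between the two ears).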
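One way to use the sound card's accuracy rather than the OS's response time is to derive each timestamp from the sample's position in the stream instead of from a wall clock. The following is only a sketch under stated assumptions: it runs on a synthetic signal rather than live microphone data, the function name `detect_onsets` and its parameters are made up for illustration, and the threshold and gap values are arbitrary.

```python
import math


def detect_onsets(samples, rate, group_size=32, threshold=0.02, min_gap_s=0.5):
    """Return onset times in seconds, derived from sample positions only.

    samples: sequence of floats normalized to [-1.0, 1.0].
    """
    onsets = []
    last_onset = None
    for start in range(0, len(samples) - group_size + 1, group_size):
        group = samples[start:start + group_size]
        rms = math.sqrt(sum(s * s for s in group) / group_size)
        if rms > threshold:
            t = start / rate  # time comes from the sample index, not a clock
            if last_onset is None or t - last_onset > min_gap_s:
                onsets.append(t)
                last_onset = t
    return onsets


# Synthetic test signal (an assumption, standing in for microphone data):
# 3 s of silence with a 10 ms, 1 kHz burst once per second.
RATE = 44100
signal = [0.0] * (3 * RATE)
for beat in range(3):
    begin = beat * RATE + 1000  # burst starts 1000 samples into each second
    for n in range(begin, begin + 441):
        signal[n] = 0.5 * math.sin(2 * math.pi * 1000 * (n - begin) / RATE)

onsets = detect_onsets(signal, RATE)
intervals = [b - a for a, b in zip(onsets, onsets[1:])]
print(intervals)
```

Because the timestamps depend only on sample counts, delays in when the program gets to run do not affect them, as long as no audio data is dropped.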

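The specific problem with your code is that CHUNK is too small for the OS to service the sound card for every chunk. It is set to 32, so at 44100Hz the OS needs to get to the sound card every 0.7 ms, which is too short for a computer that is busy with many other tasks. If your OS does not collect a chunk before the next one comes in, the original chunk is overwritten and lost.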
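The 0.7 ms figure is simply the chunk size divided by the sample rate. A quick calculation for a few chunk sizes (the helper name `chunk_period_ms` is made up for illustration):

```python
RATE = 44100  # samples per second


def chunk_period_ms(chunk, rate=RATE):
    """How often a buffer of `chunk` frames fills up, in milliseconds."""
    return chunk / rate * 1000


for chunk in (32, 1024, 4096):
    print(f"CHUNK={chunk:5d} -> one buffer every {chunk_period_ms(chunk):.2f} ms")
```

This gives roughly 0.7 ms, 23 ms and 93 ms respectively, so a larger chunk leaves the OS far more time to collect each buffer.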

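To get this working within the constraints above, make CHUNK much larger than 32, more like 1024 (as in the PyAudio examples). Depending on your computer and what else it is doing, even that may not be long enough.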

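If this type of approach will not work for you, you will probably need a dedicated real-time system like an Arduino. (Generally, though, this is not necessary, so think twice before deciding you need the Arduino. Usually, when I have seen people need true real-time, it is when they are trying to do something very quantitative and interactive with a human, such as flashing a light, having the person tap a button, flashing another light, having the person tap another button, and so on, to measure response times.)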