How to find timestamps of a specific sound in .wav file?
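I have a .wav file in which I recorded my own voice, talking for several minutes. Let's say I want to find the exact timestamps at which I said "Mike" in the audio. I looked into speech recognition and ran some tests with the Google Speech API, but the timestamps I got back were far from accurate.
As an alternative, I recorded a very short .wav file in which I just said "Mike". I am trying to compare these two .wav files and find every timestamp at which "Mike" was said in the longer .wav file. I came across SleuthEye's amazing answer.
This code works perfectly well for finding a single timestamp, but I couldn't figure out how to find multiple start/end times: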
import numpy as np
import sys
from scipy.io import wavfile
from scipy import signal

snippet = sys.argv[1]
source = sys.argv[2]

# read the sample to look for
rate_snippet, snippet = wavfile.read(snippet)
snippet = np.array(snippet, dtype='float')

# read the source
rate, source = wavfile.read(source)
source = np.array(source, dtype='float')

# resample such that both signals are at the same sampling rate (if required)
if rate != rate_snippet:
    num = int(np.round(rate*len(snippet)/rate_snippet))
    snippet = signal.resample(snippet, num)

# compute the cross-correlation
z = signal.correlate(source, snippet)

# the largest peak marks the end of the best match in the 'full' correlation
peak = np.argmax(np.abs(z))
start = (peak-len(snippet)+1)/rate
end = peak/rate
print("start {} end {}".format(start, end))
You're almost there. You can use scipy.signal.find_peaks. For example:
import numpy as np
from scipy.io import wavfile
from scipy import signal
import matplotlib.pyplot as plt

snippet = 'snippet.wav'
source = 'source.wav'

# read the sample to look for (first channel only, so this assumes a stereo file)
rate_snippet, snippet = wavfile.read(snippet)
snippet = np.array(snippet[:,0], dtype='float')

# read the source (first channel only)
rate, source = wavfile.read(source)
source = np.array(source[:,0], dtype='float')

# resample such that both signals are at the same sampling rate (if required)
if rate != rate_snippet:
    num = int(np.round(rate*len(snippet)/rate_snippet))
    snippet = signal.resample(snippet, num)
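One caveat with snippet[:,0]: it assumes a multi-channel file. wavfile.read returns a 1-D array for mono recordings, and indexing that with [:,0] raises an IndexError. Here is a minimal sketch of a helper that accepts both layouts. The name to_mono_float is hypothetical (it is not part of SciPy), and averaging the channels is an assumption; keeping only the first channel, as above, works too.

```python
import numpy as np

def to_mono_float(samples):
    """Return a 1-D float64 array from mono (n,) or multi-channel (n, channels) samples."""
    samples = np.asarray(samples, dtype=np.float64)
    if samples.ndim == 2:
        # average the channels instead of keeping only the first one
        samples = samples.mean(axis=1)
    return samples

# small stand-ins for the arrays returned by wavfile.read
mono = to_mono_float(np.array([1, 2, 3], dtype=np.int16))
stereo = to_mono_float(np.array([[1, 3], [2, 4]], dtype=np.int16))
print(mono.shape, stereo.shape)
```

With this in place, the two reads could become snippet = to_mono_float(snippet) and source = to_mono_float(source).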
My source and snippet:
# after the optional resampling above, the snippet is sampled at `rate`
x_snippet = np.arange(0, snippet.size) / rate
plt.figure()
plt.plot(x_snippet, snippet)
plt.xlabel('seconds')
plt.title('snippet')

x_source = np.arange(0, source.size) / rate
plt.figure()
plt.plot(x_source, source)
plt.xlabel('seconds')
plt.title('source')
plt.show()
Now we compute the correlation:
# compute the cross-correlation
z = signal.correlate(source, snippet, mode='same')
I used mode='same', so source and z have the same length:
source.size == z.size
True
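To see where a match lands in z with mode='same', here is a small synthetic check (random noise standing in for audio, so all the data below is made up for illustration). It embeds a snippet at a known position and confirms that the peak sits at the center of the match, (snippet.size - 1) // 2 samples after its start.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
snippet = rng.standard_normal(101)   # odd length, so the center is a whole sample
source = np.zeros(1000)
true_start = 300
source[true_start:true_start + snippet.size] = snippet

z = signal.correlate(source, snippet, mode='same')
peak = int(np.argmax(z))

# with mode='same' the peak is offset from the start by (snippet.size - 1) // 2
start = peak - (snippet.size - 1) // 2
print(peak, start)
```

For an even-length snippet the true center falls between two samples, so the peak is off by half a sample from the center.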
Now we can define a minimum peak height, for example:
x_z = np.arange(0, z.size) / rate
plt.plot(x_z, z)
plt.axhline(2e20, color='r')
plt.title('correlation')
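and find the peaks that are at least a minimum distance apart (you will probably have to choose your own height and distance depending on your samples):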
peaks = signal.find_peaks(
    z,
    height=2e20,
    distance=50000
)
peaks
(array([ 117390, 225754, 334405, 449319, 512001, 593854, 750686,
873026, 942586, 1064083]),
{'peak_heights': array([8.73666562e+20, 9.32871542e+20, 7.23883305e+20, 9.30772354e+20,
4.32924341e+20, 9.18323020e+20, 1.12473608e+21, 1.07752019e+21,
1.12455724e+21, 1.05061734e+21])})
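The hard-coded height=2e20 and distance=50000 only fit this particular recording. One possible way to make them less file-specific, sketched below on synthetic data, is to express height as a fraction of the largest correlation value and to use the snippet length as distance, so at most one match is reported per snippet length. The 0.5 fraction is an assumption. Real speech correlates far less cleanly than exact copies, so expect to tune it.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(1)
snippet = rng.standard_normal(201)
source = 0.05 * rng.standard_normal(5000)   # low-level background noise
true_starts = [500, 2000, 3500]
for s in true_starts:
    source[s:s + snippet.size] += snippet   # embed three copies of the snippet

z = signal.correlate(source, snippet, mode='same')

peaks_idxs, props = signal.find_peaks(
    z,
    height=0.5 * z.max(),     # relative threshold (assumed fraction)
    distance=snippet.size,    # at most one match per snippet length
)

# convert peak centers back to start samples
starts = peaks_idxs - (snippet.size - 1) // 2
print(starts)
```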
We take the peak indices:
peaks_idxs = peaks[0]
plt.plot(x_z, z)
plt.plot(x_z[peaks_idxs], z[peaks_idxs], 'or')
Since they are "almost" in the middle of each occurrence of the snippet, we can do:
fig, ax = plt.subplots(figsize=(12, 5))
plt.plot(x_source, source)
plt.xlabel('seconds')
plt.title('source signal and correlation')
for i, peak_idx in enumerate(peaks_idxs):
    # the peak marks the center of the match, so step half a snippet either way
    start = (peak_idx - snippet.size / 2) / rate
    center = peak_idx / rate
    end = (peak_idx + snippet.size / 2) / rate
    plt.axvline(start, color='g')
    plt.axvline(center, color='y')
    plt.axvline(end, color='r')
    print(f"peak {i}: start {start:.2f} end {end:.2f}")
peak 0: start 2.34 end 2.98
peak 1: start 4.80 end 5.44
peak 2: start 7.27 end 7.90
peak 3: start 9.87 end 10.51
peak 4: start 11.29 end 11.93
peak 5: start 13.15 end 13.78
peak 6: start 16.71 end 17.34
peak 7: start 19.48 end 20.11
peak 8: start 21.06 end 21.69
peak 9: start 23.81 end 24.45
But maybe there is a better way to define the start and end more precisely.
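One option that might help, sketched here on synthetic data rather than real speech: with mode='valid' the correlation is computed only where the snippet fits entirely inside the source, so index k of z corresponds directly to the snippet placed at source[k:k+snippet.size]. The peak index is then the start sample itself, and the end is the start plus the snippet length, with no half-length guess. The sampling rate, the noise level and the 0.5 height fraction below are all assumptions.

```python
import numpy as np
from scipy import signal

rate = 8000  # assumed sampling rate for the synthetic data
rng = np.random.default_rng(2)
snippet = rng.standard_normal(400)
source = 0.05 * rng.standard_normal(16000)
for s in (1600, 8000):
    source[s:s + snippet.size] += snippet   # embed two copies of the snippet

# mode='valid': z[k] is the correlation of snippet with source[k:k+snippet.size]
z = signal.correlate(source, snippet, mode='valid')
starts, _ = signal.find_peaks(z, height=0.5 * z.max(), distance=snippet.size)

for i, k in enumerate(starts):
    start = k / rate
    end = (k + snippet.size) / rate
    print(f"peak {i}: start {start:.2f} end {end:.2f}")
```

This only pins down where the snippet aligns best. How well that matches the spoken word's boundaries still depends on how tightly the snippet recording was trimmed.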