Trying to detect speech using VAD (Voice Activity Detector)

I am able to read the audio, but I get an error message when passing it to the VAD (Voice Activity Detector). I think the error message is because of the frame's type: when passing it to vad.is_speech(frame, sample_rate), should the frame be in bytes? Here is the code:

from collections import deque

frame_duration_ms = 10
duration_in_ms = (frame_duration_ms / 1000)   # frame duration in seconds (0.01)
frame_size = int(sample_rate * duration_in_ms)   # samples per frame (160 at 16 kHz)
frame_bytes = frame_size * 2   # 2 bytes per 16-bit sample, so 320 bytes per frame

def frame_generator(buffer, frame_bytes):
    # repeatedly store the next 320-byte slice in frame_stored while frame_bytes still fits in the buffer
    offset = 0
    while offset + frame_bytes < len(buffer):
        frame_stored = buffer[offset : offset + frame_bytes]
        offset = offset + frame_bytes
    return frame_stored
num_padding_frames = int(padding_duration_ms / frame_duration_ms)
# use deque for the sliding window
ring_buffer = deque(maxlen=num_padding_frames)
# we have two states: TRIGGERED and NOTTRIGGERED
triggered = False  # start in the NOTTRIGGERED state

frames = frame_generator(buffer, frame_bytes)

speech_frame = []
for frame in frames:
    is_speech = vad.is_speech(frame, sample_rate)
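
The snippet does not show how buffer, sample_rate and vad are created. For completeness, a minimal setup sketch along these lines is assumed (the file name and mode value are placeholders; webrtcvad expects 16-bit mono PCM at 8000, 16000, 32000 or 48000 Hz):

import wave
import webrtcvad

# assumed setup, not part of the original snippet
wf = wave.open("audio.wav", "rb")           # hypothetical input file
sample_rate = wf.getframerate()             # e.g. 16000
buffer = wf.readframes(wf.getnframes())     # raw 16-bit PCM bytes of the whole recording
wf.close()

vad = webrtcvad.Vad()
vad.set_mode(3)   # aggressiveness 0-3; 3 filters out the most non-speech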

Here is the error message:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
     16 speech_frame = []
     17 for frame in frames:
---> 18     is_speech = vad.is_speech(frame, sample_rate)
     19     #print(frames)

C:\Program Files\Python38\lib\site-packages\webrtcvad.py in is_speech(self, buf, sample_rate, length)
     20
     21     def is_speech(self, buf, sample_rate, length=None):
---> 22         length = length or int(len(buf) / 2)
     23         if length * 2 > len(buf):
     24             raise IndexError(

TypeError: object of type 'int' has no len()

I have solved it. As you know, vad.is_speech(buf=frame, sample_rate) takes buf and computes its length, but an integer value does not have the len() attribute in Python. That raises an error, for example:

num = 1
print(len(num))

Use this instead:

data = [1,2,3,4]
print(len(data))
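
The frame ended up being an int in the first place because the original frame_generator returned a single bytes object instead of a list of frames, and iterating over a bytes object in Python 3 yields individual byte values as ints:

buf = b"\x01\x02\x03\x04"   # one frame of raw audio bytes
for b in buf:
    print(type(b))          # <class 'int'>, not a bytes slice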

So here is the corrected code:

frame_duration_ms = 10
duration_in_ms = (frame_duration_ms / 1000)   # frame duration in seconds (0.01)
frame_size = int(sample_rate * duration_in_ms)   # samples per frame (160 at 16 kHz)
frame_bytes = frame_size * 2   # 2 bytes per 16-bit sample, so 320 bytes per frame

values = []

def frame_generator(buffer, frame_bytes):
    # collect every 320-byte slice into the values list while frame_bytes still fits in the buffer
    offset = 0
    while offset + frame_bytes < len(buffer):
        frame_stored = buffer[offset : offset + frame_bytes]
        offset = offset + frame_bytes
        values.append(frame_stored)
    return values
num_padding_frames = int(padding_duration_ms / frame_duration_ms)
# use deque for the sliding window
ring_buffer = deque(maxlen=num_padding_frames)
# we have two states: TRIGGERED and NOTTRIGGERED
triggered = False  # start in the NOTTRIGGERED state

frames = frame_generator(buffer, frame_bytes)

speech_frame = []
for frame in frames:
    is_speech = vad.is_speech(frame, sample_rate)
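
As a side note, the same slicing can also be written as a generator that yields one bytes frame at a time, so the whole values list does not have to be kept in memory; a minimal sketch, assuming buffer, frame_bytes, vad and sample_rate are set up as above (webrtcvad only accepts frames of 10, 20 or 30 ms):

def frame_generator(buffer, frame_bytes):
    # yield successive frame_bytes-sized slices of the raw audio buffer
    offset = 0
    while offset + frame_bytes <= len(buffer):
        yield buffer[offset : offset + frame_bytes]
        offset += frame_bytes

for frame in frame_generator(buffer, frame_bytes):
    is_speech = vad.is_speech(frame, sample_rate)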