Extracting Instrument Qualities From Audio Signal

I am looking for a function that takes an audio signal (assumed to contain a single instrument playing) and extracts instrument-like characteristics from the audio into a vector space. In theory, if I have two signals with similar-sounding instruments (e.g. two pianos), their respective vectors should be very similar (by euclidean distance / cosine similarity / etc.). How can this be done?
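For the comparison step mentioned above, a minimal sketch of cosine similarity between feature vectors (the 3-dimensional toy vectors below are made up purely for illustration, not real features):

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between two feature vectors;
    # 1.0 means identical direction, 0.0 means orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical feature vectors for two pianos and a flute:
piano_a = np.array([0.9, 0.1, 0.4])
piano_b = np.array([0.85, 0.15, 0.38])
flute   = np.array([0.1, 0.9, 0.2])

# the two pianos should be closer to each other than to the flute:
print(cosine_similarity(piano_a, piano_b))  # close to 1.0
print(cosine_similarity(piano_a, flute))    # noticeably lower
```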

What I've tried: I am currently extracting (and temporally averaging) the chroma energy, spectral contrast, MFCCs (with their first and second derivatives), and the mel spectrogram, and concatenating them into a single representation vector:

import librosa
import torch
import torchaudio

# expects a torch.Tensor (dimensions: [1, num_samples],
# as returned by torchaudio.load()).

# assume all signals contain a constant number of samples and are sampled at 44.1 kHz
def extract_instrument_features(signal, sr):
  # define hyperparameters:
  FRAME_LENGTH = 1024
  HOP_LENGTH = 512

  # librosa expects a 1-D numpy array:
  signal_np = signal.squeeze(0).numpy()

  # compute and perform temporal averaging of the chroma energy:
  ce = torch.Tensor(librosa.feature.chroma_cens(y=signal_np, sr=sr, hop_length=HOP_LENGTH))
  ce = torch.mean(ce, axis=1)

  # compute and perform temporal averaging of the spectral contrast:
  spc = torch.Tensor(librosa.feature.spectral_contrast(y=signal_np, sr=sr, n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH))
  spc = torch.mean(spc, axis=1)

  # extract MFCC and its first & second derivatives
  # (librosa.feature.delta operates on numpy arrays, so convert afterwards):
  mfcc_np = librosa.feature.mfcc(y=signal_np, sr=sr, n_mfcc=13, hop_length=HOP_LENGTH)
  mfcc = torch.Tensor(mfcc_np)
  mfcc_1st = torch.Tensor(librosa.feature.delta(mfcc_np))
  mfcc_2nd = torch.Tensor(librosa.feature.delta(mfcc_np, order=2))

  # temporal averaging of MFCCs:
  mfcc = torch.mean(mfcc, axis=1)
  mfcc_1st = torch.mean(mfcc_1st, axis=1)
  mfcc_2nd = torch.mean(mfcc_2nd, axis=1)

  # define the mel spectrogram transform:
  mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr,
    n_fft=FRAME_LENGTH,
    hop_length=HOP_LENGTH,
    n_mels=64
  )

  # extract the mel spectrogram ([1, n_mels, num_frames])
  # and average over the time axis:
  ms = mel_spectrogram(signal)
  ms = torch.mean(ms, axis=2)[0]

  # concatenate and return the feature vector:
  features = [ce, spc, mfcc, mfcc_1st, mfcc_2nd, ms]
  return torch.cat(features)

The part of an instrument's sound that makes it distinctive, independent of the pitch being played, is called its timbre. The modern way to obtain a vector representation of it is to train a neural network; such a learned vector representation is usually called an audio embedding.
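As a minimal sketch of the idea (layer sizes and names below are hypothetical, not taken from any particular paper), an embedding network maps frame-level features to a fixed-size, L2-normalized vector; in a real system the weights would be trained with e.g. a contrastive or triplet loss so that clips of the same instrument land close together:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical dimensions: 64 mel bands in, 16-dim embedding out;
# in practice W1/W2 would be learned, here they are random for illustration
W1 = rng.standard_normal((64, 32)) * 0.1
W2 = rng.standard_normal((32, 16)) * 0.1

def embed(mel_frames):
    """Map a [num_frames, 64] mel feature matrix to a 16-dim embedding."""
    pooled = mel_frames.mean(axis=0)   # temporal mean pooling -> [64]
    h = np.maximum(pooled @ W1, 0.0)   # ReLU hidden layer -> [32]
    z = h @ W2                         # raw embedding -> [16]
    return z / np.linalg.norm(z)       # L2-normalize to unit length

clip = rng.random((100, 64))           # stand-in for a mel spectrogram
e = embed(clip)
print(e.shape)  # (16,)
```

Unit-normalizing the output makes cosine similarity between two embeddings a simple dot product.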

An example implementation is described in Learning Disentangled Representations Of Timbre And Pitch For Musical Instrument Sounds Using Gaussian Mixture Variational Autoencoders (2019).