Speeding up reading content from a DataFrame in pandas

Suppose we have a table with the following dimensions:

print(metadata.shape)  # (8732, 8)

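Suppose that for each row we want to read slice_file_name (and then load the corresponding sound file from the drive) and extract its mel-frequency cepstral coefficients (MFCCs):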

import librosa
import numpy as np

def feature_extractor(file_name):
    audio, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    # Average over the time frames to get one 40-element vector per file
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features

If I use the following loop:

import os
from tqdm import tqdm

extracted_features = []
for index_num, row in tqdm(metadata.iterrows()):
    file_name = os.path.join(os.path.abspath(Base_Directory), str(row["slice_file_name"]))
    final_class_labels = row["class"]
    data = feature_extractor(file_name)
    extracted_features.append([data, final_class_labels])

it takes the following amount of time in total:

3555it [21:15,  2.79it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1323
  n_fft, y.shape[-1]
8326it [48:40,  3.47it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1103
  n_fft, y.shape[-1]
8329it [48:41,  3.89it/s]/usr/local/lib/python3.7/dist-packages/librosa/core/spectrum.py:224: UserWarning: n_fft=2048 is too small for input signal of length=1523
  n_fft, y.shape[-1]
8732it [50:53,  2.86it/s]

How can I optimize this code so that it finishes in less time? Is that even possible?

You can try running the feature extractor in parallel, which gives you a new column, mfccs_scaled_features, in your dataframe:

import os
import librosa
import numpy as np
from pandarallel import pandarallel

pandarallel.initialize()

PATH = os.path.abspath(Base_Directory)

def feature_extractor(file_name):
    # On Windows, you may need to repeat these imports inside the function:
    # import librosa 
    # import numpy as np
    # import os
    
    file_name = os.path.join(PATH, file_name)
    audio,sample_rate = librosa.load(file_name, res_type='kaiser_fast')
    mfccs_features = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=40)
    mfccs_scaled_features = np.mean(mfccs_features.T, axis=0)
    return mfccs_scaled_features

# df is your DataFrame (called metadata in the question)
df['mfccs_scaled_features'] = df['slice_file_name'].parallel_apply(feature_extractor)
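
If you would rather not add a dependency, the same idea can be sketched with the standard library's concurrent.futures. This is only a minimal sketch, not a drop-in replacement for pandarallel: extract_in_parallel, its parameters, and the chunksize default are hypothetical names and values chosen for illustration. It also assumes the extractor is a top-level function that can be pickled, which is what a process pool requires.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def extract_in_parallel(file_names, extractor, max_workers=None,
                        executor_cls=ProcessPoolExecutor, chunksize=16):
    """Apply `extractor` to every file name using a pool of workers.

    Hypothetical helper for illustration. Results are returned in the
    same order as `file_names`, so they can be assigned straight back
    to a DataFrame column.
    """
    file_names = list(file_names)
    # ProcessPoolExecutor suits CPU-bound work such as MFCC extraction;
    # ThreadPoolExecutor can be passed instead for I/O-bound work
    # (it accepts chunksize but ignores it).
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(extractor, file_names, chunksize=chunksize))
```

With the feature_extractor defined above, the call would look like df['mfccs_scaled_features'] = extract_in_parallel(df['slice_file_name'], feature_extractor). When using processes from a script, put that call under an if __name__ == "__main__": guard, which is needed on Windows and on other platforms that start workers with spawn. How much faster this is depends on the number of cores and on how quickly the drive can serve the audio files.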