What algorithm is used for audio feature extraction in Google's AudioSet?

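I have started working with Google's AudioSet. Although the dataset is extensive, I find the information on audio feature extraction very vague. The website mentions: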

128-dimensional audio features extracted at 1Hz. The audio features were extracted using a VGG-inspired acoustic model described in Hershey et. al., trained on a preliminary version of YouTube-8M. The features were PCA-ed and quantized to be compatible with the audio features provided with YouTube-8M. They are stored as TensorFlow Record files.

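In the paper, the authors discuss using mel spectrograms on 960 ms chunks to obtain a 96x64 representation. It is then unclear to me how they get to the 1x128 representation used in AudioSet. Does anyone know more?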

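They use the 96*64 data as the input to a modified VGG network. The last layer of that VGG is FC-128, so its output is 1*128 for each 96*64 input patch. That is where the 1x128 representation comes from.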

The VGG architecture can be found here: https://github.com/tensorflow/models/blob/master/research/audioset/vggish_slim.py
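
To make the 96*64 to 1*128 step concrete, here is a rough shape-only trace of a VGG-style network. It is a sketch, not the real model: the layer names and sizes (four conv blocks with 64, 128, 256 and 512 channels, two FC-4096 layers, then FC-128) are assumptions based on my reading of the linked vggish_slim.py, and `trace_shapes` is a made-up helper, so please check the numbers against that file.

```python
# Shape-only trace of a VGG-style audio network: no weights, no audio, just arithmetic.
# ASSUMPTION: layer names and sizes follow my reading of vggish_slim.py; verify there.

def trace_shapes(height=96, width=64):
    """Return (layer name, output shape) pairs for one 96x64 log-mel patch."""
    shapes = [("input", (height, width, 1))]
    channels = 1
    # (number of 3x3 'same' convolutions, output channels) for each block;
    # every block is followed by a 2x2 max pool with stride 2.
    blocks = [(1, 64), (1, 128), (2, 256), (2, 512)]
    for i, (n_convs, out_channels) in enumerate(blocks, start=1):
        channels = out_channels  # 'same' padding keeps height and width unchanged
        shapes.append((f"conv{i} (x{n_convs})", (height, width, channels)))
        height, width = height // 2, width // 2  # pooling halves both axes
        shapes.append((f"pool{i}", (height, width, channels)))
    shapes.append(("flatten", (height * width * channels,)))
    shapes.append(("fc1_1", (4096,)))
    shapes.append(("fc1_2", (4096,)))
    shapes.append(("fc2 (embedding)", (128,)))  # the FC-128 layer mentioned above
    return shapes


if __name__ == "__main__":
    for name, shape in trace_shapes():
        print(f"{name:16s} {shape}")
```

Each 960 ms patch therefore yields one 128-dimensional vector, which gives roughly one vector per second. According to the website text quoted in the question, the released features are additionally PCA-ed and quantized after this step.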

audio

machine-learning

sound-recognition