Getting embeddings from wav2vec2 models in HuggingFace

Question

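I am trying to get embeddings from pre-trained wav2vec2 models (e.g., jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.

My aim is to use these features for a downstream task (not specifically speech recognition). Namely, since the dataset is relatively small, I would train an SVM on these embeddings for the final classification.

This is what I have tried so far: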

from transformers import Wav2Vec2Processor, Wav2Vec2Model

model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

input_values = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
                                 feature_size=1, sampling_rate=16000).input_values

Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_states:

hidden_states = model(input_values).last_hidden_state

or to the sequence of features from the last conv layer of the model:

features_last_cnn_layer = model(input_values).extract_features

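Also, is this the correct way to extract features from a pre-trained model?

How can one get embeddings from a specific layer?

P.S.: Posting here because HuggingFace's forum does not seem very active.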

Answer 1

Just check the documentation:

last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.

extract_features (torch.FloatTensor of shape (batch_size, sequence_length, conv_dim[-1])) – Sequence of extracted feature vectors of the last convolutional layer of the model.

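The last_hidden_state vectors are the so-called contextualized embeddings (i.e., every feature (CNN output) has a vector representation that is to some extent influenced by the other tokens of the sequence).
The extract_features vectors are the embeddings of your input (after the CNN).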
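
To make the sequence_length dimension in both shapes concrete, here is a small sketch that computes how many frames the CNN feature encoder emits for a given number of raw audio samples. It assumes the default wav2vec2 convolution settings (kernels 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, no padding); the values that actually apply are in model.config.conv_kernel and model.config.conv_stride. The function name is made up for illustration.

```python
# Assumed defaults of the wav2vec2 feature encoder; check model.config
# (conv_kernel, conv_stride) for the checkpoint you actually load.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Length of the frame sequence produced from num_samples raw samples."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


one_second = num_frames(16000)  # one second of 16 kHz audio
print(one_second)               # 49, i.e. roughly one frame every 20 ms
```

Under these assumptions, both last_hidden_state and extract_features have about 49 vectors per second of audio; they differ in the size and meaning of each vector, not in the number of frames.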

Also, is this the correct way to extract features from a pre-trained model?
Yes.

How can one get embeddings from a specific layer? Set output_hidden_states=True:

o = model(input_values,output_hidden_states=True)
o.keys()

Output:

odict_keys(['last_hidden_state', 'extract_features', 'hidden_states'])

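The hidden_states value is a tuple: its first entry is the embedding output that feeds the transformer, and each further entry is the contextualized output of one attention (transformer) layer, so hidden_states[k] holds the embeddings after layer k.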
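
For the SVM use case in the question, each utterance has to end up as one fixed-size vector. A common option is to average the frames of the chosen layer over time. The following is only a minimal sketch: it assumes the layer has already been converted to a NumPy array (for example with o.hidden_states[k].detach().numpy()) and that a 0/1 mask marking the real, non-padded frames is available. The function name and the random stand-in data are hypothetical.

```python
import numpy as np


def mean_pool(layer_output, frame_mask=None):
    """Average frame embeddings over time, ignoring padded frames.

    layer_output: array of shape (batch, frames, hidden)
    frame_mask:   optional array of shape (batch, frames), 1 = real frame
    """
    if frame_mask is None:
        return layer_output.mean(axis=1)
    mask = frame_mask[:, :, None].astype(layer_output.dtype)
    summed = (layer_output * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1, None)  # avoid division by zero
    return summed / counts


# Stand-in for one entry of o.hidden_states: 2 utterances, 5 frames, 4 dims.
rng = np.random.default_rng(0)
layer_output = rng.normal(size=(2, 5, 4))
frame_mask = np.array([[1, 1, 1, 1, 1],
                       [1, 1, 1, 0, 0]])  # second utterance is padded

utterance_vectors = mean_pool(layer_output, frame_mask)
print(utterance_vectors.shape)  # (2, 4)
```

Each row of utterance_vectors can then serve as the feature vector of one utterance for the classifier. Which layer works best is task dependent, so it is worth comparing several values of k.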

P.S.: The jonatasgrosman/wav2vec2-large-xlsr-53-german model was trained with feat_extract_norm=="layer". This means you should also pass the attention mask to the model:

model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

# Keep the whole processor output so attention_mask is passed along with input_values
i = feature_extractor(train_dataset[:10]["speech"], return_tensors="pt", padding=True,
                      feature_size=1, sampling_rate=16000)
model(**i)
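
The attention mask returned by the processor is defined per audio sample, while the model outputs are defined per frame. If you need to know which output frames are padding (for example before averaging them), one possible sketch is to push the unpadded lengths through the same length formula as the CNN. The kernel and stride values below are the assumed wav2vec2 defaults, the helper name is hypothetical, and a NumPy array stands in for the attention_mask entry of i.

```python
import numpy as np

# Assumed default feature-encoder settings; see model.config for real values.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def frame_mask_from_sample_mask(sample_mask):
    """Turn a (batch, samples) 0/1 mask into a (batch, frames) 0/1 mask."""
    def conv_lengths(lengths):
        # Output length of each unpadded 1-D convolution, applied in order.
        for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
            lengths = (lengths - kernel) // stride + 1
        return lengths

    valid_frames = conv_lengths(sample_mask.sum(axis=1))    # per utterance
    total_frames = int(conv_lengths(sample_mask.shape[1]))  # padded length
    positions = np.arange(total_frames)[None, :]
    return (positions < valid_frames[:, None]).astype(int)


# Two utterances padded to 1 s at 16 kHz; the second one is only 0.5 s long.
sample_mask = np.zeros((2, 16000), dtype=int)
sample_mask[0, :] = 1
sample_mask[1, :8000] = 1

frame_mask = frame_mask_from_sample_mask(sample_mask)
print(frame_mask.shape)        # (2, 49)
print(frame_mask.sum(axis=1))  # [49 24]
```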

python

pre-trained-model

huggingface-transformers