Extracting embedding values of NLP pretrained models from tokenized strings

I am using a huggingface pipeline to extract word embeddings from sentences. As far as I know, a sentence is first turned into a tokenized string, and the length of the tokenized string is likely not equal to the number of words in the original sentence. I need to retrieve the word embedding for a specific word of a given sentence.

For example, here is my code:

#https://discuss.huggingface.co/t/extracting-token-embeddings-from-pretrained-language-models/6834/6

from transformers import pipeline, AutoTokenizer, AutoModel
import numpy as np
import re

model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

tokenizer.add_special_tokens({'pad_token': '[PAD]'})
model.resize_token_embeddings(len(tokenizer))

# pass the resized model object rather than the model name, so the added pad token is actually used
model_pipeline = pipeline('feature-extraction', model=model, tokenizer=tokenizer)

def find_wordNo_sentence(word, sentence):
    # return the 0-based position of `word` in the whitespace-split sentence
    print(sentence)
    splitted_sen = sentence.split(" ")
    print(splitted_sen)
    for i, w in enumerate(splitted_sen):
        if word == w:
            return i
    print("not found")


def return_xlnet_embedding(word, sentence):
    # strip punctuation and collapse repeated whitespace in both word and sentence
    word = re.sub(r'[^\w]', ' ', word)
    word = " ".join(word.split())

    sentence = re.sub(r'[^\w]', ' ', sentence)
    sentence = " ".join(sentence.split())

    id_word = find_wordNo_sentence(word, sentence)

    try:
        data = model_pipeline(sentence)

        n_words = len(sentence.split(" "))
        n_embs = len(data[0])
        print(n_embs, n_words)
        print(len(data[0]))

        # n_embs != n_words means the tokenizer produced extra entries
        # (subword pieces and/or special tokens)

        results = data[0][id_word]
        return np.array(results)

    except Exception:
        # id_word is None when the word is not in the sentence
        return "word not found"

return_xlnet_embedding('your', "what is your name?")

Then the output is:

what is your name
['what', 'is', 'your', 'name']
6 4
6

So the tokenized string fed into the pipeline is two entries longer than my number of words. How can I find out which of these 6 values is the embedding of my word?

As you know, huggingface tokenizers work with frequent subwords as well as complete words. So if you want to extract the embedding of some word, you should keep in mind that it may be represented by more than one vector! In addition, the huggingface pipeline encodes the input sentence as a first step, and this adds special tokens at the beginning and end of the actual sentence. For example, with a BERT-style WordPiece tokenizer:

# the original answer used its own pipeline's tokenizer here; a BERT
# WordPiece tokenizer reproduces the output shown below
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
string = 'This is a test for clarification'
print(bert_tokenizer.tokenize(string))
print(bert_tokenizer.encode(string))

The output is:

['this', 'is', 'a', 'test', 'for', 'cl', '##ari', '##fication']

[101, 2023, 2003, 1037, 3231, 2005, 18856, 8486, 10803, 102]
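
Here the '##' prefix marks WordPiece continuation pieces ('clarification' becomes 'cl', '##ari', '##fication'), and ids 101 and 102 are the [CLS] and [SEP] tokens wrapped around the sentence. xlnet-base-cased uses a SentencePiece tokenizer instead and appends its <sep> and <cls> special tokens at the end of the sequence, which accounts for the two extra positions in your "6 4" output.

Rather than counting positions by hand, you can let a fast tokenizer do the word-to-token alignment for you. The following is a minimal sketch (not from the original post) built on BatchEncoding.word_ids(), which is only available on fast tokenizers; word_embedding is an illustrative helper name, and averaging the subword vectors is just one common pooling choice:

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "xlnet-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModel.from_pretrained(model_name)
model.eval()

def word_embedding(word_index, sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)
    # word_ids() maps each token position to the word it came from;
    # special tokens such as <sep>/<cls> map to None and are skipped
    positions = [i for i, w in enumerate(enc.word_ids()) if w == word_index]
    # a word split into several subword pieces owns several vectors;
    # averaging is one common way to pool them into a single word vector
    return hidden[positions].mean(dim=0)

sentence = "what is your name?"
vec = word_embedding(sentence.split().index("your"), sentence)
print(vec.shape)  # torch.Size([768]) for xlnet-base-cased

For plain whitespace-separated text the word numbering from word_ids() lines up with sentence.split(), but tokenizers that treat punctuation as separate words can shift the indices, so inspecting enc.word_ids() directly is the safer check.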