In HuggingFace tokenizers: how can I split a sequence simply on spaces?
I am using HuggingFace's DistilBertTokenizer tokenizer.
I would like to tokenize my text by simply splitting it on spaces:
["Don't", "you", "love", "", "Transformers?", "We", "sure", "do."]
rather than the default behavior, which looks like this:
["Do", "n't", "you", "love", "", "Transformers", "?", "We", "sure", "do", "."]
I have read their documentation about Tokenization in general as well as about the BERT Tokenizer specifically, but could not find an answer to this simple question :(
I assume it should be a parameter when loading the tokenizer, but I can't find it in the list of parameters...
EDIT:
Minimal code example to reproduce:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('distilbert-base-cased')
tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
print("Tokens: ", tokens)
That is not how it works. The transformers library offers different kinds of tokenizers for Python strings. In the case of DistilBERT it is a WordPiece tokenizer with a fixed vocabulary that was used to train the corresponding model, and therefore it does not offer such a modification (as far as I know). Something you can do is use Python's split() method:
text = "Don't you love Transformers? We sure do."
tokens = text.split()
print("Tokens: ", tokens)
Output:
Tokens: ["Don't", 'you', 'love', '', 'Transformers?', 'We', 'sure', 'do.']
In case you are looking for a bit more sophisticated tokenization that also takes punctuation into account, you can use the basic_tokenizer:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
tokens = tokenizer.basic_tokenizer.tokenize(text)
print("Tokens: ", tokens)
Output:
Tokens:  ['Don', "'", 't', 'you', 'love', '🤗', 'Transformers', '?', 'We', 'sure', 'do', '.']
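For completeness, the standalone tokenizers library (which backs the "fast" tokenizers) exposes a whitespace-only pre-tokenizer; a small sketch, assuming the tokenizers package is installed:
from tokenizers.pre_tokenizers import WhitespaceSplit
pre_tok = WhitespaceSplit()
# Splits on whitespace only and keeps character offsets; punctuation stays attached.
print(pre_tok.pre_tokenize_str("Don't you love 🤗 Transformers? We sure do."))
Note that this only pre-tokenizes: to feed the model, the pieces still have to be mapped through its WordPiece vocabulary.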
EDIT: As pointed out in the comments, this is not what I was looking for.
Here is an idea that I tried:
from transformers import DistilBertModel, DistilBertTokenizer
import torch
text_str = "also du fängst an mit der Stadtrundfahrt"
# create DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-german-cased')
model = DistilBertModel.from_pretrained('distilbert-base-german-cased')
# check if tokens are correct
tokens = tokenizer.basic_tokenizer.tokenize(text_str)
print("Tokens: ", tokens)
# Encode the current, already-split text (encode() also adds the [CLS]/[SEP] special tokens)
input_ids = torch.tensor(tokenizer.encode(tokens)).unsqueeze(0)
outputs = model(input_ids)
last_hidden_states = outputs[0]
print(last_hidden_states.shape)  # (batch_size, sequence_length, hidden_size)
print(last_hidden_states[0, 1:-1].shape)  # hidden states of the actual tokens, without [CLS]/[SEP]
print(last_hidden_states)
The key is to first split the text into tokens with the BasicTokenizer (as suggested by @cronoik), and then use the already-tokenized text when encoding.
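A variant of the same idea, assuming a transformers version that supports the is_split_into_words argument: pass the pre-split words straight to the tokenizer, which then only applies WordPiece within each word:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-german-cased')
words = "also du fängst an mit der Stadtrundfahrt".split()
# The input is treated as already split into words; WordPiece is still applied
# inside each word, but no extra whitespace/punctuation splitting happens.
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
print(encoding["input_ids"])
The resulting encoding can then be passed to the model, e.g. model(**encoding).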