Tokenizers: change vocabulary entries
I have some text that I want to perform NLP on. To do that, I downloaded a pre-trained tokenizer like so:
import transformers as ts
pr_tokenizer = ts.AutoTokenizer.from_pretrained('distilbert-base-uncased', cache_dir='tmp')
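(As a quick sanity check, not part of the original question, you can confirm the tokenizer loaded correctly; the sample sentence below is just an illustration.)
# quick sanity check (illustrative): the pretrained vocabulary size and a sample tokenization
print(pr_tokenizer.vocab_size)               # 30522 for distilbert-base-uncased
print(pr_tokenizer.tokenize("Hello world"))  # ['hello', 'world']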
Then I create my own tokenizer with my data like this:
from tokenizers import Tokenizer
from tokenizers.models import BPE
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train(['transcripts.raw'], trainer)
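A small illustrative check (not in the original question) of what the freshly trained tokenizer produces, assuming transcripts.raw exists and training succeeded; the sample string is arbitrary:
# illustrative check of the freshly trained tokenizer
print(tokenizer.get_vocab_size())                    # number of learned tokens
output = tokenizer.encode("some text from my data")  # any sample string
print(output.tokens)                                 # produced subword tokens
print(output.ids)                                    # their ids in the new vocabulary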
Now comes the part that confuses me... I need to update the entries in the pretrained tokenizer (pr_tokenizer) whose keys are the same as the keys in my tokenizer (tokenizer). I have tried several approaches; here is one of them:
new_vocab = pr_tokenizer.vocab
v = tokenizer.get_vocab()
for i in v:
    if i in new_vocab:
        new_vocab[i] = v[i]
So what do I do now? I was thinking of:
pr_tokenizer.vocab.update(new_vocab)
or
pr_tokenizer.vocab = new_vocab
but neither works. Does anyone know a good way to do this?
If you can find the distilbert folder on your computer, you will see that the vocabulary is basically just a txt file containing a single column. You can do whatever you want with it.
# download the model by running this line in your terminal (or main CLI)
# git clone https://huggingface.co/distilbert-base-uncased
import os
path = "C:/Users/...../distilbert-base-uncased"
print(os.listdir(path))
# ['.git',
# '.gitattributes',
# 'config.json',
# 'flax_model.msgpack',
# 'pytorch_model.bin',
# 'README.md', 'rust_model.ot',
# 'tf_model.h5',
# 'tokenizer.json',
# 'tokenizer_config.json',
# 'vocab.txt']
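For illustration (not in the original answer), peeking at the first few lines of vocab.txt shows the single-column format, where the line number is the token id:
# peek at the first lines of vocab.txt: one token per line, line number = token id
with open('./distilbert-base-uncased/vocab.txt', 'r', encoding='utf-8') as f:
    for i, row in zip(range(5), f):
        print(i, row.rstrip('\n'))
# 0 [PAD]
# 1 [unused0]
# ...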
To do this, you just download the tokenizer files from GitHub or the HuggingFace website into the same folder as your code, then edit the vocabulary before loading the tokenizer:
new_vocab = {}
# read the original vocabulary: one token per line, the line number is the token id
for i, row in enumerate(open('./distilbert-base-uncased/vocab.txt', 'r')):
    new_vocab[row[:-1]] = i
# your vocabulary entries
v = tokenizer.get_vocab()
# replace the ids of common tokens (your code)
for i in v:
    if i in new_vocab:
        new_vocab[i] = v[i]
# write the modified vocabulary back to vocab.txt so the tokenizer picks it up
with open('./distilbert-base-uncased/vocab.txt', 'w') as f:
    # invert the mapping: id -> token
    rev_vocab = {idx: tok for tok, idx in new_vocab.items()}
    # write the tokens back to the file in id order
    for i in range(len(rev_vocab)):
        if i not in rev_vocab:
            continue
        f.write(rev_vocab[i] + '\n')
# loading the new tokenizer
pr_tokenizer = ts.AutoTokenizer.from_pretrained('./distilbert-base-uncased')
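A sketch (not part of the original answer) of how you could verify that the swap took effect; some_token is a placeholder for any token present in both vocabularies:
# verification sketch: a token shared by both vocabularies should now map to
# the id it has in your trained tokenizer
some_token = "the"  # placeholder: pick any token present in both vocabularies
print(tokenizer.get_vocab().get(some_token))           # id in your trained tokenizer
print(pr_tokenizer.convert_tokens_to_ids(some_token))  # id in the reloaded tokenizer
Note that if the folder still contains tokenizer.json, AutoTokenizer may build the fast tokenizer from that file instead of the edited vocab.txt; passing use_fast=False (or removing/regenerating tokenizer.json) may be needed for the change to show up.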