Calculating the percentage of dataset words present in a TensorFlow Hub model
I want to calculate the percentage of my dataset's words that are present in a TensorFlow Hub model's vocabulary (e.g. ELMo or the Universal Sentence Encoder). For a local model such as GloVe, I used a naive approach: read the local model, put its words into a set, and compute the percentage:
f = open('../glove.6B.100d.txt', encoding="utf8")
# Read all the words into a set; in GloVe files each line starts
# with the word, followed by its vector components.
glove_words = set(line.split(maxsplit=1)[0] for line in f)
f.close()
intersect_words = set(dataset_words).intersection(glove_words)
percentage = len(intersect_words) / len(dataset_words) * 100
Is there a similar approach for TensorFlow Hub models?
For some models, the vocabulary is serialized inside the SavedModel protocol buffer (as with USE and ELMo), so it has to be located in the SavedModel and extracted manually (I took the logic for extracting the vocabulary from USE from here):
import tensorflow_hub as hub
from tensorflow.python.saved_model.loader_impl import parse_saved_model
# This caches the model at `model_path`.
hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
model_path = '/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/'
saved_model = parse_saved_model(model_path)
# The location of the tensor holding the vocab is model-specific.
graph = saved_model.meta_graphs[0].graph_def
function_ = graph.library.function
embedding_node = function_[5].node_def[1] # Node name is "Embedding_words".
words_tensor = embedding_node.attr.get("value").tensor
word_list = [s.decode('utf-8') for s in words_tensor.string_val]
word_list[100:105] # ['best', ',▁but', 'no', 'any', 'more']
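With word_list extracted, the coverage computation from the question carries over directly. A minimal sketch, where dataset_words is a hypothetical stand-in for your own tokens:
# Hypothetical dataset tokens; replace with your own.
dataset_words = ['best', 'no', 'zzz_not_in_vocab']
vocab = set(word_list)
percentage = len(set(dataset_words) & vocab) / len(dataset_words) * 100
print(f"{percentage:.1f}% of dataset words are in the USE vocabulary")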
For other models such as google/Wiki-words-500/2 we are luckier, because the vocabulary is exported to the assets/ directory:
hub.load("https://tfhub.dev/google/Wiki-words-500/2")
!head /tmp/tfhub_modules/bf115a5fe517f019bebae05b433eaeee6415f5bf/assets/tokens.txt -n 40000 | tail
# Antisense
# Antiseptic
# Antiseptics
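For these models the same check reduces to reading tokens.txt into a set. A sketch, reusing the cache path from above and the hypothetical dataset_words from earlier:
tokens_path = ('/tmp/tfhub_modules/'
               'bf115a5fe517f019bebae05b433eaeee6415f5bf/assets/tokens.txt')
# One vocabulary entry per line.
with open(tokens_path, encoding='utf-8') as f:
    wiki_words = set(line.strip() for line in f)
percentage = len(set(dataset_words) & wiki_words) / len(dataset_words) * 100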