TensorFlow negative sampling

Question

I am trying to follow the Udacity tutorial on TensorFlow, where I came across the following two lines for the word embedding model:

  # Look up embeddings for inputs.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset)
  # Compute the softmax loss, using a sample of the negative labels each time.
  loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(softmax_weights, softmax_biases, 
                        embed, train_labels, num_sampled, vocabulary_size))

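I understand that the second statement samples negative labels. But how does it know what the negative labels are? All I pass to the second function is the current input and its corresponding labels, along with the number of labels I want to (negatively) sample. Isn't there a risk of sampling from the input set itself?

This is the full example: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/udacity/5_word2vec.ipynb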

Answer 1

You can find the documentation for tf.nn.sampled_softmax_loss() in the TensorFlow API reference. TensorFlow also provides a good explanation of Candidate Sampling (pdf).

How does it know what the negative labels are?

TensorFlow randomly selects the negative classes among all the possible classes (for you, all the possible words).

Isn't there the risk of sampling from the input set in itself?

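When you want to compute the softmax probability for your true label, you compute: exp(logits[true_label]) / sum(exp(logits[c]) for c in [true_label] + negative_sampled_labels). Since the number of classes is huge (the vocabulary size), the probability of sampling true_label as a negative label is very small.
Anyway, I think TensorFlow removes this possibility altogether when randomly sampling. (EDIT: @Alex confirms that TensorFlow does this by default.)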
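To make the formula above concrete, here is a minimal plain-Python sketch of it, not TensorFlow's implementation. The logits and sampled ids are made-up values, and the helper name sampled_softmax_prob is hypothetical. It also ignores the sampling-probability correction that a real sampled softmax applies to the logits. The flag mirrors the remove_accidental_hits argument of tf.nn.sampled_softmax_loss, which is True by default.

```python
import math

def sampled_softmax_prob(logits, true_label, sampled, remove_accidental_hits=True):
    """Softmax probability of true_label over {true_label} plus the sampled classes."""
    negatives = list(sampled)
    if remove_accidental_hits:
        # Drop sampled classes that coincide with the true label.
        negatives = [c for c in negatives if c != true_label]
    candidates = [true_label] + negatives
    # Subtract the max logit for numerical stability.
    m = max(logits[c] for c in candidates)
    exps = [math.exp(logits[c] - m) for c in candidates]
    return exps[0] / sum(exps)

logits = [2.0, 0.5, -1.0, 0.0, 1.5]  # hypothetical scores for a 5-class vocabulary
true_label = 0
sampled = [3, 0, 4]                  # class 0 was drawn by accident

p_masked = sampled_softmax_prob(logits, true_label, sampled)
p_unmasked = sampled_softmax_prob(logits, true_label, sampled,
                                  remove_accidental_hits=False)
print(p_masked, p_unmasked)
```

If the accidental hit is left in, the true class is also counted as a negative, so the probability assigned to it comes out lower.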

Answer 2

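Candidate sampling explains how the sampled loss function is calculated:

Compute the loss function over a small subset C of the full set of classes L, where C = T ⋃ S, T is the set of target classes, and S is a set of classes sampled at random from all classes.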

The code you provided uses tf.nn.embedding_lookup to get the inputs, embed, of shape [batch_size, dim].
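As a rough illustration of what that lookup returns, here is a plain-Python stand-in (not the TensorFlow op itself). The tiny embedding table and the batch of ids are made-up values.

```python
def embedding_lookup(embeddings, ids):
    """Return the embedding row for each id, in order (plain-Python stand-in)."""
    return [embeddings[i] for i in ids]

# Hypothetical 4-word vocabulary with 3-dimensional embeddings.
embeddings = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
train_dataset = [2, 0, 2]  # a batch of 3 word ids

embed = embedding_lookup(embeddings, train_dataset)
print(len(embed), len(embed[0]))  # batch_size, dim
```

Each input id simply selects one row of the table, so the result has one row per example in the batch.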

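Then it uses tf.nn.sampled_softmax_loss to get the sampled loss function:

softmax_weights: A tensor of shape [num_classes, dim].
softmax_biases: A tensor of shape [num_classes]. The class biases.
embed: A tensor of shape [batch_size, dim].
train_labels: A tensor of shape [batch_size, 1]. The target classes T.
num_sampled: An int. The number of classes to randomly sample per batch. The sampled classes S.
vocabulary_size: The number of possible classes.
sampled_values: defaults to log_uniform_candidate_sampler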

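For one batch, the target classes are just train_labels (T). The sampler then randomly draws num_sampled class ids (S) from the range [0, vocabulary_size) to act as negative classes.

The draw is made over class ids, using the default sampler listed above, and the matching rows of softmax_weights and softmax_biases are scored against embed. Since embed is embeddings[train_dataset] (of shape [batch_size, embedding_size]) and the draw does not exclude the targets, a sampled class may coincide with train_labels[i]; in that case it is not a negative label.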
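For intuition about the log_uniform_candidate_sampler default listed above, here is a small plain-Python sketch of a log-uniform (Zipfian) draw over class ids. It uses the distribution given in the TensorFlow docs, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1), which assumes ids are sorted by decreasing word frequency. The helper names are hypothetical, and the sketch draws with replacement, ignoring other options of the real sampler such as drawing unique classes.

```python
import math
import random

def log_uniform_prob(class_id, range_max):
    """P(class_id) under a log-uniform (Zipfian) distribution over [0, range_max)."""
    return (math.log(class_id + 2) - math.log(class_id + 1)) / math.log(range_max + 1)

def sample_classes(num_sampled, range_max, rng):
    """Draw num_sampled class ids (with replacement) from the log-uniform distribution."""
    weights = [log_uniform_prob(c, range_max) for c in range(range_max)]
    return rng.choices(range(range_max), weights=weights, k=num_sampled)

vocabulary_size = 50000
num_sampled = 64
rng = random.Random(0)  # fixed seed so the sketch is repeatable

sampled = sample_classes(num_sampled, vocabulary_size, rng)
print(sorted(sampled)[:10])
```

Nothing in the draw looks at train_labels, which is why a target class can end up among the sampled ids.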

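According to page 2 of Candidate sampling, there are different types. For NCE and negative sampling, NEG = S, which may contain part of T; for sampled logistic and sampled softmax, NEG = S - T, which explicitly removes T.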
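The two definitions of NEG can be written out with Python sets. This is only an illustration with made-up class ids, not code from TensorFlow.

```python
T = {7}             # target class for one example (hypothetical ids)
S = {3, 7, 12, 40}  # sampled classes; 7 is an accidental hit

neg_nce = S                  # NCE / negative sampling: NEG = S, may overlap T
neg_sampled_softmax = S - T  # sampled softmax / sampled logistic: NEG = S - T

print(sorted(neg_nce), sorted(neg_sampled_softmax))
```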

Indeed, there is a chance of sampling from the training set.

python

tensorflow