Implementing a simple probabilistic model with negative log likelihood loss
First, a quick disclaimer: I originally posted this question on Reddit (in the deep learning and machine learning subreddits), but I figured I might as well ask for your expertise here as well. Without further ado:

This year I am challenging myself with the Deep Unsupervised Learning Course of Berkeley University, and although I have just started the warm-up exercise of week 1, I am already running into 'technical' difficulties.

The exercise in question is "1. Warmup" in the following document: Week 1 Exercises. (My apologies, I am not familiar enough with Reddit formatting to include the image seamlessly.)
In my understanding, we have a variable x that can take values in 1..100, each with a specific probability of being sampled (defined in the sample_data() function).

The task is therefore to fit a parameter vector theta which, passed through a softmax function, should give the likelihood of a specific element x_i being sampled. Namely, theta_1 should be the parameter that "bumps up" the soft-max value corresponding to the variable x = 1, and so on.
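In other words, the model I have in mind (my own notation, not taken from the handout) is

    p_theta(x = i) = exp(theta_i) / (exp(theta_1) + ... + exp(theta_100)),   i = 1..100

and the training objective is the average negative log likelihood of the sampled data,

    loss(theta) = - (1/N) * sum_n log p_theta(x = x_n)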
Using Tensorflow, I think I managed to create such a model, but when it comes to training I believe I am missing a crucial point, as the program is unable to compute the gradients with respect to the theta parameters.

I would like to know whether I have misunderstood the task, and whether there is a better way to achieve the result of the exercise.

Here is the code, where the failing part is at # Computing gradients.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

if __name__ == "__main__":
    # Sampling function of the x variable provided in the exercise
    def sample_data():
        count = 10000
        rand = np.random.RandomState(0)
        a = 0.3 + 0.1 * rand.randn(count)
        b = 0.8 + 0.05 * rand.randn(count)
        mask = rand.rand(count) < 0.5
        samples = np.clip(a * mask + b * (1 - mask), 0.0, 1.0)
        return np.digitize(samples, np.linspace(0.0, 1.0, 100))

    full_data = sample_data()
    train_ds = full_data[:int(.8*len( full_data))]
    val_ds = full_data[int(.8*len( full_data)):]

    # Declaring parameters theta
    w_init = tf.zeros_initializer()
    params = tf.Variable(
        initial_value=w_init(shape=(1, 100), dtype='float32'),
        trainable=True, name='params')

    softmax = tf.squeeze( tf.nn.softmax( params, axis=1))

    # Should materialize the loss of the model
    def get_neg_log_likelihood( inputs):
        return - tf.math.log( softmax)

    neg_log_likelihoods = get_neg_log_likelihood( softmax)

    dist = tfp.distributions.Categorical( probs=softmax, dtype=tf.int32)

    optimizer = tf.keras.optimizers.Adam()

    for epoch in range( 100):
        minibatch_size = 200
        n_minibatches = len( train_ds) // minibatch_size

        # Running over minibatches of the data
        for minibatch in range( n_minibatches):
            # Minibatching
            start_index = (minibatch*minibatch_size)
            end_index = (minibatch_size*minibatch + minibatch_size)

            x = train_ds[start_index:end_index]

            with tf.GradientTape() as tape:
                tape.watch( params)
                loss = tf.reduce_mean( - dist.log_prob( x))

            # Computing gradients
            grads = tape.gradient( loss, params)
            print( grads) # Result: None
            # input()
            optimizer.apply_gradients( zip( grads, params))
Thanks in advance for your time.

PS: My background is mainly in Deep Reinforcement Learning, so I can make sense of the various models used there (policies, value functions, ...), but I am trying to sharpen my understanding of the internals of the models themselves, namely generative probabilistic models (GANs, VAEs) and other unsupervised learning models in general (RealNVP, Normalizing Flows, ...).
Pretty sure nobody will ever see this, but I might as well bring some closure to it.

First, I computed the gradients by deriving their expression directly from the negative log likelihood of the soft-max values, thereby dropping the Tensorflow framework altogether.
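For reference, the per-sample gradient that the Jacobian in the code below encodes is the classic one for a soft-max with negative log likelihood: for an observed sample x = i, writing s = softmax(theta),

    d/d theta_j [ -log s_i ] = s_j - 1   if j == i
    d/d theta_j [ -log s_i ] = s_j       otherwise

i.e. the gradient of one sample's negative log likelihood is softmax(theta) minus the one-hot encoding of its class; these per-sample gradients are then summed over the minibatch before each update.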
While the results are a bit below my expectations, the program was nonetheless able to fit the model to a distribution somewhat similar to the empirical distribution of the sampled data. I guess this is due to the one-dimensional theta parameter vector alone not being enough to fully capture the real data distribution, as well as to the finite amount of sampled data.

Updated version of the code:
import numpy as np
from matplotlib import pyplot as plt

np.random.seed( 42)

def softmax(X, theta = 1.0, axis = None):
    # Shameful copy-paste from SO
    y = np.atleast_2d(X)
    if axis is None:
        axis = next(j[0] for j in enumerate(y.shape) if j[1] > 1)
    y = y * float(theta)
    y = y - np.expand_dims(np.max(y, axis = axis), axis)
    y = np.exp(y)
    ax_sum = np.expand_dims(np.sum(y, axis = axis), axis)
    p = y / ax_sum
    if len(X.shape) == 1: p = p.flatten()
    return p
if __name__ == "__main__":
    def sample_data():
        count = 10000
        rand = np.random.RandomState(0)
        a = 0.3 + 0.1 * rand.randn(count)
        b = 0.8 + 0.05 * rand.randn(count)
        mask = rand.rand(count) < 0.5
        samples = np.clip(a * mask + b * (1 - mask), 0.0, 1.0)
        return np.digitize(samples, np.linspace(0.0, 1.0, 100))

    full_data = sample_data()
    train_ds = full_data[:int(.8*len( full_data))]
    val_ds = full_data[int(.8*len( full_data)):]

    # Declaring parameters
    params = np.zeros(100)
    # Used for loss computation
    def get_neg_log_likelihood( softmax):
        return - np.log( softmax)

    def get_loss( params, x):
        return np.mean( [get_neg_log_likelihood( softmax( params))[i-1] for i in x])

    lr = .0005

    for epoch in range( 1000):
        # Shuffling training data
        np.random.shuffle( train_ds)

        minibatch_size = 100
        n_minibatches = len( train_ds) // minibatch_size

        # Running over minibatches of the data
        for minibatch in range( n_minibatches):
            smax = softmax( params)
            # Jacobian of the neg log likelihood: row i holds the gradient
            # of -log softmax(params)[i] with respect to params
            jacobian = [[ smax[j] - 1 if i == j else
                          smax[j] for j in range(100)] for i in range(100)]

            # Minibatching
            start_index = (minibatch*minibatch_size)
            end_index = (minibatch_size*minibatch + minibatch_size)

            x = train_ds[start_index:end_index]

            # Stack the per-sample gradients (x values are 1-based, Jacobian
            # rows are 0-based) and sum them over the minibatch
            grad_matrix = np.vstack( [jacobian[i-1] for i in x])
            grads = np.sum( grad_matrix, axis=0)

            params -= lr * grads

        print( "Epoch %d -- Train loss: %.4f , Val loss: %.4f" %(epoch, get_loss( params, train_ds), get_loss( params, val_ds)))
        # Plotting every ~100 epochs
        if epoch % 100 == 0:
            counters = { i+1: 0 for i in range(100)}
            for x in full_data:
                counters[x] += 1

            histogram = np.array( [ counters[i+1] / len( full_data) for i in range( 100)])

            fsmax = softmax( params)

            fig, ax = plt.subplots()
            ax.set_title('Dist. Comp. after %d epochs of training (from scratch)' % epoch)

            x = np.arange( 1, 101)
            width = 0.35

            rects1 = ax.bar(x - width/2, fsmax, width, label='Model')
            rects2 = ax.bar(x + width/2, histogram, width, label='Empirical')
            ax.set_ylabel('Likelihood')
            ax.set_xlabel("Variable x's values")
            ax.legend()

            def autolabel(rects):
                # Attach a text label above each bar giving its height
                for rect in rects:
                    height = rect.get_height()
                    ax.annotate('%.3f' % height,
                                xy=(rect.get_x() + rect.get_width() / 2, height),
                                xytext=(0, 3), textcoords="offset points",
                                ha='center', va='bottom')

            autolabel(rects1)
            autolabel(rects2)

            fig.tight_layout()
            # Assumes a 'plots/' directory already exists
            plt.savefig( 'plots/results_after_%d_epochs.png' % epoch)
For completeness, here is a picture of the final model distribution: Modeled vs Empirical Distribution
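As a side note on the original TensorFlow attempt: I suspect the gradients came back as None because the softmax (and therefore dist) was built outside the GradientTape, so the tape never recorded any operation linking the loss to params, and tape.watch alone does not help. A minimal, untested sketch of that variant, assuming this is indeed the issue (it reuses sample_data() and train_ds from the code above), would be:

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp

params = tf.Variable( tf.zeros((1, 100)), trainable=True, name='params')
optimizer = tf.keras.optimizers.Adam()

minibatch_size = 200
for epoch in range( 100):
    for minibatch in range( len( train_ds) // minibatch_size):
        x = train_ds[minibatch*minibatch_size:(minibatch+1)*minibatch_size]

        with tf.GradientTape() as tape:
            # Build the softmax and the distribution INSIDE the tape so the
            # dependence of the loss on params is recorded
            softmax = tf.squeeze( tf.nn.softmax( params, axis=1))
            dist = tfp.distributions.Categorical( probs=softmax, dtype=tf.int32)
            # x takes values in 1..100 while the categories are 0..99
            loss = tf.reduce_mean( - dist.log_prob( x - 1))

        grads = tape.gradient( loss, params)
        optimizer.apply_gradients( [( grads, params)])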