TensorFlow - batching issues
I'm new to TensorFlow and I'm trying to train from my CSV file using batching.
Here is my code for reading the CSV file and creating batches:
filename_queue = tf.train.string_input_producer(
    ['BCHARTS-BITSTAMPUSD.csv'], shuffle=False, name='filename_queue')
reader = tf.TextLineReader()
key, value = reader.read(filename_queue)
# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.]]
xy = tf.decode_csv(value, record_defaults=record_defaults)
# Collect the decoded CSV rows into batches.
train_x_batch, train_y_batch = \
    tf.train.batch([xy[0:-1], xy[-1:]], batch_size=100)
Here is the training part:
# initialize
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Start populating the filename queue.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
# train my model
for epoch in range(training_epochs):
    avg_cost = 0
    total_batch = int(2193 / batch_size)
    for i in range(total_batch):
        batch_xs, batch_ys = sess.run([train_x_batch, train_y_batch])
        feed_dict = {X: batch_xs, Y: batch_ys}
        c, _ = sess.run([cost, optimizer], feed_dict=feed_dict)
        avg_cost += c / total_batch
    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))
coord.request_stop()
coord.join(threads)
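Here are my questions:
1. My CSV file has 2193 records and my batch size is 100. What I want is for every epoch to start from the first record and train on 21 batches of 100 records, followed by one last batch of 93 records, so 22 batches in total.
However, I found that every batch has size 100, even the last one. Moreover, from the second epoch onward it does not start from the first record.
2. How can I get the number of records (2193 in this case)? Should I hard-code it, or is there a smarter way? I tried tensor.get_shape().as_list(), but it doesn't work for batch_xs; it just returns an empty shape [].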
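We recently added a new API to TensorFlow, tf.contrib.data, that makes it easier to solve problems like this. (The "queue runner"-based APIs make it difficult to write computations on exact epoch boundaries, because the epoch boundary gets lost.)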
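To see why the epoch boundary gets lost, here is a plain-Python simulation of the behaviour described in the question. It is a simplification, not TensorFlow code: it assumes the queue cycles through the file endlessly and always emits full batches, and `fixed_size_batches` is a made-up helper name.

```python
from itertools import cycle, islice

NUM_RECORDS = 2193   # size of the CSV in the question
BATCH_SIZE = 100
BATCHES_PER_EPOCH = int(NUM_RECORDS / BATCH_SIZE)  # 21, as in the question's loop


def fixed_size_batches(num_records, batch_size, num_batches):
    """Yield full batches of record indices from an endlessly repeating stream."""
    stream = cycle(range(num_records))
    for _ in range(num_batches):
        yield list(islice(stream, batch_size))


# Two "epochs" of the question's training loop.
batches = list(fixed_size_batches(NUM_RECORDS, BATCH_SIZE, 2 * BATCHES_PER_EPOCH))

# First batch consumed by the second epoch.
epoch2_first = batches[BATCHES_PER_EPOCH]

# It starts at record index 2100, not 0, and wraps around inside the batch:
# 93 records from the end of the file, then the first 7 records again.
print(len(epoch2_first), epoch2_first[0], epoch2_first[92], epoch2_first[93])
# prints: 100 2100 2192 0
```

Every batch has exactly 100 elements, and the second pass through the loop neither starts at the first record nor ends on a file boundary, which matches both symptoms reported above.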
Here's an example of how you could rewrite your program using tf.contrib.data:
lines = tf.contrib.data.TextLineDataset("BCHARTS-BITSTAMPUSD.csv")
def decode(line):
    record_defaults = [[0.], [0.], [0.], [0.], [0.], [0.], [0.], [0.]]
    # Parse the line that Dataset.map() passes in.
    xy = tf.decode_csv(line, record_defaults=record_defaults)
    return xy[0:-1], xy[-1:]
decoded = lines.map(decode)
batched = decoded.batch(100)
iterator = batched.make_initializable_iterator()
train_x_batch, train_y_batch = iterator.get_next()
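By default, `batch(100)` on a dataset keeps a final, smaller batch instead of padding it or borrowing records from the next pass. The chunking can be sketched in plain Python; this is only an illustration of the resulting batch sizes, not TensorFlow's implementation, and `chunk` is a made-up helper name.

```python
def chunk(records, batch_size):
    """Yield consecutive slices of `records`; the last one may be shorter."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


records = list(range(2193))          # stand-in for the 2193 CSV rows
sizes = [len(b) for b in chunk(records, 100)]

print(len(sizes), sizes[0], sizes[-1])
# prints: 22 100 93
```

That is the 21 full batches plus one batch of 93 asked for in the first question. Once the records run out, the real iterator signals the end of the epoch by raising `tf.errors.OutOfRangeError`.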
The training part can then become a bit simpler:
# initialize
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# train my model
for epoch in range(training_epochs):
    total_cost = 0.0
    total_batch = 0  # Number of records seen so far in this epoch.
    # Re-initialize the iterator for another epoch.
    sess.run(iterator.initializer)
    while True:
        # NOTE: It is inefficient to make a separate sess.run() call to get each batch
        # of input data and then feed it into a different sess.run() call. For better
        # performance, define your training graph to take train_x_batch and train_y_batch
        # directly as inputs.
        try:
            batch_xs, batch_ys = sess.run([train_x_batch, train_y_batch])
        except tf.errors.OutOfRangeError:
            break
        feed_dict = {X: batch_xs, Y: batch_ys}
        c, _ = sess.run([cost, optimizer], feed_dict=feed_dict)
        total_cost += c
        total_batch += batch_xs.shape[0]
    avg_cost = total_cost / total_batch
    print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.9f}'.format(avg_cost))
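One thing to watch in the averaging above: `total_batch` ends up counting records, not batches. The question does not show how `cost` is defined. If it is a sum over the batch, dividing by the record count already gives a per-record average. If it is a per-batch mean (an assumption), each batch's cost would need to be weighted by its size first. A plain-Python sketch of that weighting, with made-up cost values and a hypothetical helper name:

```python
def epoch_average(batch_costs, batch_sizes):
    """Per-record average of per-batch mean costs, weighted by batch size."""
    total_cost = 0.0
    total_records = 0
    for cost, size in zip(batch_costs, batch_sizes):
        total_cost += cost * size      # undo the per-batch mean
        total_records += size
    return total_cost / total_records


batch_sizes = [100] * 21 + [93]        # the 22 batches from the question
batch_costs = [0.5] * 21 + [1.0]       # hypothetical per-batch mean costs

avg_cost = epoch_average(batch_costs, batch_sizes)
print('{:.9f}'.format(avg_cost))
```

In the training loop this would correspond to accumulating `c * batch_xs.shape[0]` instead of `c`.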
For more details on how to use the new API, see the "Importing Data" programmer's guide.
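On the second question (getting the record count without hard-coding 2193): the loop above already accumulates it in `total_batch` as each batch arrives. If the count is needed before training starts, one option is to count the lines with plain Python. A minimal sketch, assuming one record per line and no header row (the question does not show the file's layout); `count_records` is a hypothetical helper, not a TensorFlow API.

```python
import os
import tempfile


def count_records(path):
    """Count non-empty lines in a text file, assuming one record per line."""
    with open(path, "r") as f:
        return sum(1 for line in f if line.strip())


# Demo on a throwaway file; in the question's setting the path would be
# 'BCHARTS-BITSTAMPUSD.csv'.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("1,2,3\n4,5,6\n7,8,9\n")
    demo_path = tmp.name

num_records = count_records(demo_path)
os.remove(demo_path)
print(num_records)
# prints: 3
```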