ValueError: Generator yielding element of unexpected shape when using tf.data.Dataset.from_generator().padded_batch() - what am I doing wrong?
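I am trying to train a named entity recognition model with TensorFlow (version 2.2.0). I have been adapting this model to TensorFlow 2. The model uses tf.data.Dataset.from_generator and .padded_batch to stream training data from disk efficiently. However, I keep getting an error related to the shape of the data yielded by the generator.
Here is the code for my generator function and the function that wraps it in tf.data.Dataset.from_generator: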
# python 3.7.7
import tensorflow as tf
tf.__version__
# 2.2.0

def generator(sent_file, tag_file):
    with open(sent_file, "r") as sents, open(tag_file, "r") as tags:
        for line_sents, line_tags in zip(sents, tags):
            yield parser(line_sents, line_tags)

def parser(line_sents, line_tags):
    # Words and tags.
    words = [w.encode() for w in line_sents.strip("\n").split()]
    tags = [t.encode() for t in line_tags.strip("\n").split()]
    # Characters.
    chars = [[c.encode() for c in w] for w in line_sents.strip("\n").split()]
    lengths = [len(c) for c in chars]
    max_len = max(lengths)
    chars = [c + [b"<pad>"] * (max_len - 1) for c, l in zip(chars, lengths)]
    # breakpoint()  # BREAKPOINT 1
    return ((words, len(words)), (chars, lengths)), tags

def inputter(wordpath, tagpath, params=None, shuffle_and_repeat=False):
    params = params if params is not None else {}
    shapes = (((tf.TensorShape(dims=[None]), tf.TensorShape(dims=())),  # words, num_words
               (tf.TensorShape(dims=[None, None]), tf.TensorShape(dims=[None]))),
              tf.TensorShape(dims=[None]))  # tags
    types = (((tf.string, tf.int32),
              (tf.string, tf.int32)),
             tf.string)
    defaults = ((('<pad>', 0),
                 ('<pad>', 0)),
                'O')
    dataset = tf.data.Dataset.from_generator(
        generator=generator,
        output_shapes=shapes,
        output_types=types,
        args=(wordpath, tagpath)
    )
    # breakpoint()  # BREAKPOINT 2.
    if shuffle_and_repeat:
        dataset = dataset.shuffle(params['buffer']).repeat(params['epochs'])
    dataset = (dataset
               .padded_batch(params.get('batch_size', 20),
                             padded_shapes=shapes,
                             padding_values=defaults)
               )
    # breakpoint()  # BREAKPOINT 3.
    return dataset
When my script reaches the tf.estimator.train_and_evaluate line, I get the following error:
ValueError: `generator` yielded an element of shape (29,) where an element of shape (None, None) was expected.
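I inserted breakpoints at the three commented-out breakpoint() lines to debug my script.
At breakpoint 1, inside the parser function, the values to be returned look correct and appear to match the rank/dimensions specified in the output_shapes argument of tf.data.Dataset.from_generator: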
# (Pdb) words
[b'No', b'association', b'was', b'also', b'found', b'in', b'European', b'and', b'Asian', b'individuals', b'hospital', b'based', b'controls', b'ever', b'smoking', b'subjects', b'DM', b'assessment', b'by', b'medical', b'record', b'or', b'physician', b'diagnosis', b'and', b'insulin', b'prescription', b'for', b'DM']
# (Pdb) len(words)
29
# (Pdb) chars
[[b'N', b'o', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'a', b's', b's', b'o', b'c', b'i', b'a', b't', b'i', b'o', b'n', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'w', b'a', b's', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'a', b'l', b's', b'o', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'f', b'o', b'u', b'n', b'd', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'i', b'n', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'E', b'u', b'r', b'o', b'p', b'e', b'a', b'n', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'a', b'n', b'd', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'A', b's', b'i', b'a', b'n', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'i', b'n', b'd', b'i', b'v', b'i', b'd', b'u', b'a', b'l', b's', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'h', b'o', b's', b'p', b'i', b't', b'a', b'l', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'b', b'a', b's', b'e', b'd', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'c', b'o', b'n', b't', b'r', b'o', b'l', b's', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'e', b'v', b'e', b'r', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', 
b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b's', b'm', b'o', b'k', b'i', b'n', b'g', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b's', b'u', b'b', b'j', b'e', b'c', b't', b's', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'D', b'M', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'a', b's', b's', b'e', b's', b's', b'm', b'e', b'n', b't', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'b', b'y', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'm', b'e', b'd', b'i', b'c', b'a', b'l', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'r', b'e', b'c', b'o', b'r', b'd', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'o', b'r', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'p', b'h', b'y', b's', b'i', b'c', b'i', b'a', b'n', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'd', b'i', b'a', b'g', b'n', b'o', b's', b'i', b's', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'a', b'n', b'd', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'i', b'n', b's', b'u', b'l', b'i', b'n', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'p', b'r', b'e', b's', b'c', b'r', b'i', b'p', b't', b'i', b'o', b'n', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', 
b'<pad>'], [b'f', b'o', b'r', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>'], [b'D', b'M', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>', b'<pad>']]
# (Pdb) lengths
[2, 11, 3, 4, 5, 2, 8, 3, 5, 11, 8, 5, 8, 4, 7, 8, 2, 10, 2, 7, 6, 2, 9, 9, 3, 7, 12, 3, 2]
# (Pdb) tags
[b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'O', b'B', b'O', b'O', b'O']
At breakpoint 2, inside the inputter function immediately after the tf.data.Dataset.from_generator call, the dataset's shapes are as expected:
# (Pdb) dataset.element_spec
(((TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int32, name=None)), (TensorSpec(shape=(None, None), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))), TensorSpec(shape=(None,), dtype=tf.string, name=None))
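At breakpoint 3, after calling .padded_batch, the rank of every nested element of the dataset has increased by 1, presumably to account for the batch dimension?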
# (Pdb) dataset.element_spec
(((TensorSpec(shape=(None, None), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None)), (TensorSpec(shape=(None, None, None), dtype=tf.string, name=None), TensorSpec(shape=(None, None), dtype=tf.int32, name=None))), TensorSpec(shape=(None, None), dtype=tf.string, name=None))
Can anyone help me understand what is going wrong? Thank you.
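There is a typo in your code: the digit 1 instead of the letter l. With max_len - 1, every word receives the same number of pad tokens, so the rows of chars end up with different lengths (as the output at breakpoint 1 shows) and cannot form a rank-2 tensor. Change the line:
chars = [c + [b"<pad>"] * (max_len - 1) for c, l in zip(chars, lengths)]
to:
chars = [c + [b"<pad>"] * (max_len - l) for c, l in zip(chars, lengths)]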
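To see the effect of the typo without running TensorFlow, here is a minimal pure-Python sketch. The helper pad_chars and its buggy flag are hypothetical names introduced only for this illustration; the padding expressions themselves are copied from the question and the fix above.

```python
def pad_chars(words, pad=b"<pad>", buggy=False):
    """Split each word into byte characters and pad the rows.

    pad_chars and the buggy flag are illustrative, not part of the
    original code. buggy=True reproduces the typo (digit 1);
    buggy=False applies the fix (letter l).
    """
    chars = [[c.encode() for c in w] for w in words]
    lengths = [len(c) for c in chars]
    max_len = max(lengths)
    if buggy:
        # Typo: every row gets the same number of pads, so rows stay ragged.
        padded = [c + [pad] * (max_len - 1) for c, l in zip(chars, lengths)]
    else:
        # Fix: each row is padded up to max_len, so all rows have equal length.
        padded = [c + [pad] * (max_len - l) for c, l in zip(chars, lengths)]
    return padded, lengths


words = ["No", "association", "was"]
buggy_rows, _ = pad_chars(words, buggy=True)
fixed_rows, lengths = pad_chars(words)

print([len(row) for row in buggy_rows])  # [12, 21, 13]
print([len(row) for row in fixed_rows])  # [11, 11, 11]
```

Only the rectangular result can be converted to a tensor matching the declared (None, None) shape. The ragged version most likely gets treated as a one-dimensional sequence of 29 lists, which would explain the (29,) in the error message.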