How to create a custom dataset for Google TensorFlow attention OCR?

Question

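I was able to create TFRecord files for the dataset. But I don't know whether I should write all images into a single TFRecord file or create multiple TFRecord files. Also, I don't quite understand the config file. What content should be in the "charset_filename" file? Should it be a collection of all possible characters in the dataset? When generating the TFRecord file, we converted characters to integer ids; should this file include characters or their ids?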

Answer 1

whether I should write all images into a single TFRecord file or create multiple TFRecord files

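It depends on the size of your training data, and it affects how well parallel prefetching can fill the input queue. I would recommend about 1000 samples per shard (one tfrecord file with a num-of-total suffix, for example /path/to/my/dataset-00000-of-00512).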
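A minimal sketch of the sharding scheme described above, in plain Python. The helper names (shard_filenames, split_into_shards) and the stand-in sample list are hypothetical; actually serializing each chunk with a TFRecord writer is left out.

```python
import math


def shard_filenames(base_path, num_shards):
    """Return names such as base_path-00000-of-00512, one per shard."""
    return [f"{base_path}-{i:05d}-of-{num_shards:05d}" for i in range(num_shards)]


def split_into_shards(samples, samples_per_shard=1000):
    """Group samples into consecutive chunks of at most samples_per_shard items."""
    num_shards = max(1, math.ceil(len(samples) / samples_per_shard))
    return [
        samples[i * samples_per_shard:(i + 1) * samples_per_shard]
        for i in range(num_shards)
    ]


# Stand-in for a list of (image, text) pairs; replace with real samples.
samples = list(range(2500))
shards = split_into_shards(samples)
names = shard_filenames("/path/to/my/dataset", len(shards))

for name, shard in zip(names, shards):
    # Each chunk would be written to its own TFRecord file here.
    print(name, len(shard))
```

Each (name, shard) pair corresponds to one output file, so the last shard may simply hold fewer samples than the others.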

What content should be in "charset_filename" file?

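It is a text file that defines a mapping between integer ids and the corresponding characters. It has the following format: <id><TAB><character>. One line in the file should define an id for the <nul> character, a special character that the model outputs once it reaches the end of the sequence in order to pad the output to a fixed length.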

For example, here is an excerpt from the charset file for the FSNS dataset:

0    
133 <nul>
1   l
2   ’
3   é
4   t

Note that the <SPACE> character has id=0.
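To make the format concrete, here is a rough sketch of reading such a file into an id-to-character dictionary. The function name parse_charset is hypothetical and this is not the loader used by the model itself; the sample text just mirrors the excerpt above with real TAB separators.

```python
def parse_charset(lines):
    """Parse <id><TAB><character> lines into a dict mapping int id -> character."""
    charset = {}
    for line in lines:
        # Strip only the newline: the character itself may be a space.
        line = line.rstrip("\n")
        if not line:
            continue
        id_text, separator, character = line.partition("\t")
        if not separator:
            raise ValueError(f"expected <id><TAB><character>, got {line!r}")
        charset[int(id_text)] = character
    return charset


# In practice the lines would come from open(path, encoding="utf-8").
sample = "0\t \n133\t<nul>\n1\tl\n2\t\u2019\n3\t\u00e9\n4\tt\n"
charset = parse_charset(sample.splitlines())

print(repr(charset[0]))   # the <SPACE> character
print(charset[133])       # the <nul> padding token
```

Note that the line must not be stripped of all whitespace, otherwise the entry for the space character would lose its value.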

Should it be a collection of all possible characters in the dataset?

Yes. This file should define the id-to-character mapping for every character that occurs in the dataset.
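One possible way to generate such a file from your own labels is sketched below. The function name build_charset_lines and the example labels are made up for illustration, and the id assignment (sorted characters, <nul> last) is only an assumption; any assignment should work as long as the same ids are used when writing the TFRecord files.

```python
def build_charset_lines(labels, null_token="<nul>"):
    """Return <id><TAB><character> lines covering every character in labels."""
    characters = sorted(set("".join(labels)))
    lines = [f"{i}\t{ch}" for i, ch in enumerate(characters)]
    # Reserve one extra id for the padding token.
    lines.append(f"{len(characters)}\t{null_token}")
    return lines


# Hypothetical ground-truth texts from a custom dataset.
labels = ["rue de la paix", "avenue"]
lines = build_charset_lines(labels)

# Writing "\n".join(lines) to a UTF-8 text file would give the charset file.
for line in lines:
    print(repr(line))
```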

When generating the TFRecord file, we converted characters to integer ids; should this file include characters or their ids?

Both. Each line in the file should be in the form <id><TAB><character>.
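As an illustration of how the same mapping could be used on the TFRecord side, the sketch below converts a label to ids and pads it with the <nul> id. The function name encode_text, the max_length value, and the idea of padding the label at write time are assumptions, not something stated above; the ids reuse the FSNS excerpt.

```python
def encode_text(text, char_to_id, null_id, max_length):
    """Map each character to its id and pad with null_id up to max_length."""
    if len(text) > max_length:
        raise ValueError(f"text longer than max_length={max_length}: {text!r}")
    ids = [char_to_id[ch] for ch in text]  # KeyError means the charset is incomplete
    return ids + [null_id] * (max_length - len(ids))


# Inverse of the id -> character mapping from the charset excerpt.
char_to_id = {" ": 0, "l": 1, "\u2019": 2, "\u00e9": 3, "t": 4}
null_id = 133

padded = encode_text("l\u2019\u00e9t\u00e9", char_to_id, null_id, max_length=8)
print(padded)
```

A KeyError here is a useful signal that some character in the dataset is missing from the charset file.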
