TensorFlow: Creating a custom text dataset to use in machine translation

Question

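I would like to train a machine translation system using Transformers on my own data. A number of datasets are already available in TFDS (TensorFlow Datasets), and there is also an option to add a new dataset to TFDS. But what if I don't want to wait for those add requests to go through and would rather train directly on my own data?

In the example Colab notebook, the training and validation data are created as follows: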

examples, metadata = tfds.load('ted_hrlr_translate/pt_to_en', with_info=True,
                               as_supervised=True)
train_examples, val_examples = examples['train'], examples['validation']

I believe TFDS does a lot of preprocessing to fit the pipeline, and the result is a Dataset type.

type(train_examples)

tensorflow.python.data.ops.dataset_ops._OptionsDataset

But for custom CSV data like the following, how do I create a 'Dataset' that is compatible with this model?

import pandas as pd 

# initialize list of lists 
data = [['tom', 10], ['nick', 15], ['juli', 14],['tom', 10], ['nick', 15]]
# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Name', 'Age']) 

# print dataframe. 
df

Answer 1

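The dataset in the Colab notebook is just a collection of string pairs (translated sentence pairs). That doesn't seem to be what you have there (you have names and ages??).

It is certainly possible, though, to create a dataset from a CSV of language pairs (or of names and ages!). There is a comprehensive guide to the Dataset API here: https://www.tensorflow.org/guide/datasets. In essence, given a CSV file named "translations.csv" that looks like this: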

hola,hello
adios,goodbye
perro,dog
huevos,eggs
...

then we can just do:

my_dataset = tf.data.experimental.CsvDataset("translations.csv", [tf.string, tf.string])
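
As a rough sketch of how this could stand in for the `train_examples` / `val_examples` pair from the notebook: the snippet below writes a tiny sample file, loads it with `CsvDataset`, and splits it with `take` / `skip`. The sample rows, the temporary file location, and the 2/1 split are assumptions made for illustration; they are not part of the notebook.

```python
import csv
import os
import tempfile

import tensorflow as tf

# Hypothetical sample rows, written to a temporary translations.csv for the demo.
rows = [("hola", "hello"), ("adios", "goodbye"), ("huevos", "eggs")]
csv_path = os.path.join(tempfile.mkdtemp(), "translations.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

# One dtype per column; each element is a (source, target) tuple of
# scalar tf.string tensors, similar in shape to as_supervised=True in TFDS.
pairs = tf.data.experimental.CsvDataset(csv_path, [tf.string, tf.string])

# Assumed split for illustration: first 2 rows for training, the rest for validation.
train_examples = pairs.take(2)
val_examples = pairs.skip(2)

for src, tgt in train_examples:
    print(src.numpy().decode("utf-8"), "->", tgt.numpy().decode("utf-8"))
```

Because each element is a pair of string tensors, the tokenization and batching steps that consume `(pt, en)` pairs in the notebook should be able to consume these pairs in the same way, although that has not been checked against the notebook itself.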

Similarly, for your name/age dataset you could do something like:

my_dataset = tf.data.experimental.CsvDataset("ages.csv", [tf.string, tf.int32])
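
If the data is already in a pandas DataFrame, as in the question, writing it to disk first may not be necessary. A minimal sketch, assuming the columns fit in memory, is to pass them to `tf.data.Dataset.from_tensor_slices`. The shortened three-row frame here is only for illustration.

```python
import pandas as pd
import tensorflow as tf

# Shortened version of the DataFrame from the question.
df = pd.DataFrame(
    [["tom", 10], ["nick", 15], ["juli", 14]], columns=["Name", "Age"]
)

# from_tensor_slices slices the tuple of columns row by row, so each
# element is a (tf.string scalar, tf.int32 scalar) pair.
dataset = tf.data.Dataset.from_tensor_slices(
    (df["Name"].tolist(), df["Age"].astype("int32").to_numpy())
)

for name, age in dataset:
    print(name.numpy().decode("utf-8"), int(age.numpy()))
```

When reading a file saved with `df.to_csv(...)` instead, note that pandas writes a header row by default; `CsvDataset` accepts `header=True` to skip it.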
