Reading Dataset from files where some might be missing

I'm trying to load files into a TensorFlow Dataset where some of the files might be missing (in which case I want to substitute zeros for their contents).

The directory structure I'm trying to read from looks like this:

   |-data
   |---sensor_A
   |-----1.dat
   |-----2.dat
   |-----3.dat
   |---sensor_B
   |-----1.dat
   |-----2.dat
   |-----3.dat

The .dat files are .csv files that use a space as the delimiter. Each file holds a single multi-row observation, where the number of columns is fixed (say 4) and the number of rows is unknown (time-series data).
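For reference, a hypothetical 2.dat with 4 columns and 3 rows (values made up) might look like this:

   0.1 0.2 0.3 0.4
   1.1 1.2 1.3 1.4
   2.1 2.2 2.3 2.4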

I've managed to read each sensor's data into a separate TensorFlow Dataset with the following code:

import os
import tensorflow as tf

tf.enable_eager_execution()

data_root_dir = "data"

modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]

for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]

    dataset = tf.data.Dataset.from_tensor_slices((filenames,))


    def _parse_function_internal(filename):
        number_of_columns = 4
        single_observation = tf.read_file(filename)
        # Split on each delimiter character (CR, LF and space) so every
        # value becomes its own token; these are cast to floats below.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break

This successfully prints the contents of all three files for each sensor.

My problem is that some timestamps in the dataset might be missing. For example, if the file 1.dat in the sensor_A directory is missing, I get this error:

tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: mock_data\sensor_A.dat : The system cannot find the file specified.
; No such file or directory
     [[{{node ReadFile}}]] [Op:IteratorGetNextSync]

which is thrown by this line:

for el in dataset:

What I tried was to wrap the call to tf.read_file() in a try block, but unsurprisingly that doesn't work: the error is not thrown when tf.read_file() is called, but when the value is fetched from the dataset. Later I want to pass this dataset to a Keras model, so I can't simply wrap the iteration in a try block either. Is there any workaround for this? Is it even supported?
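For what it's worth, if simply dropping the missing files were acceptable, something like tf.data.experimental.ignore_errors() would sidestep the problem entirely, but I specifically want zeros in their place. A minimal sketch, assuming a TensorFlow 1.x version where tf.data.experimental is available:

import tensorflow as tf

tf.enable_eager_execution()

# Hypothetical list; some of these files may not exist on disk.
filenames = ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']

dataset = tf.data.Dataset.from_tensor_slices(filenames)
dataset = dataset.map(lambda filename: (filename, tf.read_file(filename)))
# Drop every element whose processing raised an error (here: the
# NotFoundError coming from tf.read_file on a missing file).
dataset = dataset.apply(tf.data.experimental.ignore_errors())

for filename, content in dataset:
    print(filename)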

Thanks!

I managed to solve the problem and am sharing the solution in case someone else runs into it too. I had to use an additional list of booleans indicating whether each file actually exists, and pass it to the mapper. Then, using tf.cond(), we decide whether to read the file or to mock the data with zeros (or any other logic).

import os
import tensorflow as tf

tf.enable_eager_execution()

data_root_dir = "data"

modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]

for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]
    files_exist = [os.path.isfile(filename) for filename in filenames]

    dataset = tf.data.Dataset.from_tensor_slices((filenames, files_exist))


    def _parse_function_internal(filename, file_exist):
        number_of_columns = 4
        # If the file exists, read it; otherwise mock a single row of zeros
        # (tf.cond converts the plain Python string into a string tensor).
        single_observation = tf.cond(file_exist,
                                     lambda: tf.read_file(filename),
                                     lambda: ' '.join(['0.0'] * number_of_columns))
        # Split on each delimiter character (CR, LF and space) so every
        # value becomes its own token; these are cast to floats below.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        # A mocked observation ends up with shape (1, number_of_columns).
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break
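Since the whole point is to feed this to a Keras model later, here is a rough sketch of how the per-modality datasets could be combined with tf.data.Dataset.zip so that each element carries all sensors for one timestamp. This is an assumption about how to combine modalities, not part of the solution above; it assumes _parse_function_internal from the snippet above is in scope and that all modalities share the same timestamps:

def make_modality_dataset(modality):
    # Same per-modality pipeline as above, collapsed into a helper.
    filenames = [os.path.join(data_root_dir, modality, str(t) + ".dat") for t in timestamps]
    files_exist = [os.path.isfile(f) for f in filenames]
    dataset = tf.data.Dataset.from_tensor_slices((filenames, files_exist))
    return dataset.map(_parse_function_internal)

# Zip yields one element per timestamp:
# ((filename_A, obs_A), (filename_B, obs_B)).
combined = tf.data.Dataset.zip(tuple(make_modality_dataset(m) for m in modalities_to_use))

for element_a, element_b in combined:
    print(element_a[0], element_b[0])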