Kaggle TPU NotFoundError for the GCS path
I am trying to train a neural network in a Kaggle kernel on my own dataset, like so:
%%time
history = model.fit(train_dataset,
steps_per_epoch=train_labels.shape[0] // BATCH_SIZE,
callbacks=[lr_callback],
epochs=EPOCHS,
validation_data=valid_dataset)
after enabling the TPU and setting the path like this:
GCS_DS_PATH = KaggleDatasets().get_gcs_path('my-first-data') # you can list the bucket with "!gsutil ls $GCS_DS_PATH"
!gsutil ls $GCS_DS_PATH
clear_output()
I am using:
TensorFlow version 2.1.0
tensorflow.keras version 2.2.4-tf
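However, this is the error I get. I previously ran exactly the same code with a dataset from a Kaggle competition, and it worked fine. Now that I am trying to run it on my own dataset, I hit this problem. My data has exactly the same structure as the competition data, and the file Train_17.jpg named in the error message is not missing (I checked).
I wonder whether this is related to the TPU: the data is read from a cloud storage bucket, and perhaps my personal (but public!) Kaggle dataset does not allow that?
Do you have any suggestions?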
Train for 28 steps, validate for 5 steps
Epoch 00001: LearningRateScheduler reducing learning rate to 1e-05.
Epoch 1/40
1/28 [>.............................] - ETA: 59:20
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
817 max_queue_size=max_queue_size,
818 workers=workers,
--> 819 use_multiprocessing=use_multiprocessing)
820
821 def evaluate(self,
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
340 mode=ModeKeys.TRAIN,
341 training_context=training_context,
--> 342 total_epochs=epochs)
343 cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
344
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
126 step=step, mode=mode, size=current_batch_size) as batch_logs:
127 try:
--> 128 batch_outs = execution_function(iterator)
129 except (StopIteration, errors.OutOfRangeError):
130 # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in execution_function(input_fn)
96 # `numpy` translates Tensors to values in Eager mode.
97 return nest.map_structure(_non_none_constant_value,
---> 98 distributed_function(input_fn))
99
100 return execution_function
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/util/nest.py in map_structure(func, *structure, **kwargs)
566
567 return pack_sequence_as(
--> 568 structure[0], [func(*x) for x in entries],
569 expand_composites=expand_composites)
570
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/util/nest.py in <listcomp>(.0)
566
567 return pack_sequence_as(
--> 568 structure[0], [func(*x) for x in entries],
569 expand_composites=expand_composites)
570
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in _non_none_constant_value(v)
128
129 def _non_none_constant_value(v):
--> 130 constant_value = tensor_util.constant_value(v)
131 return constant_value if constant_value is not None else v
132
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/tensor_util.py in constant_value(tensor, partial)
820 """
821 if isinstance(tensor, ops.EagerTensor):
--> 822 return tensor.numpy()
823 if not is_tensor(tensor):
824 return tensor
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py in numpy(self)
940 """
941 # TODO(slebedev): Consider avoiding a copy for non-CPU or remote tensors.
--> 942 maybe_arr = self._numpy() # pylint: disable=protected-access
943 return maybe_arr.copy() if isinstance(maybe_arr, np.ndarray) else maybe_arr
944
/opt/conda/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py in _numpy(self)
908 return self._numpy_internal()
909 except core._NotOkStatusException as e:
--> 910 six.raise_from(core._status_to_exception(e.code, e.message), None)
911
912 @property
/opt/conda/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
NotFoundError: {{function_node __inference_distributed_function_519795}} Error executing an HTTP request: HTTP response code 404 with body '{
"error": {
"code": 404,
"message": "No such object: kds-f683341923266d33718e6f3ab31b298eb2f954595ee701388c328ce7/images/Train_17.jpg",
"errors": [
{
"message": "No such object: kds-f683341923266d33718e6f3ab31b298eb2f954595ee701388c328ce7/images/Train_17.jpg",
"domain": "global",
"reason": "notFound"
}
]
}
}
'
when reading metadata of gs://kds-f683341923266d33718e6f3ab31b298eb2f954595ee701388c328ce7/images/Train_17.jpg
[[{{node ReadFile}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
I was using the wrong path to access my data. My images are in "my-first-data/plant-pathology-2020-cropped-images/images".
GCS_DS_PATH = KaggleDatasets().get_gcs_path('my-first-data')
!gsutil ls $GCS_DS_PATH
Listing the bucket this way shows that it contains the "plant-pathology-2020-cropped-images" folder.
This is how to define the image paths that are passed to the model:
def format_path(st):
    return GCS_DS_PATH + '/plant-pathology-2020-cropped-images/images/' + st + '.jpg'
train_paths = df_train.image_id.apply(format_path).values
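The 404 in the traceback was for an object directly under the bucket's "images/" prefix, so the intermediate folder was missing from the path. A quick existence check before calling model.fit can surface this kind of mistake early. The following is only a sketch: the helper names build_image_paths and find_missing are hypothetical, and the exists argument is assumed to be any callable that returns True when an object is present (in a TensorFlow notebook, tf.io.gfile.exists should fit that role).

```python
import posixpath


def build_image_paths(gcs_root, subdir, image_ids, ext=".jpg"):
    """Join the bucket root, the folder inside the dataset, and each image id."""
    # Strip slashes so a leading "/" in subdir cannot discard gcs_root in join().
    subdir = subdir.strip("/")
    return [posixpath.join(gcs_root, subdir, image_id + ext) for image_id in image_ids]


def find_missing(paths, exists):
    """Return every path for which exists(path) is falsy."""
    return [path for path in paths if not exists(path)]


# Assumed usage inside the notebook (not executed here):
#   import tensorflow as tf
#   paths = build_image_paths(GCS_DS_PATH,
#                             "plant-pathology-2020-cropped-images/images",
#                             df_train.image_id)
#   print(find_missing(paths[:20], tf.io.gfile.exists))
```

Checking only the first few paths keeps the number of requests to the bucket small while still catching a wrong folder prefix.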