GCP 和 TPU，experimental_connect_to_cluster 没有回应

Question

我正在尝试在 GCP 上使用 TPU 以及带有 Keras 的 tensorflow 2.1 API。不幸的是，我在创建 tpu-node 后卡住了。事实上，我的虚拟机似乎 "see" tpu，但无法连接到它。

我使用的代码：

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_name)
print('Running on TPU ', resolver.master())
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

代码卡在第 3 行，我收到的消息很少，然后什么也没有，所以我不知道可能是什么问题。因此我怀疑 VM 和 TPU 之间的某些连接问题。

消息：

2020-04-22 15:46:25.383775: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA 2020-04-22 15:46:25.992977: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300000000 Hz 2020-04-22 15:46:26.042269: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5636e4947610 initialized for platform Host (this does not guarantee that XLA will be used). Devices: 2020-04-22 15:46:26.042403: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version 2020-04-22 15:46:26.080879: I tensorflow/core/common_runtime/process_util.cc:147] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance. E0422 15:46:26.263937297 2263 socket_utils_common_posix.cc:198] check for SO_REUSEPORT: {"created":"@1587570386.263923266","description":"SO_REUSEPORT unavailable on compiling system","file":"external/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc","file_line":166} 2020-04-22 15:46:26.269134: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.163.38.90:8470} 2020-04-22 15:46:26.269192: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:32263}

此外，我使用的是来自gcp的"Deep Learning"镜像，所以我应该不需要安装任何东西吧？

有人对 TF 2.1 有同样的问题吗？ P.S：相同的代码在 Kaggle 和 Colab 上运行良好。

Answer 1

为了重现，我使用 ctpu up --zone=europe-west4-a --disk-size-gb=50 --machine-type=n1-standard-8 --tf-version=2.1 创建了 vm 和 tpu。然后运行你的代码，它成功了。

taylanbil@taylanbil:~$ python3 run.py 
Running on TPU  grpc://10.240.1.2:8470
2020-04-28 19:18:32.597556: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-04-28 19:18:32.627669: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2000189999 Hz
2020-04-28 19:18:32.630719: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x471b980 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-28 19:18:32.630759: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-28 19:18:32.665388: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-04-28 19:18:32.665439: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33355}
2020-04-28 19:18:32.683216: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job worker -> {0 -> 10.240.1.2:8470}
2020-04-28 19:18:32.683268: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:300] Initialize GrpcChannelCache for job localhost -> {0 -> localhost:33355}
2020-04-28 19:18:32.690405: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:390] Started server with target: grpc://localhost:33355
taylanbil@taylanbil:~$ cat run.py 
import tensorflow as tf
TPU_name='taylanbil'
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(TPU_name)
print('Running on TPU ', resolver.master())
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

您如何创建您的 tpu 资源？你能仔细检查一下是否没有版本不匹配吗？

Answer 2

我使用 ctpu up --zone=europe-west4-a --disk-size-gb=50 --machine-type=n1-standard-2 --tf-version=2.2 --tpu-size v3-8 --name cola-tpu

创建了我的 VM + TPU

但我仍然无法访问 TPU，它像 OP 描述的那样挂起。

我打开了一个 Google 问题并在那里得到了答案：

This is a known issue that occurs some times and the product team is currently trying to fix it.

While that happens, let me propose some troubleshooting steps:

1- Disable and then reenable TPU API

If this does not work:

2.1- Go to VPC network > VPC network peering

2.2- Check if cp-to-tp-peeringdefault[somenumbers] has inactive status.

2.3- If it does, delete it and create a tpu node again

Please let us know if any of this worked for you so that we can close this ticket (in case it did) or keep with providing support (in case it did not).

对我来说，删除 cp-to-tp-peeringdefault 并重新创建 VM + TPU 很有效。

GCP 和 TPU，experimental_connect_to_cluster 没有回应

GCP and TPU, experimental_connect_to_cluster give no response

google-cloud-platform

tensorflow

google-cloud-tpu

tpu