TensorFlow Serving compilation with CUDA 9 for the new AWS p3 instances
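I was able to recompile TensorFlow from Amazon's modified sources (provided in the new Deep Learning AMI).
I am now trying to compile TensorFlow Serving against that TensorFlow "fork", but I get this error: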
ERROR: /root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/contrib/nccl/BUILD:68:1: undeclared inclusion(s) in rule '@org_tensorflow//tensorflow/contrib/nccl:nccl_kernels':
this rule is missing dependency declarations for the following files included by 'external/org_tensorflow/tensorflow/contrib/nccl/kernels/nccl_rewrite.cc':
'/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/common_runtime/optimization_registry.h'
'/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/common_runtime/device_set.h'
'/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/common_runtime/device.h'
'/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/graph/types.h'
'/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/graph/costmodel.h'
'/root/.cache/bazel/_bazel_root/98acb40d8921d865487eab808ed364b2/external/org_tensorflow/tensorflow/core/graph/node_builder.h'
INFO: Elapsed time: 20.377s, Critical Path: 19.47s
FAILED: Build did NOT complete successfully
More info: I am using the master branch of TensorFlow Serving (commit 7a349752c2cbbe741edb91c6c6be1c571e91a5fb) and Bazel version 0.7.0.
I also made a small change to tools/bazel.rc to work around another compilation error:
# git diff tools/bazel.rc
diff --git a/tools/bazel.rc b/tools/bazel.rc
index 9397f97..28476f3 100644
--- a/tools/bazel.rc
+++ b/tools/bazel.rc
@@ -1,4 +1,4 @@
-build:cuda --crosstool_top=@org_tensorflow//third_party/gpus/crosstool
+build:cuda --crosstool_top=@local_config_cuda//crosstool:toolchain
build:cuda --define=using_cuda=true --define=using_cuda_nvcc=true
build --force_python=py2
Any idea what is missing?
I usually disable NCCL, since it never seems to build properly:
RUN \
cd $TENSORFLOW_SERVING_HOME \
# Remove NCCL since it isn't building properly
&& sed -i.bak '/nccl/d' tensorflow/tensorflow/contrib/BUILD \
&& bazel build -c opt --config=cuda \
--verbose_failures \
--spawn_strategy=standalone --genrule_strategy=standalone \
--copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-mfpmath=both --copt=-msse4.1 --copt=-msse4.2 \
--crosstool_top=@local_config_cuda//crosstool:toolchain \
tensorflow_serving/... \
&& chmod a+x bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server \
&& cp bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server /usr/local/bin/ \
&& bazel clean --expunge
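To see what the sed step does before running it on the real tree, here is a minimal sketch on a scratch file. The BUILD contents below are a made-up stand-in, not the actual tensorflow/contrib/BUILD, and the workdir and remaining variable names are only for illustration:

```shell
# Scratch directory so the real source tree is not touched
workdir=$(mktemp -d)

# Hypothetical stand-in for tensorflow/contrib/BUILD (contents are illustrative only)
cat > "$workdir/BUILD" <<'EOF'
py_library(
    name = "contrib_py",
    deps = [
        "//tensorflow/contrib/layers:layers_py",
        "//tensorflow/contrib/nccl:nccl_py",
        "//tensorflow/contrib/slim",
    ],
)
EOF

# Same command as in the RUN step: delete every line containing "nccl",
# keeping the original as BUILD.bak
sed -i.bak '/nccl/d' "$workdir/BUILD"

# grep -c exits non-zero when the count is 0, hence the "|| true"
remaining=$(grep -c nccl "$workdir/BUILD" || true)
echo "nccl lines remaining: $remaining"
echo "nccl lines in backup: $(grep -c nccl "$workdir/BUILD.bak")"
```

Note that the pattern is broad: any line mentioning nccl is removed, so it is worth diffing BUILD against BUILD.bak before building.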