Google Cloud AI Platform error when executing a job
We are using the Python googleapiclient API to create jobs in AI Platform.
import datetime
import logging

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

# data, target_label, BUCKET, experiment_id and PROJECT are defined elsewhere.
credentials = GoogleCredentials.get_application_default()
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'packageUris': ['package_bucket_file_path'],
    'pythonModule': 'randomforest_trainer_RUL.train',
    'args': [
        '--trainFilePath', data[0],
        '--trainOutputPath', data[2],
        '--testFilePath', data[1],
        '--testOutputPath', data[3],
        '--target', target_label,
        '--bucket', BUCKET,
        '--expid', experiment_id
    ],
    'region': "region_of_bucket",
    'runtimeVersion': '1.14',
    'pythonVersion': '3.5'
}
timestamp = datetime.datetime.now().strftime('%y%m%d_%H%M%S%f')
job_name = "job_" + experiment_id
# Logging information
logging.info("Job Name:{}".format(job_name))
# job_spec was not shown in the original post; this is the usual
# jobId + trainingInput body for projects.jobs.create (an assumption).
job_spec = {'jobId': job_name, 'trainingInput': training_inputs}
api = discovery.build('ml', 'v1', credentials=credentials, cache_discovery=False)
project_id = 'projects/{}'.format(PROJECT)
request = api.projects().jobs().create(body=job_spec, parent=project_id)
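To actually run the job, the request built above still has to be executed. The following is only a minimal sketch of that step, not the poster's code: the helper names build_job_spec and submit_job are hypothetical, and the job body is assumed to use the jobId and trainingInput fields of the AI Platform Training job resource.

```python
import logging


def build_job_spec(job_name, training_inputs):
    """Return the request body for projects.jobs.create (assumed shape)."""
    return {'jobId': job_name, 'trainingInput': training_inputs}


def submit_job(api, project_id, job_spec):
    """Submit the job and return the API response.

    `api` is the client returned by discovery.build('ml', 'v1', ...).
    """
    # Imported here so build_job_spec stays usable without the client library.
    from googleapiclient import errors

    request = api.projects().jobs().create(body=job_spec, parent=project_id)
    try:
        return request.execute()
    except errors.HttpError as err:
        # Only submission failures land here; failures inside the training
        # code are reported in the job's logs instead.
        logging.error("Job submission failed: %s", err)
        raise
```

Catching HttpError at submission time separates problems with the request itself (bad job name, missing permissions to create jobs) from errors raised later by the trainer package.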
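This worked fine: until yesterday I was able to train the model, test it, and run predictions.
But suddenly I can no longer train the model on AI Platform. The error I get is: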
The replica master 0 exited with a non-zero status of 1. \nTraceback (most recent call last):\n [...]\n
File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 810, in ls\n
combined_listing = self._ls(path, detail) + self._ls(path + "/", detail)\n
File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-12>", line 2, in _ls\n
File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 50, in _tracemethod\n
return f(self, *args, **kwargs)\n File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py",
line 820, in _ls\n listing = self._list_objects(path)\n File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-5>",
line 2, in _list_objects\n File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py",
line 50, in _tracemethod\nreturn f(self, *args, **kwargs)\n File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py",
line 616, in _list_objects\n listing = self._do_list_objects(path)\n File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-6>",
line 2, in _do_list_objects\n File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py",
line 50, in _tracemethod\n return f(self, *args, **kwargs)\n File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py",
line 637, in _do_list_objects\n maxResults=max_results,\n File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-2>",
line 2, in _call\n File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py",
line 50, in _tracemethod\n return f(self, *args, **kwargs)\n File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py",
line 517, in _call\n validate_response(r, path)\n File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py",
line 171, in validate_response\n raise IOError("Forbidden: %s\n%s" % (path, msg))\nOSError:
Forbidden: https://www.googleapis.com/storage/v1/b/some-storage-bucket/o/\nservice-87XX90XX1XX@cloud-ml.google.com.iam.gserviceaccount.com
does not have serviceusage.services.use access to project 34XX12XX12X.\n\nTo find out more about why your job exited
please check the logs: https://console.cloud.google.com/logs/viewer?project=87XX90XX1XX&resource=ml_job%2Fjob_id%2Fjob_5de3592da3c3c541d73389er&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22job_5de3592da3c3c541d73389erce%22
In short, the error I get is:
service-87XX90XX1XX@cloud-ml.google.com.iam.gserviceaccount.com
does not have serviceusage.services.use access to project 34XX12XX12X
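The message names three things: the identity that was denied, the permission it lacks, and the project the check ran against. As written above, the project number in the denial (34XX12XX12X) is not the one that appears in the service account name and the logs URL (87XX90XX1XX), which is worth checking when diagnosing. The snippet below is a small sketch for pulling those fields out of such a message; the function name parse_permission_error and the regular expression are illustrative assumptions, not part of any Google library.

```python
import re

# Matches "<account> does not have <permission> access to project <id>".
_DENIED = re.compile(
    r'([\w.-]+@[\w.-]+\.gserviceaccount\.com)\s+'
    r'does not have (\S+) access to project (\w+)'
)


def parse_permission_error(message):
    """Return the account, permission and project from a denial message.

    Returns None when the message does not match the expected wording.
    """
    # Job error strings often contain a literal backslash-n instead of a
    # real newline; normalise that before matching.
    text = message.replace('\\n', '\n')
    match = _DENIED.search(text)
    if match is None:
        return None
    account, permission, project = match.groups()
    return {'account': account, 'permission': permission, 'project': project}
```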
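I ran into exactly the same problem today. As Nick said, this is an issue with the new GCSFS release. I suggest reading the CSV file from the bucket directly through TensorFlow's GFile function instead of using pd.read_csv(gcs_path).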
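import pandas as pd
import tensorflow as tf

def read_csv_from_gcs(gcs_path, opts=None):
    # Open the object with TensorFlow's file API (TF 1.x) instead of gcsfs.
    with tf.gfile.GFile(gcs_path) as f:
        if opts:
            # Passed positionally, as in the original answer.
            df = pd.read_csv(f, opts)
        else:
            df = pd.read_csv(f)
    return df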
This lets your job run to completion without interruption.