Google Cloud AI Platform: error when executing a job

We are using the Python googleapiclient API to create jobs on AI Platform.

from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import datetime
import logging

credentials = GoogleCredentials.get_application_default()
training_inputs = {
    'scaleTier': 'CUSTOM',
    'masterType': 'complex_model_m',
    'packageUris': ['package_bucket_file_path'],
    'pythonModule': 'randomforest_trainer_RUL.train',
    'args': [
        '--trainFilePath', data[0],
        '--trainOutputPath', data[2],
        '--testFilePath', data[1],
        '--testOutputPath', data[3],
        '--target', target_label,
        '--bucket', BUCKET,
        '--expid', experiment_id
    ],
    'region': "region_of_bucket",
    'runtimeVersion': '1.14',
    'pythonVersion': '3.5',
}

timestamp = datetime.datetime.now().strftime('%y%m%d_%H%M%S%f')
job_name = "job_" + experiment_id

# Log the job name so the run can be looked up later
logging.info("Job Name:{}".format(job_name))

# Request body: a jobId plus the trainingInput dict (assumed shape of job_spec)
job_spec = {'jobId': job_name, 'trainingInput': training_inputs}

api = discovery.build('ml', 'v1', credentials=credentials, cache_discovery=False)

project_id = 'projects/{}'.format(PROJECT)
request = api.projects().jobs().create(body=job_spec, parent=project_id)
response = request.execute()  # the job is only submitted once the request is executed
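
One side note on the snippet above: it computes `timestamp` but builds `job_name` from `experiment_id` alone. As far as I know, AI Platform expects each job ID to be unique within a project and to contain only letters, digits, and underscores, so re-submitting the same experiment may be rejected. Below is a minimal sketch of how the two values could be combined. `make_job_id` and `make_job_spec` are hypothetical helper names, not part of googleapiclient, and the sanitizing rule is my assumption about what the service accepts.

```python
import datetime
import re


def make_job_id(experiment_id, now=None):
    """Hypothetical helper: build 'job_<experiment>_<timestamp>'."""
    now = now or datetime.datetime.now()
    # Same timestamp format as in the question.
    timestamp = now.strftime('%y%m%d_%H%M%S%f')
    # Assumption: replace anything that is not a letter, digit, or underscore.
    safe_id = re.sub(r'[^A-Za-z0-9_]', '_', str(experiment_id))
    return 'job_{}_{}'.format(safe_id, timestamp)


def make_job_spec(job_id, training_inputs):
    """Hypothetical helper: body for projects().jobs().create()."""
    return {'jobId': job_id, 'trainingInput': training_inputs}


if __name__ == '__main__':
    job_id = make_job_id('exp-1')
    print(make_job_spec(job_id, {'scaleTier': 'CUSTOM'}))
```

The resulting dict can be passed as `body=` in place of `job_spec`.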

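It worked fine: until yesterday I was able to train the model, test it, and run predictions. But suddenly I can no longer train a model on AI Platform, and the error I get is: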

The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
  [...]
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 810, in ls
    combined_listing = self._ls(path, detail) + self._ls(path + "/", detail)
  File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-12>", line 2, in _ls
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 50, in _tracemethod
    return f(self, *args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 820, in _ls
    listing = self._list_objects(path)
  File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-5>", line 2, in _list_objects
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 50, in _tracemethod
    return f(self, *args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 616, in _list_objects
    listing = self._do_list_objects(path)
  File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-6>", line 2, in _do_list_objects
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 50, in _tracemethod
    return f(self, *args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 637, in _do_list_objects
    maxResults=max_results,
  File "</root/.local/lib/python3.5/site-packages/decorator.py:decorator-gen-2>", line 2, in _call
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 50, in _tracemethod
    return f(self, *args, **kwargs)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 517, in _call
    validate_response(r, path)
  File "/root/.local/lib/python3.5/site-packages/gcsfs/core.py", line 171, in validate_response
    raise IOError("Forbidden: %s\n%s" % (path, msg))
OSError: Forbidden: https://www.googleapis.com/storage/v1/b/some-storage-bucket/o/
service-87XX90XX1XX@cloud-ml.google.com.iam.gserviceaccount.com does not have serviceusage.services.use access to project 34XX12XX12X.

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=87XX90XX1XX&resource=ml_job%2Fjob_id%2Fjob_5de3592da3c3c541d73389er&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22job_5de3592da3c3c541d73389erce%22

The relevant part of the error is:

service-87XX90XX1XX@cloud-ml.google.com.iam.gserviceaccount.com 
    does not have serviceusage.services.use access to project 34XX12XX12X
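
To make it easier to see which account is missing which permission on which project, the message can be picked apart with a small regular expression. This is only a sketch: `parse_permission_error` is a hypothetical helper, and the pattern assumes the exact wording of the message above with real newlines rather than escaped `\n` sequences. Other Forbidden errors may be phrased differently.

```python
import re

# Pattern matched against the wording seen in this particular error.
PERMISSION_ERROR = re.compile(
    r'(?P<account>[\w.-]+@[\w.-]+)\s+does not have\s+'
    r'(?P<permission>[\w.]+)\s+access to project\s+(?P<project>\w+)'
)


def parse_permission_error(message):
    """Hypothetical helper: return account, permission, and project, or None."""
    match = PERMISSION_ERROR.search(message)
    return match.groupdict() if match else None


if __name__ == '__main__':
    msg = (
        'service-87XX90XX1XX@cloud-ml.google.com.iam.gserviceaccount.com \n'
        '    does not have serviceusage.services.use access to project 34XX12XX12X'
    )
    print(parse_permission_error(msg))
```

Returning `None` for an unrecognized message keeps the helper safe to call on any log line.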

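I ran into exactly the same problem today. As Nick said, this is an issue with the new GCSFS release. I suggest reading the CSV file from the bucket directly through TensorFlow's GFile function instead of using pd.read_csv(gcs_path):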

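import pandas as pd
import tensorflow as tf

def read_csv_from_gcs(gcs_path, opts=None):  # helper name is illustrative
    # tf.gfile (TF 1.x) opens gs:// paths without going through gcsfs.
    with tf.gfile.GFile(gcs_path) as f:
        if opts:
            # Assumes opts is a dict of pd.read_csv keyword arguments.
            df = pd.read_csv(f, **opts)
        else:
            df = pd.read_csv(f)
    return df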

This lets your job run to completion without interruption.