Sagemaker Random Cut Forest Training with Validation

I have been struggling with SageMaker's built-in RCF algorithm for a few days.

I want to validate the model during training, but I may have misunderstood something.

The first fit, with only the train channel, works fine:

# retrieve() takes (framework, region, version=None, ...), so the region is passed only once.
container = sagemaker.image_uris.retrieve("randomcutforest", region)
print(container)

rcf = sagemaker.estimator.Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    sagemaker_session=sagemaker.Session(),
    instance_type="ml.m4.xlarge",
    data_location=f"s3://{bucket}/{prefix}/",
    output_path=f"s3://{bucket}/{prefix}/output"
)

rcf.set_hyperparameters(
    feature_dim=116,
    eval_metrics='precision_recall_fscore',
    num_samples_per_tree=256,
    num_trees=100,
)

train_data = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='text/csv;label_size=0', distribution='ShardedByS3Key')

rcf.fit({'train': train_data})
[06/28/2021 09:45:24 INFO 140226936620864] Test data is not provided.
#metrics {"StartTime": 1624873524.6154933, "EndTime": 1624873524.6156445, "Dimensions": {"Algorithm": "RandomCutForest", "Host": "algo-1", "Operation": "training"}, "Metrics": {"setuptime": {"sum": 40.169477462768555, "count": 1, "min": 40.169477462768555, "max": 40.169477462768555}, "totaltime": {"sum": 13035.491704940796, "count": 1, "min": 13035.491704940796, "max": 13035.491704940796}}}


2021-06-28 09:45:50 Completed - Training job completed
ProfilerReport-1624873226: NoIssuesFound
Training seconds: 78
Billable seconds: 78

But when I try to validate my model during training:

train_data = sagemaker.inputs.TrainingInput(s3_data=train_location, content_type='text/csv;label_size=0', distribution='ShardedByS3Key')
val_data = sagemaker.inputs.TrainingInput(s3_data=val_location, content_type='text/csv;label_size=1', distribution='FullyReplicated')


rcf.fit({'train': train_data, 'validation': val_data}, wait=True)

I get this error:

AWS Region: us-east-1
RoleArn: arn:aws:iam::517714493426:role/service-role/AmazonSageMaker-ExecutionRole-20210409T152960
382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:1
2021-06-28 10:14:12 Starting - Starting the training job...
2021-06-28 10:14:14 Starting - Launching requested ML instancesProfilerReport-1624875252: InProgress
......
2021-06-28 10:15:27 Starting - Preparing the instances for training.........
2021-06-28 10:17:07 Downloading - Downloading input data...
2021-06-28 10:17:27 Training - Downloading the training image..Docker entrypoint called with argument(s): train
Running default environment configuration script
[06/28/2021 10:17:53 INFO 140648505521984] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-conf.json: {'num_samples_per_tree': 256, 'num_trees': 100, 'force_dense': 'true', 'eval_metrics': ['accuracy', 'precision_recall_fscore'], 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999}
[06/28/2021 10:17:53 INFO 140648505521984] Merging with provided configuration from /opt/ml/input/config/hyperparameters.json: {'num_trees': '100', 'num_samples_per_tree': '256', 'feature_dim': '116', 'eval_metrics': 'precision_recall_fscore'}
[06/28/2021 10:17:53 INFO 140648505521984] Final configuration: {'num_samples_per_tree': '256', 'num_trees': '100', 'force_dense': 'true', 'eval_metrics': 'precision_recall_fscore', 'epochs': 1, 'mini_batch_size': 1000, '_log_level': 'info', '_kvstore': 'dist_async', '_num_kv_servers': 'auto', '_num_gpus': 'auto', '_tuning_objective_metric': '', '_ftp_port': 8999, 'feature_dim': '116'}
[06/28/2021 10:17:53 ERROR 140648505521984] Customer Error: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)

Caused by: Additional properties are not allowed ('validation' was unexpected)

Failed validating 'additionalProperties' in schema:
    {'$schema': 'http://json-schema.org/draft-04/schema#',
     'additionalProperties': False,
     'definitions': {'data_channel_replicated': {'properties': {'ContentType': {'type': 'string'},
                                                                'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'},
                                                                'S3DistributionType': {'$ref': '#/definitions/s3_replicated_type'},
                                                                'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}},
                                                 'type': 'object'},
                     'data_channel_sharded': {'properties': {'ContentType': {'type': 'string'},
                                                             'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'},
                                                             'S3DistributionType': {'$ref': '#/definitions/s3_sharded_type'},
                                                             'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}},
                                              'type': 'object'},
                     'record_wrapper_type': {'enum': ['None', 'Recordio'],
                                             'type': 'string'},
                     's3_replicated_type': {'enum': ['FullyReplicated'],
                                            'type': 'string'},
                     's3_sharded_type': {'enum': ['ShardedByS3Key'],
                                         'type': 'string'},
                     'training_input_mode': {'enum': ['File', 'Pipe'],
                                             'type': 'string'}},
     'properties': {'state': {'$ref': '#/definitions/data_channel'},
                    'test': {'$ref': '#/definitions/data_channel_replicated'},
                    'train': {'$ref': '#/definitions/data_channel_sharded'}},
     'required': ['train'],
     'type': 'object'}

On instance:
    {'train': {'ContentType': 'text/csv;label_size=0',
               'RecordWrapperType': 'None',
               'S3DistributionType': 'ShardedByS3Key',
               'TrainingInputMode': 'File'},
     'validation': {'ContentType': 'text/csv;label_size=1',
                    'RecordWrapperType': 'None',
                    'S3DistributionType': 'FullyReplicated',
                    'TrainingInputMode': 'File'}}

2021-06-28 10:18:10 Uploading - Uploading generated training model
2021-06-28 10:18:10 Failed - Training job failed
ProfilerReport-1624875252: Stopping
---------------------------------------------------------------------------
UnexpectedStatusException                 Traceback (most recent call last)
<ipython-input-34-c624ace00c69> in <module>
     33 
     34 
---> 35 rcf.fit({'train': train_data, 'validation': val_data}, wait=True)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
    680         self.jobs.append(self.latest_training_job)
    681         if wait:
--> 682             self.latest_training_job.wait(logs=logs)
    683 
    684     def _compilation_job_name(self):

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/estimator.py in wait(self, logs)
   1623         # If logs are requested, call logs_for_jobs.
   1624         if logs != "None":
-> 1625             self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
   1626         else:
   1627             self.sagemaker_session.wait_for_job(self.job_name)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in logs_for_job(self, job_name, wait, poll, log_type)
   3679 
   3680         if wait:
-> 3681             self._check_job_status(job_name, description, "TrainingJobStatus")
   3682             if dot:
   3683                 print()

~/anaconda3/envs/python3/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
   3243                 ),
   3244                 allowed_statuses=["Completed", "Stopped"],
-> 3245                 actual_status=status,
   3246             )
   3247 

UnexpectedStatusException: Error for Training job randomcutforest-2021-06-28-10-14-12-783: Failed. Reason: ClientError: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)

Caused by: Additional properties are not allowed ('validation' was unexpected)

Failed validating 'additionalProperties' in schema:
    {'$schema': 'http://json-schema.org/draft-04/schema#',
     'additionalProperties': False,
     'definitions': {'data_channel_replicated': {'properties': {'ContentType': {'type': 'string'},
                                                                'RecordWrapperType': {'$ref': '#/definitions/record_wrapper_type'},
                                                                'S3DistributionType': {'$ref': '#/definitions/s3_replicated_type'},
                                                                'TrainingInputMode': {'$ref': '#/definitions/training_input_mode'}},
                                                 'type': 'object'},
                     'data_channel_sharded': {'properties': {'ContentType': {'type': 'string'},

Can someone help me implement this validation correctly during training? It would be the best thing that could happen to me. :-D

Kind regards, Christina

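I found the error: the channel has to be named 'test' instead of 'validation'. The schema in the error message only allows the channels 'train', 'test' and 'state', which is why 'validation' was rejected. With the renamed channel it works: rcf.fit({'train': train_data, 'test': test_data}, wait=True)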
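To catch this kind of mistake before a training job is launched, the channel names can be checked locally first. The following is only a minimal sketch: `check_rcf_channels` is a hypothetical helper, not part of the SageMaker SDK, and the allowed names are simply copied from the `properties` and `required` keys of the schema in the error message, so they may differ for other algorithms or image versions. Plain strings stand in for the `TrainingInput` objects so that the snippet runs without AWS access.

```python
# Channel names copied from the 'properties' and 'required' keys of the
# schema printed in the error message (assumption: they apply to this image).
ALLOWED_CHANNELS = {"train", "test", "state"}
REQUIRED_CHANNELS = {"train"}


def check_rcf_channels(inputs):
    """Hypothetical pre-flight check; not part of the SageMaker SDK.

    Raises ValueError if a channel name is not allowed or a required
    channel is missing; otherwise returns the inputs unchanged.
    """
    names = set(inputs)
    unexpected = sorted(names - ALLOWED_CHANNELS)
    missing = sorted(REQUIRED_CHANNELS - names)
    if unexpected:
        raise ValueError(
            f"unexpected channel(s): {unexpected}; "
            f"allowed: {sorted(ALLOWED_CHANNELS)}"
        )
    if missing:
        raise ValueError(f"missing required channel(s): {missing}")
    return inputs


# Strings stand in for the TrainingInput objects from the question.
fit_inputs = check_rcf_channels({"train": "train_data", "test": "val_data"})

# With real TrainingInput objects the call would then be:
# rcf.fit(fit_inputs, wait=True)
```

Passing `{'train': ..., 'validation': ...}` to this helper raises a `ValueError` naming 'validation', which mirrors the message from the failed job without waiting for the instances to start.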