How to fix SageMaker data-quality monitoring-schedule job that fails with 'FailureReason': 'Job inputs had no data'

I am trying to schedule a data-quality monitoring job in AWS SageMaker, following the steps mentioned in this AWS documentation page. I have enabled data capture for my endpoint. Then I trained a baseline on my training csv file, and the statistics and constraints are available in S3 as shown here:

```python
from sagemaker import get_execution_role
from sagemaker import image_uris
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

my_data_monitor = DefaultModelMonitor(
    role=get_execution_role(),
    instance_count=1,
    instance_type='ml.m5.large',
    volume_size_in_gb=30,
    max_runtime_in_seconds=3_600)

# base s3 directory
baseline_dir_uri = 's3://api-trial/data_quality_no_headers/'
# train data, that I have used to generate the baseline
baseline_data_uri = baseline_dir_uri + 'ch_train_no_target.csv'
# directory in the s3 bucket that I have stored my baseline results to
baseline_results_uri = baseline_dir_uri + 'baseline_results_try17/'

my_data_monitor.suggest_baseline(
    baseline_dataset=baseline_data_uri,
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri=baseline_results_uri,
    wait=True, logs=False, job_name='ch-dq-baseline-try21'
)
```

The data is available in S3:

Then I tried to schedule my data-quality monitoring job following this example notebook for model-quality-monitoring in the sagemaker-examples github repo, making the necessary modifications based on the feedback in the error messages.

Here is how I tried to schedule the data-quality monitoring job from SageMaker Studio:

```python
from sagemaker import get_execution_role
from sagemaker import image_uris
from sagemaker.model_monitor import CronExpressionGenerator
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor import EndpointInput
from sagemaker.model_monitor.dataset_format import DatasetFormat

# base s3 directory
baseline_dir_uri = 's3://api-trial/data_quality_no_headers/'

# train data, that I have used to generate the baseline
baseline_data_uri = baseline_dir_uri + 'ch_train_no_target.csv'

# directory in the s3 bucket that I have stored my baseline results to
baseline_results_uri = baseline_dir_uri + 'baseline_results_try17/'
# s3 locations of the baseline job's outputs
baseline_statistics = baseline_results_uri + 'statistics.json'
baseline_constraints = baseline_results_uri + 'constraints.json'

# directory in the s3 bucket that I would like to store the monitoring-schedule results in
monitoring_outputs = baseline_dir_uri + 'monitoring_results_try17/'

# myendpoint_name was defined earlier, when the endpoint was deployed
ch_dq_ep = EndpointInput(endpoint_name=myendpoint_name,
                         destination="/opt/ml/processing/input_data",
                         s3_input_mode="File",
                         s3_data_distribution_type="FullyReplicated")

monitor_schedule_name = 'ch-dq-monitor-schdl-try21'

my_data_monitor.create_monitoring_schedule(endpoint_input=ch_dq_ep,
                                           monitor_schedule_name=monitor_schedule_name,
                                           output_s3_uri=baseline_dir_uri,
                                           constraints=baseline_constraints,
                                           statistics=baseline_statistics,
                                           schedule_cron_expression=CronExpressionGenerator.hourly(),
                                           enable_cloudwatch_metrics=True)
```

After about an hour, when I check the schedule status like this:

```python
import boto3
boto3_sm_client = boto3.client('sagemaker')
boto3_sm_client.describe_monitoring_schedule(MonitoringScheduleName='ch-dq-monitor-schdl-try21')
```

I get the failure status like this:

```
'MonitoringExecutionStatus': 'Failed',
  ...
  'FailureReason': 'Job inputs had no data'},
```

The whole message:

```
{'MonitoringScheduleArn': 'arn:aws:sagemaker:ap-south-1:<my-account-id>:monitoring-schedule/ch-dq-monitor-schdl-try21',
 'MonitoringScheduleName': 'ch-dq-monitor-schdl-try21',
 'MonitoringScheduleStatus': 'Scheduled',
 'MonitoringType': 'DataQuality',
 'CreationTime': datetime.datetime(2021, 9, 14, 13, 7, 31, 899000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2021, 9, 14, 14, 1, 13, 247000, tzinfo=tzlocal()),
 'MonitoringScheduleConfig': {'ScheduleConfig': {'ScheduleExpression': 'cron(0 * ? * * *)'},
  'MonitoringJobDefinitionName': 'data-quality-job-definition-2021-09-14-13-07-31-483',
  'MonitoringType': 'DataQuality'},
 'EndpointName': 'ch-dq-nh-try21',
 'LastMonitoringExecutionSummary': {'MonitoringScheduleName': 'ch-dq-monitor-schdl-try21',
  'ScheduledTime': datetime.datetime(2021, 9, 14, 14, 0, tzinfo=tzlocal()),
  'CreationTime': datetime.datetime(2021, 9, 14, 14, 1, 9, 405000, tzinfo=tzlocal()),
  'LastModifiedTime': datetime.datetime(2021, 9, 14, 14, 1, 13, 236000, tzinfo=tzlocal()),
  'MonitoringExecutionStatus': 'Failed',
  'EndpointName': 'ch-dq-nh-try21',
  'FailureReason': 'Job inputs had no data'},
 'ResponseMetadata': {'RequestId': 'dd729244-fde9-44b5-9904-066eea3a49bb',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': 'dd729244-fde9-44b5-9904-066eea3a49bb',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '835',
   'date': 'Tue, 14 Sep 2021 14:27:53 GMT'},
  'RetryAttempts': 0}}
```
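
For reference, the two fields that matter can be pulled out of this response programmatically; here is a minimal sketch (the helper function is my own, not part of the SDK):

```python
# A small helper (my own, not part of boto3) to pull the latest execution
# status and failure reason out of a describe_monitoring_schedule response.
def last_execution_summary(describe_response):
    summary = describe_response.get('LastMonitoringExecutionSummary', {})
    return (summary.get('MonitoringExecutionStatus'),
            summary.get('FailureReason'))

# Trimmed-down version of the response shown above.
response = {
    'MonitoringScheduleStatus': 'Scheduled',
    'LastMonitoringExecutionSummary': {
        'MonitoringExecutionStatus': 'Failed',
        'FailureReason': 'Job inputs had no data',
    },
}
print(last_execution_summary(response))  # ('Failed', 'Job inputs had no data')
```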

Things you might think have gone wrong on my end, or that might help solve the problem:

  1. The dataset used for the baseline: I tried creating the baseline with and without the target variable (i.e. the dependent variable, or y) in the dataset, but the error occurred both times. So I don't think the error is caused by this.
  2. No log groups are created for these jobs for me to look at and try to debug the issue. The baseline jobs do have log groups, so I suspect the role used for the monitoring-schedule jobs lacks the permissions to create log groups or streams.
  3. Role: the role I attached is defined by get_execution_role(), which points to a role with full access to SageMaker, CloudWatch, S3 and a few other services.
  4. Data collected from the endpoint during inference: this is how one line of the .jsonl file containing the data captured during inference is saved to S3:

```
{"captureData":{"endpointInput":{"observedContentType":"application/json","mode":"INPUT","data":"{\"longitude\": [-122.32, -117.58], \"latitude\": [37.55, 33.6], \"housing_median_age\": [50.0, 5.0], \"total_rooms\": [2501.0, 5348.0], \"total_bedrooms\": [433.0, 659.0], \"population\": [1050.0, 1862.0], \"households\": [410.0, 555.0], \"median_income\": [4.6406, 11.0567]}","encoding":"JSON"},"endpointOutput":{"observedContentType":"text/html; charset=utf-8","mode":"OUTPUT","data":"eyJtZWRpYW5faG91c2VfdmFsdWUiOiBbNDUyOTU3LjY5LCA0NjcyMTQuNF19","encoding":"BASE64"}},"eventMetadata":{"eventId":"9804d438-eb4c-4cb4-8f1b-d0c832b641aa","inferenceId":"ef07163d-ea2d-4730-92f3-d755bc04ae0d","inferenceTime":"2021-09-14T13:59:03Z"},"eventVersion":"0"}
```
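
As a sanity check, the BASE64-encoded endpointOutput in that record decodes to the model's predictions, which suggests the capture itself is working:

```python
import base64
import json

# The endpointOutput "data" field from the captured record above is
# BASE64-encoded; decoding it recovers the model's JSON predictions.
encoded_output = "eyJtZWRpYW5faG91c2VfdmFsdWUiOiBbNDUyOTU3LjY5LCA0NjcyMTQuNF19"
decoded = json.loads(base64.b64decode(encoded_output))
print(decoded)  # {'median_house_value': [452957.69, 467214.4]}
```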

I would like to know what went wrong in this whole process that caused the data not to reach my monitoring job.

This happens during the ground-truth-merge job when spark finds no data in the '/opt/ml/processing/groundtruth/' or '/opt/ml/processing/input_data/' directories. That occurs when you haven't sent any requests to the sagemaker endpoint, or when no ground truth is available.

I got this error because the folder /opt/ml/processing/input_data/, mapped to the docker volume of the monitoring container, had no data to process. That happened because nothing could be found in S3 to drive the whole process, including fetching the data. And that, in turn, happened because there was an extra slash (/) in the directory where the endpoint's captured data was to be saved. To elaborate: when creating the endpoint, the directory I specified was s3://<bucket-name>/<folder-1>/, whereas it should have been s3://<bucket-name>/<folder-1>. So when the process that copies data from S3 to the docker volume tried to fetch that hour's data, the directory it tried to pull from was s3://<bucket-name>/<folder-1>//<endpoint-name>/<variant-name>/<year>/<month>/<date>/<hour> (note the two slashes). Once I recreated the endpoint configuration without the trailing slash in the S3 directory, this error went away, and the ground-truth-merge operation that is part of model-quality monitoring succeeded.
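
The effect of that trailing slash can be illustrated with plain string handling (a sketch only: the folder name and the path assembly below are my own illustration of what SageMaker does internally, which is to append "/<endpoint>/<variant>/<yyyy>/<mm>/<dd>/<hh>" to the configured destination URI without normalizing slashes):

```python
# Illustrative reconstruction of the hourly data-capture prefix; not an SDK
# function. SageMaker concatenates the path components without normalizing
# slashes, so a trailing slash in the destination URI survives as "//".
def capture_prefix(destination_s3_uri, endpoint_name, variant_name,
                   year, month, day, hour):
    return (f"{destination_s3_uri}/{endpoint_name}/{variant_name}/"
            f"{year:04d}/{month:02d}/{day:02d}/{hour:02d}")

# "captured" is a hypothetical folder name for this illustration.
good = capture_prefix("s3://api-trial/captured", "ch-dq-nh-try21",
                      "AllTraffic", 2021, 9, 14, 14)
bad = capture_prefix("s3://api-trial/captured/", "ch-dq-nh-try21",
                     "AllTraffic", 2021, 9, 14, 14)
print(good)  # s3://api-trial/captured/ch-dq-nh-try21/AllTraffic/2021/09/14/14
print(bad)   # ...captured//ch-dq-nh-try21/... - the double slash makes the
             # monitoring job look under a key prefix that holds no objects
```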

I am answering my own question because someone read it and upvoted it, which means others have run into this issue too. So I have written up what worked for me. I am adding this note so that StackExchange doesn't think I am spamming the forum.