AWS Sagemaker ValueError: Unsupported dtype object on array when using strings and dates

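I have a CSV file that I am trying to run RCF (Random Cut Forest) on. If I include dates or strings in the CSV, I get the error shown below. If I restrict it to integer and float fields only, the script runs fine. Is there any way to handle dates and strings? I have seen the AWS taxi example, and its dates look just like mine.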

eventData = pd.read_csv(data_location, delimiter=",", header=None, parse_dates=True)

print('Starting RCF Training')
# specify general training job information
rcf = RandomCutForest(role=sagemaker.get_execution_role(),
                      instance_count=1,
                      instance_type='ml.m4.xlarge',
                      data_location=data_location,
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      base_job_name="ad-rcf",
                      num_samples_per_tree=512,
                      num_trees=50)

rcf.fit(rcf.record_set(eventData.values))

CSV data that fails

392507,1613744,1/2/2020 19:11,1577238693,2469,3.30E+01,-9.67E+01
691381,1888551,12/10/2019 9:22,1575641745,3460,2.37E+01,9.04E+01
392507,1613744,1/2/2020 19:20,1577236815,1797,3.30E+01,-9.67E+01
392507,1613744,1/29/2020 19:04,1577264188,1797,3.30E+01,-9.67E+01

Error output

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-ba19bf5d66a2> in <module>
---> 21 rcf.fit(rcf.record_set(eventData.values))
     22 
     23 print('Done RCF Training')

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in record_set(self, train, labels, channel, encrypt)
    281         logger.debug("Uploading to bucket %s and key_prefix %s", bucket, key_prefix)
    282         manifest_s3_file = upload_numpy_to_s3_shards(
--> 283             self.instance_count, s3, bucket, key_prefix, train, labels, encrypt
    284         )
    285         logger.debug("Created manifest file %s", manifest_s3_file)

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels, encrypt)
    443                 s3.Object(bucket, key_prefix + file).delete()
    444         finally:
--> 445             raise ex
    446 
    447 

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels, encrypt)
    424                     write_numpy_to_dense_tensor(file, shard, label_shards[shard_index])
    425                 else:
--> 426                     write_numpy_to_dense_tensor(file, shard)
    427                 file.seek(0)
    428                 shard_index_string = str(shard_index).zfill(len(str(len(shards))))

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
    154             )
    155         resolved_label_type = _resolve_type(labels.dtype)
--> 156     resolved_type = _resolve_type(array.dtype)
    157 
    158     # Write each vector in array into a Record in the file object

/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
    288     if dtype == np.dtype("float32"):
    289         return "Float32"
--> 290     raise ValueError("Unsupported dtype {} on array".format(dtype))
    291 
    292 

ValueError: Unsupported dtype object on array

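Solved my own problem: RCF cannot handle dates and strings. The AWS Kinesis documentation describes the same Random Cut Forest algorithm at https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest.html and states that "The algorithm accepts the DOUBLE, INTEGER, FLOAT, TINYINT, SMALLINT, REAL, and BIGINT data types."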
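Since only numeric types are accepted, one possible workaround (a sketch, not something the AWS docs prescribe) is to replace the date column with numeric features and cast everything to float32 before calling record_set. The helper name to_numeric_features and the choice of hour and day of week as features are my own assumptions; the sample rows are taken from the failing CSV in the question.

```python
import io

import numpy as np
import pandas as pd

# Two rows copied from the failing CSV in the question.
SAMPLE_CSV = """392507,1613744,1/2/2020 19:11,1577238693,2469,3.30E+01,-9.67E+01
691381,1888551,12/10/2019 9:22,1575641745,3460,2.37E+01,9.04E+01
"""


def to_numeric_features(df, date_col):
    """Hypothetical helper: swap the date column for numeric features, return float32."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col], format="%m/%d/%Y %H:%M")
    out = out.drop(columns=[date_col])
    # Assumed feature choice; any numeric encoding of the date would do.
    out["hour"] = dates.dt.hour
    out["day_of_week"] = dates.dt.dayofweek
    # Note: float32 cannot represent very large integers (e.g. epoch seconds) exactly.
    return out.to_numpy(dtype=np.float32)


event_data = pd.read_csv(io.StringIO(SAMPLE_CSV), header=None)
print(event_data.values.dtype)  # object, which is what record_set rejects

features = to_numeric_features(event_data, date_col=2)
print(features.dtype, features.shape)
# features could then be passed to rcf.record_set(features) in place of eventData.values
```

Whether time-of-day features are actually useful for anomaly detection depends on the data; dropping the date column entirely is the simpler option if they are not.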

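The catch with the AWS NYC Taxi example is that it uses .value, which refers only to the value column of the data. They are effectively dropping the date as a feature for RCF. It doesn't help that .values on an array is a real thing and looks very similar to .value.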
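To make the .value versus .values difference concrete, here is a small sketch. The frame below is a made-up stand-in with the same two-column shape as the taxi data (a timestamp column and a column literally named value); the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the NYC Taxi frame: a timestamp and a "value" column.
taxi_data = pd.DataFrame({
    "timestamp": ["2014-07-01 00:00:00", "2014-07-01 00:30:00"],
    "value": [10844, 8127],
})

# .value is attribute access to the column named "value": numeric only.
only_value_column = taxi_data.value.to_numpy().reshape(-1, 1)

# .values is the whole frame as an array: the timestamp strings force dtype object.
whole_frame = taxi_data.values

print(only_value_column.dtype)  # an integer dtype
print(whole_frame.dtype)        # object
```

So the taxi example never sends its dates to RCF at all, while eventData.values in the question sends every column, including the date strings.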