AWS Sagemaker ValueError: Unsupported dtype object on array when using strings and dates
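I have a CSV file that I am trying to run RCF (Random Cut Forest) on. If I put dates or strings in the CSV, I get the error shown below. If I limit it to integer and float fields only, the script runs fine. Is there a way to handle dates and strings? I have seen the AWS taxi example, which has dates just like mine.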
eventData = pd.read_csv(data_location, delimiter=",", header=None, parse_dates=True)
print('Starting RCF Training')
# specify general training job information
rcf = RandomCutForest(role=sagemaker.get_execution_role(),
                      instance_count=1,
                      instance_type='ml.m4.xlarge',
                      data_location=data_location,
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      base_job_name="ad-rcf",
                      num_samples_per_tree=512,
                      num_trees=50)
rcf.fit(rcf.record_set(eventData.values))
CSV data that fails
392507,1613744,1/2/2020 19:11,1577238693,2469,3.30E+01,-9.67E+01
691381,1888551,12/10/2019 9:22,1575641745,3460,2.37E+01,9.04E+01
392507,1613744,1/2/2020 19:20,1577236815,1797,3.30E+01,-9.67E+01
392507,1613744,1/29/2020 19:04,1577264188,1797,3.30E+01,-9.67E+01
Error output
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-35-ba19bf5d66a2> in <module>
---> 21 rcf.fit(rcf.record_set(eventData.values))
22
23 print('Done RCF Training')
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in record_set(self, train, labels, channel, encrypt)
281 logger.debug("Uploading to bucket %s and key_prefix %s", bucket, key_prefix)
282 manifest_s3_file = upload_numpy_to_s3_shards(
--> 283 self.instance_count, s3, bucket, key_prefix, train, labels, encrypt
284 )
285 logger.debug("Created manifest file %s", manifest_s3_file)
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels, encrypt)
443 s3.Object(bucket, key_prefix + file).delete()
444 finally:
--> 445 raise ex
446
447
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/amazon_estimator.py in upload_numpy_to_s3_shards(num_shards, s3, bucket, key_prefix, array, labels, encrypt)
424 write_numpy_to_dense_tensor(file, shard, label_shards[shard_index])
425 else:
--> 426 write_numpy_to_dense_tensor(file, shard)
427 file.seek(0)
428 shard_index_string = str(shard_index).zfill(len(str(len(shards))))
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/common.py in write_numpy_to_dense_tensor(file, array, labels)
154 )
155 resolved_label_type = _resolve_type(labels.dtype)
--> 156 resolved_type = _resolve_type(array.dtype)
157
158 # Write each vector in array into a Record in the file object
/opt/conda/lib/python3.7/site-packages/sagemaker/amazon/common.py in _resolve_type(dtype)
288 if dtype == np.dtype("float32"):
289 return "Float32"
--> 290 raise ValueError("Unsupported dtype {} on array".format(dtype))
291
292
ValueError: Unsupported dtype object on array
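Solved my problem: RCF cannot handle dates and strings. The Kinesis documentation page from AWS, which covers the same Random Cut Forest algorithm (https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sqlrf-random-cut-forest.html), says: "The algorithm accepts the DOUBLE, INTEGER, FLOAT, TINYINT, SMALLINT, REAL, and BIGINT data types."
The catch in AWS's NYC Taxi example is that it uses .value, which refers ONLY to the value column of the data. The example is effectively dropping the date as a feature for RCF. It doesn't help that .values on an array does work and looks very similar to .value.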
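If the date should still contribute as a feature, one possible workaround is to turn it into a number before building the array. The following is only a minimal sketch using pandas and NumPy, not something taken from the AWS example. It assumes that column 2 holds the timestamp in month/day/year hour:minute form (as in the failing rows above) and that seconds since the Unix epoch is a meaningful encoding for your data. The names csv_text, numeric and features are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Two rows from the failing CSV, inlined so the sketch runs without S3.
csv_text = (
    "392507,1613744,1/2/2020 19:11,1577238693,2469,3.30E+01,-9.67E+01\n"
    "691381,1888551,12/10/2019 9:22,1575641745,3460,2.37E+01,9.04E+01\n"
)
event_data = pd.read_csv(io.StringIO(csv_text), header=None)

# The date column is read as text, so the combined array is dtype object,
# which is exactly what the ValueError complains about.
print(event_data.values.dtype)

# Assumption: column 2 is the timestamp, formatted month/day/year hour:minute.
timestamps = pd.to_datetime(event_data[2], format="%m/%d/%Y %H:%M")

# Replace the date with whole seconds since the Unix epoch.
numeric = event_data.copy()
numeric[2] = (timestamps - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# Drop any remaining non-numeric columns (for example free-text strings).
numeric = numeric.select_dtypes(include="number")

# float64 is the dtype an all-int/float DataFrame produces, which is the
# case reported as working in the question.
features = numeric.to_numpy(dtype="float64")
print(features.dtype, features.shape)
```

The resulting features array could then be passed to rcf.record_set(...) in place of eventData.values. String columns would need their own encoding (or be dropped, as select_dtypes does here), and whether raw epoch seconds are a useful signal for anomaly detection depends on the data.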