How to Retrieve a Scalar Value from a Compute Function in Apache Arrow
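I am iterating over the elements of an Arrow array and trying to apply a compute function to each scalar that tells me the year, month, day, etc. of each element. The code looks something like this: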
arrow::NumericArray<arrow::Date32Type> array = {...}
for (int64_t i = 0; i < array.length(); i++) {
  arrow::Result<std::shared_ptr<arrow::Scalar>> result = array->GetScalar(i);
  if (!result.ok()) {
    // TODO: handle error
  }
  arrow::Result<arrow::Datum> year = arrow::compute::Year(*result);
}
However, I can't quite figure out how to extract the actual int64_t value from the arrow::compute::Year call. I have tried:
const std::shared_ptr<int64_t> val = year.ValueOrDie();
>>> error: conversion from 'arrow::Datum' to non-scalar type 'const std::shared_ptr<long int>' requested
I also tried simply assigning the result to an int64_t, which fails as well with error: cannot convert 'arrow::Datum' to 'int64_t'
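I don't see any method on the Datum class that would return a scalar value of the primitive type that I think arrow::compute::Year should be returning. Any idea what I am misunderstanding about the Datum / Scalar / Compute APIs?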
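Arrow's compute functions are really intended to be applied to arrays rather than scalars; otherwise the per-call overhead makes the operation inefficient. The arrow::compute::Year function takes a Datum. This is a convenience type that can hold a scalar, an array, ArrayData, a RecordBatch, or a Table. Not all functions accept all possible kinds of Datum (in particular, many functions do not accept a RecordBatch or a Table).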
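To make the "one value of several possible kinds" idea concrete, here is a minimal toy sketch in plain C++17. It is not Arrow code: the `toy_datum` namespace, its `Scalar`, `Array`, and `Datum` aliases, and the `AddOne` and `UnwrapScalar` helpers are all hypothetical stand-ins. It only illustrates why a Datum cannot be assigned directly to an int64_t: the caller has to ask for the kind it expects first.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <variant>
#include <vector>

namespace toy_datum {  // hypothetical stand-ins, not Arrow types

using Scalar = int64_t;
using Array = std::vector<int64_t>;
// A tagged union: holds exactly one of the listed kinds at a time.
using Datum = std::variant<Scalar, Array>;

// An element-wise function in the spirit of a compute function:
// a scalar input yields a scalar, an array input yields an array.
Datum AddOne(const Datum& input) {
  if (std::holds_alternative<Scalar>(input)) {
    return Datum(std::get<Scalar>(input) + 1);
  }
  Array out = std::get<Array>(input);
  for (int64_t& v : out) v += 1;
  return Datum(std::move(out));
}

// The caller unwraps according to the kind it expects to find.
int64_t UnwrapScalar(const Datum& d) {
  assert(std::holds_alternative<Scalar>(d));
  return std::get<Scalar>(d);
}

}  // namespace toy_datum
```

In this toy model, `UnwrapScalar(AddOne(Datum(int64_t{2021})))` plays the role that `year_datum.scalar()` plays in real Arrow code, and `std::get<Array>` plays the role of `make_array()`.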
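Once you have a result, there are a few ways to get the data out. Grabbing individual scalars is probably the least efficient, especially if you know the data type ahead of time (in this case we know the type will be int64_t). This is because a scalar is a type-erased wrapper around a value (like an "object" in Python or Java), and that wrapper comes with some overhead.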
So my suggestion would be:
// If you are going to be passing your array through the compute
// infrastructure you'll need to have it in a shared_ptr.
// Also, NumericArray is a base class so you don't often need
// to refer to it directly. You'll typically be getting one of the
// concrete subclasses like Date32Array
std::shared_ptr<arrow::Date32Array> array = {...}
// A datum can be implicitly constructed from a shared_ptr to an
// array. You could also explicitly construct it if that is more
// comfortable to you. Here `array` is being implicitly cast to a Datum.
ARROW_ASSIGN_OR_RAISE(arrow::Datum year_datum, arrow::compute::Year(array));
// Now we have a datum, but the docs tell us that for an array input the
// result of the `Year` function is an array, so let's just unwrap it. This is
// something that could probably be improved in Arrow (might as well
// return an array)
std::shared_ptr<arrow::Array> years_arr = year_datum.make_array();
// Also, we know that the data type is Int64 so let's go ahead and
// cast further
std::shared_ptr<arrow::Int64Array> years = std::dynamic_pointer_cast<arrow::Int64Array>(years_arr);
// The concrete classes can be iterated in a variety of ways. GetScalar
// is the least efficient (but doesn't require knowing the type up front).
// Since we know the type (we've cast to Int64Array) we can use Value(i)
// to get a single int64_t, raw_values() to get a const int64_t* (i.e. a
// C-style array) or, perhaps the simplest, begin() and end() to get STL
// compliant iterators. Each element is an optional that is empty when
// the slot is null, so check it before dereferencing.
for (auto year : *years) {
  if (year.has_value()) {
    std::cout << "Year: " << *year << std::endl;
  }
}
If you really want to use scalars:
std::shared_ptr<arrow::Array> array = {...}
for (int64_t i = 0; i < array->length(); i++) {
  arrow::Result<std::shared_ptr<arrow::Scalar>> result = array->GetScalar(i);
  if (!result.ok()) {
    // TODO: handle error
  }
  // A shared_ptr<Scalar> converts implicitly to a Datum, and a scalar
  // input gives back a scalar result
  ARROW_ASSIGN_OR_RAISE(arrow::Datum year_datum, arrow::compute::Year(*result));
  std::shared_ptr<arrow::Scalar> year_scalar = year_datum.scalar();
  std::shared_ptr<arrow::Int64Scalar> year_scalar_int = std::dynamic_pointer_cast<arrow::Int64Scalar>(year_scalar);
  // `value` is only meaningful when year_scalar_int->is_valid is true
  // (i.e. the input slot was not null)
  int64_t year = year_scalar_int->value;
}
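The cast chain in the scalar version above (base scalar, then a downcast, then `->value`) is what "type-erased wrapper" means in practice. The following is a minimal toy sketch of that pattern in plain C++, not Arrow's implementation: the `toy_scalar` namespace and the `BoxElement`, `Unbox`, `SumBoxed`, and `SumTyped` helpers are hypothetical names chosen for illustration. It shows that the boxed path pays for an allocation and a checked downcast per element, while the typed path reads the values directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

namespace toy_scalar {  // hypothetical stand-ins, not Arrow types

// Type-erased base: code holding only a Scalar cannot see the value.
struct Scalar {
  virtual ~Scalar() = default;
};

// Concrete subclass that actually stores the primitive.
struct Int64Scalar : Scalar {
  explicit Int64Scalar(int64_t v) : value(v) {}
  int64_t value;
};

// Boxed access: one heap allocation per element.
std::shared_ptr<Scalar> BoxElement(const std::vector<int64_t>& values, std::size_t i) {
  assert(i < values.size());
  return std::make_shared<Int64Scalar>(values[i]);
}

// Recovering the primitive requires a checked downcast first.
int64_t Unbox(const std::shared_ptr<Scalar>& scalar) {
  auto typed = std::dynamic_pointer_cast<Int64Scalar>(scalar);
  assert(typed != nullptr);
  return typed->value;
}

// Element-by-element through the type-erased wrapper.
int64_t SumBoxed(const std::vector<int64_t>& values) {
  int64_t total = 0;
  for (std::size_t i = 0; i < values.size(); ++i) {
    total += Unbox(BoxElement(values, i));
  }
  return total;
}

// Typed access: no allocation and no cast.
int64_t SumTyped(const std::vector<int64_t>& values) {
  int64_t total = 0;
  for (int64_t v : values) total += v;
  return total;
}

}  // namespace toy_scalar
```

Both functions compute the same sum; the difference is only in how much work each element costs, which is why the array-at-a-time approach is the recommended one when the type is known up front.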