Write Apache Arrow table to string C++

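I am trying to write an Apache Arrow table to a string. My larger example has problems, and I can't get this small example to work either. This one segfaults inside Arrow during the WriteTable call. My larger example doesn't seem to serialize correctly.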

#include <arrow/api.h>
#include <arrow/io/memory.h>
#include <arrow/ipc/api.h>
 
std::shared_ptr<arrow::Table> makeSimpleFakeArrowTable() {
    std::vector<std::shared_ptr<arrow::Field>> arrowFields;
    arrowFields.emplace_back(std::make_shared<arrow::Field>("Field1", arrow::int64()));
    arrowFields.emplace_back(std::make_shared<arrow::Field>("Field2", arrow::float64()));

    auto schema = std::make_shared<arrow::Schema>(arrowFields);

    std::vector<std::shared_ptr<arrow::Array>> columns(schema->num_fields());

    arrow::Int64Builder longBuilder;
    longBuilder.Append(20);
    longBuilder.Finish(&(columns.at(0)));
    arrow::DoubleBuilder doubleBuilder;
    doubleBuilder.Append(10.0);
    longBuilder.Finish(&(columns.at(1)));

    return arrow::Table::Make(schema, columns);
}

std::shared_ptr<arrow::RecordBatch>
getArrowBatchFromBytes(const std::string& bytes) {
    arrow::io::BufferReader arrowBufferReader{bytes};
    auto streamReader =
        arrow::ipc::RecordBatchStreamReader::Open(&arrowBufferReader).ValueOrDie();

    auto batch = streamReader->Next().ValueOrDie();

    return batch;
}


std::string arrowTableToByteString(const std::shared_ptr<arrow::Table>& table) {
    auto stream = arrow::io::BufferOutputStream::Create().ValueOrDie();
    auto batchWriter = arrow::ipc::MakeStreamWriter(stream, table->schema()).ValueOrDie();

    auto status = batchWriter->WriteTable(*table);
    if (not status.ok()) {
        throw std::runtime_error(
            "Couldn't write Arrow Table to byte string. Arrow status was: '" +
            status.ToString() + "'.");
    }

    std::shared_ptr<arrow::Buffer> buffer = stream->Finish().ValueOrDie();
    return buffer->ToHexString();
}

int main(int argc, char** argv) {
    auto simpleFakeArrowTable = makeSimpleFakeArrowTable();
    std::string tableAsByteString = arrowTableToByteString(simpleFakeArrowTable);

    auto batch = getArrowBatchFromBytes(tableAsByteString);
    assert(batch != nullptr);
}

Two things come to mind. First, I think this is a typo:

    longBuilder.Finish(&(columns.at(0)));
    arrow::DoubleBuilder doubleBuilder;
    doubleBuilder.Append(10.0);
    longBuilder.Finish(&(columns.at(1))); // Shouldn't this be doubleBuilder?

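Whenever you build an Arrow table yourself, it is a good idea to call arrow::Table::ValidateFull. That helps catch mistakes like this one (in this case, the returned status reports that the input arrays don't match the schema).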

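Second, once that is fixed, we still get an error, because your return buffer->ToHexString(); converts the byte array into a hex string with two hex digits per byte (for example, the bytes [10, 20, 30] become the bytes [48, 65, 49, 52, 49, 69], usually displayed as 0A141E).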
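The effect can be reproduced without Arrow. The sketch below uses a hypothetical stand-in, hexEncode, which assumes ToHexString emits two uppercase hex digits per byte, and compares it with a plain byte copy (rawBytes, also a made-up name).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for hex encoding: two uppercase hex digits per byte.
std::string hexEncode(const std::vector<std::uint8_t>& bytes) {
    static const char kDigits[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (std::uint8_t b : bytes) {
        out.push_back(kDigits[b >> 4]);    // high nibble
        out.push_back(kDigits[b & 0x0F]);  // low nibble
    }
    return out;
}

// Copies the bytes unchanged, which is what a serialization round trip needs.
std::string rawBytes(const std::vector<std::uint8_t>& bytes) {
    return std::string(bytes.begin(), bytes.end());
}
```

For the input {10, 20, 30}, hexEncode yields the six characters 0A141E, while rawBytes yields three bytes with the original values. Only the latter is something an IPC reader can parse.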

You then turn around and try to read those hex characters as a table with arrow::io::BufferReader arrowBufferReader{bytes};. If I change ToHexString to ToString, your example runs and returns 0.