
Apache Arrow Bus Error/Seg Fault when using Python bindings

I am writing data to a parquet file. Apache Arrow provides a simple example of this, parquet-arrow, where the data flow is essentially: data => arrow::ArrayBuilder => arrow::Array => arrow::Table => parquet file. This works fine as standalone C++, but when I bind this code into a python module and call it from python (I am using Python 3.8.0), a Bus Error 10 (or Seg Fault 11) consistently occurs at the arrow::ArrayBuilder => arrow::Array step (i.e. in the ArrayBuilder::Finish function). Does anyone know why this happens or how to correct it?

I have tried a number of tweaks to resolve this, such as linking against static vs. dynamic libraries, using variants of the ArrayBuilder::Finish overloads, and using different tools to create the python module/.so (both pybind11 and boost-python), but the error persists. It consistently crashes in arrow::ArrayBuilder::Finish(std::shared_ptr<arrow::Array>*). I am running on macOS. This simple .py and .cc code is enough to reproduce the error:

pybindtest.py:

import pybindtest
pybindtest.python_bind_test()

pybindtest.cc:

#include <iostream>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
#include <pybind11/pybind11.h>

std::shared_ptr<arrow::Table> generate_table() {
  arrow::Int64Builder i64builder;
  std::shared_ptr<arrow::Array> i64array;
  PARQUET_THROW_NOT_OK(i64builder.AppendValues({2, 4}));
  PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));

  arrow::StringBuilder strbuilder;
  std::shared_ptr<arrow::Array> strarray;
  PARQUET_THROW_NOT_OK(strbuilder.Append("some"));
  PARQUET_THROW_NOT_OK(strbuilder.Append("content"));
  PARQUET_THROW_NOT_OK(strbuilder.Finish(&strarray));

  std::shared_ptr<arrow::Schema> schema = arrow::schema(
      {arrow::field("int", arrow::int64()), 
       arrow::field("str", arrow::utf8())});

  return arrow::Table::Make(schema, {i64array, strarray});
}

void write_parquet_file(const arrow::Table& table) {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("pybindtest.parquet"));
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
}

void python_bind_test() {
  std::shared_ptr<arrow::Table> table = generate_table();
  write_parquet_file(*table);
}

PYBIND11_MODULE(pybindtest, m) {
  m.def("python_bind_test", &python_bind_test);
}
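For reference, a module like this can be compiled on macOS roughly as follows; the exact flags, include paths, and library locations are assumptions and will vary by installation:

```shell
# Hypothetical build command for the pybind11 module on macOS.
# Assumes pybind11 was pip-installed and the Arrow/Parquet development
# libraries are discoverable by the linker; adjust -I/-L paths as needed.
c++ -O2 -Wall -std=c++17 -shared -fPIC -undefined dynamic_lookup \
    $(python3 -m pybind11 --includes) \
    pybindtest.cc \
    -o pybindtest$(python3-config --extension-suffix) \
    -larrow -lparquet
```

With the resulting pybindtest.cpython-38-darwin.so on the import path, the two-line .py script above is enough to trigger the crash.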

Here is the backtrace from one of the core dumps:

$ lldb -c core.84103 
(lldb) target create --core "core.84103"
Core file '/cores/core.84103' (x86_64) was loaded.

(lldb) bt
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff91b52a58 libc++abi.dylib`vtable for __cxxabiv1::__si_class_type_info + 16
    frame #1: 0x0000000103b1f4c8 libarrow.300.0.0.dylib`arrow::ArrayBuilder::Finish(std::__1::shared_ptr<arrow::Array>*) + 40
    frame #2: 0x0000000103a0c492 pybindtest.cpython-38-darwin.so`generate_table() + 642
    frame #3: 0x0000000103a0e298 pybindtest.cpython-38-darwin.so`python_bind_test() + 24
    frame #4: 0x0000000103a4425f pybindtest.cpython-38-darwin.so`void pybind11::detail::argument_loader<>::call_impl<void, void (*&)(), pybind11::detail::void_type>(void (*&)(), pybind11::detail::index_sequence<>, pybind11::detail::void_type&&) && + 31
    frame #5: 0x0000000103a44136 pybindtest.cpython-38-darwin.so`std::__1::enable_if<std::is_void<void>::value, pybind11::detail::void_type>::type pybind11::detail::argument_loader<>::call<void, pybind11::detail::void_type, void (*&)()>(void (*&)()) && + 54
    frame #6: 0x0000000103a43ff2 pybindtest.cpython-38-darwin.so`void pybind11::cpp_function::initialize<void (*&)(), void, pybind11::name, pybind11::scope, pybind11::sibling>(void (*&)(), void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::'lambda'(pybind11::detail::function_call&)::operator()(pybind11::detail::function_call&) const + 130
    frame #7: 0x0000000103a43f55 pybindtest.cpython-38-darwin.so`void pybind11::cpp_function::initialize<void (*&)(), void, pybind11::name, pybind11::scope, pybind11::sibling>(void (*&)(), void (*)(), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&)::'lambda'(pybind11::detail::function_call&)::__invoke(pybind11::detail::function_call&) + 21
    frame #8: 0x0000000103a2cb62 pybindtest.cpython-38-darwin.so`pybind11::cpp_function::dispatcher(_object*, _object*, _object*) + 4818
    frame #9: 0x00000001035cf164 python`cfunction_call_varargs + 68
    frame #10: 0x00000001035ce3a7 python`_PyObject_MakeTpCall + 167
    frame #11: 0x0000000103713228 python`_PyEval_EvalFrameDefault + 45944
    frame #12: 0x0000000103706060 python`_PyEval_EvalCodeWithName + 560
    frame #13: 0x0000000103780a7c python`PyRun_FileExFlags + 364
    frame #14: 0x0000000103780171 python`PyRun_SimpleFileExFlags + 529
    frame #15: 0x00000001037a8c5a python`pymain_run_file + 394
    frame #16: 0x00000001037a81b6 python`pymain_run_python + 486
    frame #17: 0x00000001037a7f88 python`Py_RunMain + 24
    frame #18: 0x00000001037a9670 python`pymain_main + 32
    frame #19: 0x00000001035a1cb9 python`main + 57
    frame #20: 0x00007fff6b8b7cc9 libdyld.dylib`start + 1
    frame #21: 0x00007fff6b8b7cc9 libdyld.dylib`start + 1

Upon further investigation, this error appears to be triggered by a conflict between the arrow-cpp libraries I built from source and the pyarrow package I had installed from conda-forge. I was able to resolve it simply by pip-installing pyarrow into my conda env instead of pulling it from the conda-forge channel (which in my case also works for pyspark, since it depends on pyarrow).
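In concrete terms, the workaround was along these lines (this assumes the affected conda environment is active; nothing beyond the pyarrow package itself is stated in the original):

```shell
# Remove the conda-forge build of pyarrow from the active environment,
# then install the pip wheel in its place. --force removes only pyarrow
# itself, without uninstalling dependents such as pyspark.
conda remove --force pyarrow
pip install pyarrow
```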

While I do not know the exact cause of this incompatibility, it may be related to the current macOS caveat in the Arrow Python documentation, which states:

Using conda to build Arrow on macOS is complicated by the fact that the conda-forge compilers require an older macOS SDK. Conda offers some installation instructions; the alternative would be to use Homebrew and pip instead.