How to read multiple Parquet files or a directory using Apache Arrow in C++
I am new to the Apache Arrow C++ API. I want to read multiple Parquet files with it, similar to what the Apache Arrow Python API offers (reading them as a Table), but I have not seen any examples. I know I can read a single Parquet file using:
#include <arrow/api.h>
#include <arrow/filesystem/localfs.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

arrow::Status st;
arrow::MemoryPool* pool = arrow::default_memory_pool();
arrow::fs::LocalFileSystem file_system;
std::shared_ptr<arrow::io::RandomAccessFile> input =
    file_system.OpenInputFile("/tmp/data.parquet").ValueOrDie();

// Open Parquet file reader
std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
st = parquet::arrow::OpenFile(input, pool, &arrow_reader);
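For reference, reading the opened file into a single table then looks like this (a minimal continuation of the snippet above, with error handling elided):

// Read the whole file into an arrow::Table, using the arrow_reader opened above
std::shared_ptr<arrow::Table> table;
st = arrow_reader->ReadTable(&table);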
Please let me know if you have any questions. Thanks in advance.
This feature is called "Datasets".
There is a fairly complete example here: https://github.com/apache/arrow/blob/apache-arrow-5.0.0/cpp/examples/arrow/dataset_parquet_scan_example.cc
The C++ documentation for the feature is at: https://arrow.apache.org/docs/cpp/dataset.html
I am working on recipes for the cookbook covering this, but I can post some snippets here. These are from a work in progress: https://github.com/westonpace/arrow-cookbook/blob/feature/basic-dataset-read/cpp/code/datasets.cc
Basically, you will want to create a filesystem and select some files:
// Create a filesystem
std::shared_ptr<arrow::fs::LocalFileSystem> fs =
    std::make_shared<arrow::fs::LocalFileSystem>();

// Create a file selector which describes which files are part of
// the dataset. This selector performs a recursive search of a base
// directory, which is typical with partitioned datasets. You can also
// create a dataset from a list of one or more paths (see the sketch
// after the dataset is created below).
arrow::fs::FileSelector selector;
selector.base_dir = directory_base;
selector.recursive = true;
Then you will want to create a dataset factory and a dataset:
// Create a file format which describes the format of the files.
// Here we specify we are reading parquet files. We could pick a different format,
// such as Arrow-IPC files or CSV files, or we could customize the parquet format with
// additional reading & parsing options.
std::shared_ptr<arrow::dataset::ParquetFileFormat> format =
    std::make_shared<arrow::dataset::ParquetFileFormat>();

// Create a partitioning factory. A partitioning factory will be used by a dataset
// factory to infer the partitioning schema from the filenames. All we need to specify
// is the flavor of partitioning which, in our case, is "hive".
//
// Alternatively, we could manually create a partitioning scheme from a schema. This is
// typically not necessary for hive partitioning as inference works well.
std::shared_ptr<arrow::dataset::PartitioningFactory> partitioning_factory =
    arrow::dataset::HivePartitioning::MakeFactory();

arrow::dataset::FileSystemFactoryOptions options;
options.partitioning = partitioning_factory;

// Create a dataset factory
ASSERT_OK_AND_ASSIGN(
    std::shared_ptr<arrow::dataset::DatasetFactory> dataset_factory,
    arrow::dataset::FileSystemDatasetFactory::Make(fs, selector, format, options));

// Create the dataset; this will scan the dataset directory to find all of the files
// and may scan some file metadata in order to determine the dataset schema.
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Dataset> dataset,
                     dataset_factory->Finish());
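As mentioned in the selector comments, you can also build the dataset from an explicit list of paths instead of a recursive directory search. A minimal sketch, reusing fs, format, and options from above (the file names here are hypothetical):

// Hypothetical file list; fs, format, and options are reused from above
std::vector<std::string> paths = {"/tmp/data1.parquet", "/tmp/data2.parquet"};
ASSERT_OK_AND_ASSIGN(
    std::shared_ptr<arrow::dataset::DatasetFactory> list_factory,
    arrow::dataset::FileSystemDatasetFactory::Make(fs, paths, format, options));
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Dataset> list_dataset,
                     list_factory->Finish());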
Finally, you will want to "scan" the dataset to get the data:
// Create a scanner. ASSERT_OK and ASSERT_OK_AND_ASSIGN are test helpers used by the
// cookbook; in application code you would check the returned Status/Result instead.
arrow::dataset::ScannerBuilder scanner_builder(dataset);
ASSERT_OK(scanner_builder.UseAsync(true));
ASSERT_OK(scanner_builder.UseThreads(true));
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Scanner> scanner,
                     scanner_builder.Finish());

// Scan the dataset. There are a variety of other methods available on the scanner as
// well.
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::Table> table, scanner->ToTable());
std::cout << "Read in a table with " << table->num_rows() << " rows and "
          << table->num_columns() << " columns" << std::endl;
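If you only need some of the columns or rows, the scanner builder can push projection and filtering down into the scan. A sketch, assuming hypothetical column names a and b and a hive partitioning field named year; these calls go before scanner_builder.Finish():

// Hypothetical column names; configure before calling Finish()
ASSERT_OK(scanner_builder.Project({"a", "b"}));
ASSERT_OK(scanner_builder.Filter(arrow::compute::equal(
    arrow::compute::field_ref("year"), arrow::compute::literal(2021))));

Filters on partitioning fields such as year can skip entire files, so this is usually much cheaper than filtering the table after it has been read.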