How to read multiple Parquet files or a directory using Apache Arrow in C++
I am new to the Apache Arrow C++ API. I want to read multiple Parquet files with it, similar to what the Apache Arrow Python API offers (reading them as a Table), but I have not seen any examples. I know I can read a single Parquet file using:
#include <arrow/api.h>
#include <arrow/filesystem/localfs.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>

arrow::Status st;
arrow::MemoryPool* pool = arrow::default_memory_pool();
arrow::fs::LocalFileSystem file_system;
std::shared_ptr<arrow::io::RandomAccessFile> input =
    file_system.OpenInputFile("/tmp/data.parquet").ValueOrDie();

// Open Parquet file reader
std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
st = parquet::arrow::OpenFile(input, pool, &arrow_reader);
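For reference, reading the opened file into a single table then looks like this (a minimal continuation of the snippet above, with error handling elided):

// Read the whole file into an arrow::Table, using the arrow_reader opened above
std::shared_ptr<arrow::Table> table;
st = arrow_reader->ReadTable(&table);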
Please let me know if you have any questions. Thanks in advance.
This feature is called "Datasets".
There is a fairly complete example here: https://github.com/apache/arrow/blob/apache-arrow-5.0.0/cpp/examples/arrow/dataset_parquet_scan_example.cc
The C++ documentation for the feature is at: https://arrow.apache.org/docs/cpp/dataset.html
I am working on recipes for the cookbook covering this, but I can post some snippets here. These are from a work in progress: https://github.com/westonpace/arrow-cookbook/blob/feature/basic-dataset-read/cpp/code/datasets.cc
Basically, you will want to create a filesystem and select some files:
// Create a filesystem
std::shared_ptr<arrow::fs::LocalFileSystem> fs =
    std::make_shared<arrow::fs::LocalFileSystem>();

// Create a file selector which describes which files are part of
// the dataset. This selector performs a recursive search of a base
// directory, which is typical with partitioned datasets. You can also
// create a dataset from a list of one or more paths (see the sketch
// after the dataset is created below).
arrow::fs::FileSelector selector;
selector.base_dir = directory_base;
selector.recursive = true;
Then you will want to create a dataset factory and a dataset:
// Create a file format which describes the format of the files.
// Here we specify we are reading parquet files. We could pick a different format,
// such as Arrow-IPC files or CSV files, or we could customize the parquet format with
// additional reading & parsing options.
std::shared_ptr<arrow::dataset::ParquetFileFormat> format =
    std::make_shared<arrow::dataset::ParquetFileFormat>();

// Create a partitioning factory. A partitioning factory will be used by a dataset
// factory to infer the partitioning schema from the filenames. All we need to specify
// is the flavor of partitioning which, in our case, is "hive".
//
// Alternatively, we could manually create a partitioning scheme from a schema. This is
// typically not necessary for hive partitioning as inference works well.
std::shared_ptr<arrow::dataset::PartitioningFactory> partitioning_factory =
    arrow::dataset::HivePartitioning::MakeFactory();

arrow::dataset::FileSystemFactoryOptions options;
options.partitioning = partitioning_factory;

// Create a dataset factory
ASSERT_OK_AND_ASSIGN(
    std::shared_ptr<arrow::dataset::DatasetFactory> dataset_factory,
    arrow::dataset::FileSystemDatasetFactory::Make(fs, selector, format, options));

// Create the dataset; this will scan the dataset directory to find all of the files
// and may scan some file metadata in order to determine the dataset schema.
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Dataset> dataset,
                     dataset_factory->Finish());
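As mentioned in the selector comments, you can also build the dataset from an explicit list of paths instead of a recursive directory search. A minimal sketch, reusing fs, format, and options from above (the file names here are hypothetical):

// Hypothetical file list; fs, format, and options are reused from above
std::vector<std::string> paths = {"/tmp/data1.parquet", "/tmp/data2.parquet"};
ASSERT_OK_AND_ASSIGN(
    std::shared_ptr<arrow::dataset::DatasetFactory> list_factory,
    arrow::dataset::FileSystemDatasetFactory::Make(fs, paths, format, options));
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Dataset> list_dataset,
                     list_factory->Finish());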
Finally, you will want to "scan" the dataset to get the data:
// Create a scanner. ASSERT_OK and ASSERT_OK_AND_ASSIGN are test helpers used by the
// cookbook; in application code you would check the returned Status/Result instead.
arrow::dataset::ScannerBuilder scanner_builder(dataset);
ASSERT_OK(scanner_builder.UseAsync(true));
ASSERT_OK(scanner_builder.UseThreads(true));
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::dataset::Scanner> scanner,
                     scanner_builder.Finish());

// Scan the dataset. There are a variety of other methods available on the scanner as
// well.
ASSERT_OK_AND_ASSIGN(std::shared_ptr<arrow::Table> table, scanner->ToTable());
std::cout << "Read in a table with " << table->num_rows() << " rows and "
          << table->num_columns() << " columns" << std::endl;
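If you only need some of the columns or rows, the scanner builder can push projection and filtering down into the scan. A sketch, assuming hypothetical column names a and b and a hive partitioning field named year; these calls go before scanner_builder.Finish():

// Hypothetical column names; configure before calling Finish()
ASSERT_OK(scanner_builder.Project({"a", "b"}));
ASSERT_OK(scanner_builder.Filter(arrow::compute::equal(
    arrow::compute::field_ref("year"), arrow::compute::literal(2021))));

Filters on partitioning fields such as year can skip entire files, so this is usually much cheaper than filtering the table after it has been read.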