C++ Apache Orc is not filtering data correctly

Question

I am posting a simple C++ Apache ORC file reading program that:

Reads data from an ORC file.

Filters the data based on a given string.

Sample code:

#include <iostream>

#include <list>
#include <memory>
#include <chrono>

// Orc specific headers.
#include <orc/Reader.hh>
#include <orc/ColumnPrinter.hh>
#include <orc/Exceptions.hh>
#include <orc/OrcFile.hh>

int main(int argc, char const *argv[])
{
    auto begin = std::chrono::steady_clock::now();

    orc::RowReaderOptions m_RowReaderOpts;
    orc::ReaderOptions m_ReaderOpts;

    std::unique_ptr<orc::Reader> m_Reader;
    std::unique_ptr<orc::RowReader> m_RowReader;

    auto builder = orc::SearchArgumentFactory::newBuilder();
    std::string required_symbol("FILTERME");

    /// THIS LINE SHOULD FILTER DATA BASED ON THE COLUMN.
    /// INSTEAD OF FILTERING, IT TRAVERSES EVERY ROW OF THE ORC FILE.
    builder->equals("column_name", orc::PredicateDataType::STRING, orc::Literal(required_symbol.c_str(), required_symbol.size()));

    std::string file_path("/orc/file/path.orc");
    
    m_Reader = orc::createReader(orc::readFile(file_path.c_str()), m_ReaderOpts);
    m_RowReader = m_Reader->createRowReader(m_RowReaderOpts);
    m_RowReaderOpts.searchArgument(builder->build());
    
    auto batch = m_RowReader->createRowBatch(5000);

    try
    {
        
        std::cout << builder->build()->toString() << std::endl;
        while(m_RowReader->next(*batch))
        {
            const auto &struct_batch = dynamic_cast<const orc::StructVectorBatch&>(*batch.get());
            /** DO CALCULATIONS */
        }
        
    }
    catch(const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }

    auto end = std::chrono::steady_clock::now();
    std::cout << "Total Time taken to read ORC file: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << " ms.\n";

    return 0;
}

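I searched Google for almost a week and tried converting every Java program I could find to C++ to get my code working.
I also tried an example that addresses a similar problem, but it did not work for me.

**Questions:**
1. Am I wiring up the filter code correctly? If so, why does it not filter the data based on the given string?
2. Where can I find C++ or "relevant Java code" for row-level or stripe-level filters?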

Answer 1

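After many attempts, I finally solved the ORC data filtering problem described above.
It was caused by using an incorrect column number. I am not sure why the column ID used to fetch a column differs from the column ID used to filter on it.
In the example above I tried to filter by column name, and filtering ORC data by column name still does not work for me. Filtering by column number, however, works fine.

New code: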

#include <iostream>

#include <list>
#include <memory>
#include <chrono>

// Orc specific headers.
#include <orc/Reader.hh>
#include <orc/ColumnPrinter.hh>
#include <orc/Exceptions.hh>
#include <orc/OrcFile.hh>

int main(int argc, char const *argv[])
{
    auto begin = std::chrono::steady_clock::now();

    orc::RowReaderOptions m_RowReaderOpts;
    orc::ReaderOptions m_ReaderOpts;

    std::unique_ptr<orc::Reader> m_Reader;
    std::unique_ptr<orc::RowReader> m_RowReader;

    auto builder = orc::SearchArgumentFactory::newBuilder();
    std::string required_symbol("FILTERME");

    // <-- HERE COLUMN IDS ARE STARTING FROM 0-N. -->
    std::list<uint64_t> cols = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    m_RowReaderOpts.include(cols);

    int column_id = 7; // THIS IS COLUMN 6 IN cols ABOVE: THE COLUMN ID USED FOR FILTERING IS THE COLUMN ID USED FOR FETCHING + 1.
    builder->equals(column_id, orc::PredicateDataType::STRING, orc::Literal(required_symbol.c_str(), required_symbol.size()));

    std::string file_path("/orc/file/path.orc");
    
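    m_Reader = orc::createReader(orc::readFile(file_path.c_str()), m_ReaderOpts);
    // Set the search argument before creating the row reader, so that the reader is created with the filter.
    m_RowReaderOpts.searchArgument(builder->build());
    m_RowReader = m_Reader->createRowReader(m_RowReaderOpts);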
    
    auto batch = m_RowReader->createRowBatch(5000);

    try
    {
        
        std::cout << builder->build()->toString() << std::endl;
        while(m_RowReader->next(*batch))
        {
            const auto &struct_batch = dynamic_cast<const orc::StructVectorBatch&>(*batch.get());
            /** DO CALCULATIONS */
        }
        
    }
    catch(const std::exception& e)
    {
        std::cerr << e.what() << '\n';
    }

    auto end = std::chrono::steady_clock::now();
    std::cout << "Total Time taken to read ORC file: " << std::chrono::duration_cast<std::chrono::milliseconds>(end - begin).count() << " ms.\n";

    return 0;
}

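As far as I understood while solving this problem, the column IDs used for fetching data run from 0 to N, while the column IDs used for filtering run from 1 to N. That is why, when you need to filter on column 0, you should provide 1.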

Answer 2

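To clear up the confusion in the explanation above:

In ORC, column field IDs are different from column type IDs:

For files that have a struct as the top-level object, field ID 0 corresponds to the first struct field, field ID 1 to the second struct field, and so on. See the comment here: https://github.com/apache/orc/blob/v1.7.3/c++/include/orc/Reader.hh#L122-L123
The column type ID is the pre-order traversal index of the type tree. As the spec puts it: the type tree is flattened into a list via a pre-order traversal, where each type is assigned the next ID. Clearly, the root of the type tree is always type ID 0.

So, if the ORC file contains no nested types (struct/array/map), columnTypeId == columnFieldId + 1 holds for every column except the root struct type.

The IDs used when building sargs are column type IDs. However, the IDs used in RowReaderOptions::include(const std::list<uint64_t>& include) are column field IDs. To get a consistent ID mapping, I suggest using the include method that takes type IDs: RowReaderOptions::includeTypes(const std::list<uint64_t>& types);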
