如何使用 OpenMP 并行化最近邻搜索

Question

基本上，我有一个 collection std::vector<std::pair<std::vector<float>, unsigned int>>，其中包含大小为 512 (2048 bytes) 的模板对 std::vector<float> 及其相应的标识符 unsigned int .

我正在编写一个函数，其中提供了一个模板，我需要 return collection 中最相似模板的标识符。我正在使用点积来计算相似度。

我的天真实现如下所示：

// Should return false if no match is found (ie. similarity is 0 for all templates in collection)
bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    bool found = false;
    similarity = 0.f;

    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data();
        float consinSimilarity = getSimilarity(data, candidateTemplate, length); // computes cosin sim between two vectors, implementation depends on architecture. 

        if (consinSimilarity > similarity) {
            found = true;
            similarity = consinSimilarity;
            label = collection[i].second;
        }
    }

    return found;
}

我如何使用并行化来加快速度。我的 collection 可能包含数百万个模板。我读到你可以添加 #pragma omp parallel for reduction，但我不完全确定如何使用它（如果这是最好的选择）。

另请注意：对于我的点积实现，如果基础架构支持 AVX 和 FMA，我将使用实现。由于只有有限数量的 SIMD 寄存器，这会影响我们并行化时的性能吗？

Answer 1

由于我们无法访问实际编译的示例（这本来不错），我实际上并没有尝试编译下面的示例。尽管如此，除了一些小的拼写错误（可能）之外，总体思路应该很清楚。

的任务是找到相似度的最大值和对应的标签，为此我们确实可以使用reduction，但是由于我们需要找到一个值的最大值然后存储对应的标签，我们使用一对来同时存储两个值，以便在 OpenMP 中将其实现为 reduction。

我稍微重写了您的代码，可能会使变量的原始命名 (temp) 更难阅读。基本上，我们并行执行搜索，因此每个线程都找到一个最优值，然后我们要求 OpenMP 找到线程之间的最优解 (reduction)，然后我们就完成了。

//Reduce by finding the maximum and also storing the corresponding label, this is why we use a std::pair. 
void reduce_custom (std::pair<float, unsigned int>& output, std::pair<float, unsigned int>& input) {
    if (input.first > output.first) output = input;
}
//Declare an OpenMP reduction with our pair and our custom reduction function. 
#pragma omp declare reduction(custom_reduction : \
    std::pair<float, unsigned int>: \
    reduce_custom(omp_out, omp_in)) \
    initializer(omp_priv(omp_orig))

bool identify(const float* data, unsigned int length, unsigned int& label, float& similarity) {
    std::pair<float, unsigned int> temp(0.0, label); //Stores thread local similarity and corresponding best label. 

#pragma omp parallel for reduction(custom_reduction:temp)
    for (size_t i = 0; i < collection.size(); ++i) {
        const float* candidateTemplate = collection[i].first.data(); 
        float consinSimilarity = getSimilarity(data, candidateTemplate, length);

        if (consinSimilarity > temp.first) {
            temp.first = consinSimilarity;
            temp.second = collection[i].second;
        }
    }

    if (temp.first > 0.f) {
        similarity = temp.first;
        label = temp.second;
        return true;
    }

    return false;
}

关于您对 SIMD 寄存器数量有限的担忧，它们的数量取决于您使用的特定 CPU。据我所知，每个内核都有一定数量的可用向量寄存器，因此只要您使用的向量寄存器数量不超过之前可用的数量，现在也应该没问题，此外，例如 AVX512 提供 32 个向量寄存器和 2 个向量寄存器每个内核的矢量运算的算术单位，因此计算资源不足运行并非微不足道，您更有可能因内存局部性差而受苦（特别是在您将矢量保存在各处的情况下）。我当然可能是错的，如果是这样，请随时在评论中纠正我。

如何使用 OpenMP 并行化最近邻搜索

How to parallelize nearest neighbour search using OpenMP

c++

optimization

multithreading

openmp

nearest-neighbor