Halide: How to avoid unwanted execution overhead in Halide LUT index

Question

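The calculation of the input value for the LUT index is constant over multiple invocations, so I have precalculated the content of 'indexToLut'. However, this also means that the values in that buffer cannot be checked here. The LUT itself has only 17 elements.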

#include "Halide.h"

#define LUT_SIZE 17     /* Size in each dimension of the 4D LUT */

class ApplyLut : public Halide::Generator<ApplyLut> {
public:
    // We declare the Inputs to the Halide pipeline as public
    // member variables. They'll appear in the signature of our generated
    // function in the same order as we declare them.
  Input <  Buffer<uint8_t>> Lut              { "Lut"            , 1};  // LUT to apply
  Input <  Buffer<int>> indexToLut           { "indexToLut"     , 1};  // Precalculated mapping of uint8_t to LUT index
  Input <  Buffer<uint8_t >> inputImageLine  { "inputImageLine" , 1};  // Input line
  Output<  Buffer<uint8_t >> outputImageLine { "outputImageLine", 1};  // Output line
  void generate();
};

HALIDE_REGISTER_GENERATOR(ApplyLut, outputImageLine)

void ApplyLut::generate()
{
  Var x("x");

  outputImageLine(x) = Lut(indexToLut(inputImageLine(x)));

  inputImageLine .dim(0).set_min(0);         // Input image sample index
  outputImageLine.dim(0).set_bounds(0, inputImageLine.dim(0).extent()); // Output line matches input line
  Lut            .dim(0).set_bounds(0, LUT_SIZE);  // iccLut[...]: limited number of values
  indexToLut     .dim(0).set_bounds(0, 256);       // chan4_offset[...]: value index, 256 values
}

Other questions have already explained that this kind of issue can be solved with the 'clamp' function.

This would change the expression to

  outputImageLine(x) = Lut(clamp(indexToLut(inputImageLine(x)), 0, LUT_SIZE));

However, the generated code then shows the following expression

outputImageLine[outputImageLine.s0.x] = Lut[max(min(indexToLut[int32(inputImageLine[outputImageLine.s0.x])], 17), 0)]

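I think this means that the min/max evaluation is performed at run time. In my case it could be omitted, because I know that all values in indexToLut are limited to 0..16. Is there a way to avoid this execution overhead in such a case?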

Answer 1

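You can use unsafe_promise_clamped instead of clamp to promise that the input is bounded in the way you describe. It may not be any faster, though: min and max on integer indices are very cheap compared to the indirect load.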
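
Because unsafe_promise_clamped is not checked at run time, the promise has to hold for every entry of indexToLut before the pipeline is called. The following is a minimal host-side sketch, assuming the table is built once up front and that the valid range is 0..16 as stated in the question. The helper names indexTableIsValid and applyLutUnchecked are hypothetical and not part of Halide; the loop is only a plain C++ model of the per-pixel lookup, and the Halide expression appears in a comment for comparison.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kLutSize = 17;  // same value as LUT_SIZE in the question

// Hypothetical helper: check once, on the host, that every precomputed
// index is a valid position in the 17-entry LUT (0..16).
bool indexTableIsValid(const std::array<int, 256>& indexToLut) {
    for (int idx : indexToLut) {
        if (idx < 0 || idx >= kLutSize) return false;
    }
    return true;
}

// Plain C++ model of the per-pixel lookup with no clamp in the loop.
// It is only safe if indexTableIsValid() holds for the table, which is
// the same promise that the Halide expression
//   Lut(unsafe_promise_clamped(indexToLut(inputImageLine(x)), 0, LUT_SIZE - 1))
// relies on.
std::vector<std::uint8_t> applyLutUnchecked(
        const std::array<std::uint8_t, kLutSize>& lut,
        const std::array<int, 256>& indexToLut,
        const std::vector<std::uint8_t>& line) {
    // Debug-only precondition; compiled out when NDEBUG is defined.
    assert(indexTableIsValid(indexToLut));

    std::vector<std::uint8_t> out(line.size());
    for (std::size_t x = 0; x < line.size(); ++x) {
        out[x] = lut[indexToLut[line[x]]];  // no min/max per pixel
    }
    return out;
}
```

Validating the 256-entry table once costs far less than clamping every pixel, and it keeps the unchecked lookup from reading outside the LUT if the table is ever built incorrectly.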
