TensorFlow dynamic range quantization

The TensorFlow documentation on dynamic range quantization states:

At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.

It also says that with dynamic range quantization the activations are always stored in float32, but they are converted to 8-bit integers while being processed and converted back to floating point once processing is done.
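For context, this is the behavior produced by the standard post-training dynamic range quantization flow; a minimal sketch of enabling it is below (the saved-model path and output filename are placeholders):

```python
import tensorflow as tf

# Minimal sketch: convert a SavedModel with dynamic range quantization.
# "saved_model_dir" is a placeholder; point it at a real SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantizes weights to 8 bits
tflite_model = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```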

I'm confused: if the weights are converted to float32 at inference time, how is the model actually quantized?

Quoting from https://www.tensorflow.org/lite/performance/post_training_quant:

In addition, TFLite supports on the fly quantization and dequantization of activations to allow for:

- Using quantized kernels for faster implementation when available.
- Mixing of floating-point kernels with quantized kernels for different parts of the graph.

If a kernel has an optimized path that supports quantization, the floating-point activations are quantized on the fly so that the quantized weights can be applied in integer arithmetic.

Otherwise, the activations stay in float and the weights are converted back to float for inference.
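To make the quantized path concrete, here is a rough numerical sketch assuming symmetric per-tensor int8 quantization with made-up data; real TFLite kernels use per-channel weight scales and different rounding and accumulation details:

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Map a float tensor to int8 plus a single scale (assumption: symmetric per-tensor)."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# "Stored" model: weights were quantized offline to int8 plus a scale.
w_float = np.random.randn(16, 8).astype(np.float32)
w_q, w_scale = quantize_symmetric(w_float)

# At inference, the float32 activation is quantized on the fly...
x_float = np.random.randn(8).astype(np.float32)
x_q, x_scale = quantize_symmetric(x_float)

# ...the matmul runs in integer arithmetic (accumulate in int32)...
acc = w_q.astype(np.int32) @ x_q.astype(np.int32)

# ...and the int32 accumulator is dequantized back to float32.
y_quantized_path = acc.astype(np.float32) * (w_scale * x_scale)

# Float fallback path for comparison: dequantized weights, float kernel.
y_float_path = (w_q.astype(np.float32) * w_scale) @ x_float

print(np.max(np.abs(y_quantized_path - y_float_path)))  # small quantization error
```

Either way the weights are stored as 8-bit integers in the model file, which is where the size reduction comes from; the two paths only differ in whether the arithmetic itself runs on integers or on dequantized floats.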