TensorFlow dynamic range quantization

The TensorFlow documentation on dynamic range quantization states:

At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.

It also says that with dynamic range quantization the activations are always stored in float32, but they are converted to 8-bit integers while being processed and converted back to floating point once processing is done.
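For context, this is the behavior produced by the standard post-training dynamic range quantization flow; a minimal sketch of enabling it is below (the saved-model path and output filename are placeholders):

```python
import tensorflow as tf

# Minimal sketch: convert a SavedModel with dynamic range quantization.
# "saved_model_dir" is a placeholder; point it at a real SavedModel.
converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # quantizes weights to 8 bits
tflite_model = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```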

I'm confused: if the weights are converted to float32 at inference time, how is the model actually quantized?

Quoting from https://www.tensorflow.org/lite/performance/post_training_quant:

In addition, TFLite supports on the fly quantization and dequantization of activations to allow for:

- Using quantized kernels for faster implementation when available.
- Mixing of floating-point kernels with quantized kernels for different parts of the graph.

If a kernel has an optimized path that supports quantization, the floating-point activations are quantized on the fly so that the quantized weights can be applied in integer arithmetic.

Otherwise, the activations stay in float and the weights are converted back to float for inference.
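To make the quantized path concrete, here is a rough numerical sketch assuming symmetric per-tensor int8 quantization with made-up data; real TFLite kernels use per-channel weight scales and different rounding and accumulation details:

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Map a float tensor to int8 plus a single scale (assumption: symmetric per-tensor)."""
    qmax = 2 ** (num_bits - 1) - 1               # 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# "Stored" model: weights were quantized offline to int8 plus a scale.
w_float = np.random.randn(16, 8).astype(np.float32)
w_q, w_scale = quantize_symmetric(w_float)

# At inference, the float32 activation is quantized on the fly...
x_float = np.random.randn(8).astype(np.float32)
x_q, x_scale = quantize_symmetric(x_float)

# ...the matmul runs in integer arithmetic (accumulate in int32)...
acc = w_q.astype(np.int32) @ x_q.astype(np.int32)

# ...and the int32 accumulator is dequantized back to float32.
y_quantized_path = acc.astype(np.float32) * (w_scale * x_scale)

# Float fallback path for comparison: dequantized weights, float kernel.
y_float_path = (w_q.astype(np.float32) * w_scale) @ x_float

print(np.max(np.abs(y_quantized_path - y_float_path)))  # small quantization error
```

Either way the weights are stored as 8-bit integers in the model file, which is where the size reduction comes from; the two paths only differ in whether the arithmetic itself runs on integers or on dequantized floats.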