我可以确定地对任意排列的浮点数向量求和吗?

Can I deterministically sum a vector of arbitrarily arranged floating-point numbers?

假设我有一个(可能很大的)浮点数向量,它是由某个黑盒过程产生的。是否可以计算这些数字的按位可重复求和?

如果黑盒过程总是以相同的顺序生成数字,那么按位可重现的求和就很容易:只需从左到右对它们求和。

但是如果数字是随机生成的,可能是因为它们是从异步进程返回和收集的,那么就更难了:我必须对它们进行排序。

但是如果有更多的数字,可能分布在不同的机器上,以至于不希望移动它们怎么办?

还有办法对它们求和吗?

概述:多遍数据变异方法

(请参阅 以获得更好的答案。)

是的,有办法。

Demmel 和 Nguyen(2013 年)的论文“Fast Reproducible Floating-Point Summation”中描述了一种执行此操作的方法。我在下面包含了它的串行和并行实现。

这里我们将比较三种算法:

  1. 传统的从左到右求和:相对于输入顺序的排列,这是快速、不准确且不可重现的。
  2. Kahan summation:速度较慢,非常准确(输入大小基本上为 O(1)),并且,虽然在排列方面不可重现输入顺序,更接近狭义的再现性。
  3. 确定性求和:这比 Kahan 求和慢一些,可以非常准确,并且可以按位重现。

传统的求和是不准确的,因为随着总和的增长,我们添加到它的数字的最低有效位会被默默地删除。 Kahan 求和通过保持 运行ning 的“补偿”总和来克服这个问题,该总和保留在最低有效数字上。确定性求和使用类似的策略来保持准确性,但是在执行求和之前,将输入数字“预舍入”到一个公共基数,以便可以准确地计算它们的和而没有任何舍入误差。

测试

为了研究算法,我们将 运行 对 100 万个数字进行测试,每个数字均从 运行ge [-1000, 1000] 中统一抽取。为了证明算法的 运行ge 答案及其确定性,输入将被打乱并求和 100 次。并行算法 运行 使用 12 个线程。

“正确”求和(通过在 long double 模式下使用 Kahan 求和并选择最常出现的值来确定)是

310844.700699143717685046795

我要断言的一个结果(下面的代码严格证明了这一点)是,对于所研究的每种数据类型,确定性算法对 100 个数据排序中的每一个都产生相同的结果,并且这些结果对于算法的串行和并行变体。

各种算法的结果如下:

Serial Float
=========================================================
Simple summation accumulation type   = float
Average deterministic summation time = 0.00462s
Average simple summation time        = 0.00144s (3.20x faster than deterministic)
Average Kahan summation time         = 0.00290s (1.59x faster than deterministic)
Deterministic value                  = 256430592.000000000 (very different from actual)
Distinct Kahan values                = 3  (with single-precision accumulator)
Distinct Simple values               = 93 (with single-precision accumulator)
Distinct Simple values               = 1  (with double-precision accumulator)

Parallel Float
=========================================================
Simple summation accumulation type   = float
Average deterministic summation time = 0.00576s
Average simple summation time        = 0.00206s (2.79x faster than deterministic)
Average Kahan summation time         = 0.00599s (0.96x faster than deterministic)
Deterministic value                  = 256430592.000000000 (very different from actual)
Distinct Kahan values                = 3  (with single-precision accumulator)
Distinct Simple values               = 89 (with single-precision accumulator)
Distinct Simple values               = 1  (with double-precision accumulator)

Serial Double
=========================================================
Average deterministic summation time = 0.00600s
Average simple summation time        = 0.00171s (3.49x faster than deterministic)
Average Kahan summation time         = 0.00378s (1.58x faster than deterministic)
Deterministic value                  = 310844.70069914375199005 (epsilon difference from actual value)
Distinct Kahan values                = 4
Distinct Simple values               = 88

Parallel Double
=========================================================
Average deterministic summation time = 0.01074s
Average simple summation time        = 0.00289s (3.71x faster than deterministic)
Average Kahan summation time         = 0.00648s (1.65x faster than deterministic)
Deterministic value                  = 310844.70069914375199005 (epsilon difference from actual value)
Distinct Kahan values                = 2
Distinct Simple values               = 83

Serial Long Double
=========================================================
Average deterministic summation time = 0.01072s
Average simple summation time        = 0.00215s (4.96x faster than deterministic)
Average Kahan summation time         = 0.00448s (2.39x faster than deterministic)
Deterministic value                  = 310844.700699143717685046795 (no discernable difference from actual)
Distinct Kahan values                = 3
Distinct Simple values               = 94

Parallel Long Double
=========================================================
Average deterministic summation time = 0.01854s
Average simple summation time        = 0.00466s (3.97x faster than deterministic)
Average Kahan summation time         = 0.00635s (2.91x faster than deterministic)
Deterministic value                  = 310844.700699143717685046795 (no discernable difference from actual)
Distinct Kahan values                = 1
Distinct Simple values               = 82

讨论

那么,我们学到了什么?从单精度结果中我们看到,使用双倍长度累加器可以使传统的求和算法对该数据集可重现,尽管对于任意数据集几乎肯定不会出现这种情况。尽管如此,这仍然是一种在工作时提高准确性的廉价方法。

当常规求和使用的累加器与要相加的数字大小相同时,它的效果很差:我们 运行 进行了 100 次测试,得到了与常规求和几乎一样多的不同结果。

另一方面,Kahan 求和产生的不同结果非常少(因此它“更具”确定性)并且花费的时间大约是传统求和的两倍。

确定性算法比简单求和需要大约 4 倍的时间,比 Kahan 求和需要大约 1.5-3 倍的时间,但是除了按位确定性(它 无论输入数据如何排序,总是得到相同的答案。

然而,单精度浮点得到一个非常不好的答案,不像单精度常规求和和Kahan求和,结果非常接近实际回答。这是为什么?

我们一直在研究的论文确定如果输入中有 n 个数字并且我们使用 k 折叠轮(论文推荐 2,这就是我们在这里使用的),那么确定性和常规总和将具有相似的误差范围 provided

n^k * e^(k-1) << 1

其中 e 是数据类型的浮点小数。这些 epsilon 值是:

Single-precision epsilon      = 0.000000119
Double-precision epsilon      = 0.00000000000000022
Long double-precision epsilon = 0.000000000000000000108

将这些连同 n=1M 代入等式,我们得到:

Single condition      = 119000
Double condition      = 0.00022
Long double condition = 0.000000108

因此我们看到,对于这种大小的输入,单精度预计会表现不佳。

另一点需要注意的是,尽管 long double 在我的机器上占用 16 个字节,但这只是为了对齐目的:用于计算的真实长度仅为 80 位。因此,两个 long double 将在数值上进行相等的比较,但它们未使用的位是任意的。对于真正的按位再现性,需要确定未使用的位并将其设置为已知值。

代码

//Compile with:
//g++ -g -O3 repro_vector.cpp -fopenmp

//NOTE: Always comile with `-g`. It doesn't slow down your code, but does make
//it debuggable and improves ease of profiling

#include <algorithm>
#include <bitset>               //Used for showing bitwise representations
#include <cfenv>                //Used for setting floating-point rounding modes
#include <chrono>               //Used for timing algorithms
#include <climits>              //Used for showing bitwise representations
#include <iostream>
#include <omp.h>                //OpenMP
#include <random>
#include <stdexcept>
#include <string>
#include <typeinfo>
#include <unordered_map>
#include <vector>

constexpr int ROUNDING_MODE = FE_UPWARD;
constexpr int N = 1'000'000;
constexpr int TESTS = 100;

// Simple timer class for tracking cumulative run time of the different
// algorithms
struct Timer {
  double total = 0;
  std::chrono::high_resolution_clock::time_point start_time;

  Timer() = default;

  void start() {
    start_time = std::chrono::high_resolution_clock::now();
  }

  void stop() {
    const auto now = std::chrono::high_resolution_clock::now();
    const auto time_span = std::chrono::duration_cast<std::chrono::duration<double>>(now - start_time);
    total += time_span.count();
  }
};

//Simple class to enable directed rounding in floating-point math and to reset
//the rounding mode afterwards, when it goes out of scope
struct SetRoundingMode {
  const int old_rounding_mode;

  SetRoundingMode(const int mode) : old_rounding_mode(fegetround()) {
    if(std::fesetround(mode)!=0){
      throw std::runtime_error("Failed to set directed rounding mode!");
    }
  }

  ~SetRoundingMode(){
    if(std::fesetround(old_rounding_mode)!=0){
      throw std::runtime_error("Failed to reset rounding mode to original value!");
    }
  }

  static std::string get_rounding_mode_string() {
    switch (fegetround()) {
      case FE_DOWNWARD:   return "downward";
      case FE_TONEAREST:  return "to-nearest";
      case FE_TOWARDZERO: return "toward-zero";
      case FE_UPWARD:     return "upward";
      default:            return "unknown";
    }
  }
};

// Used to make showing bitwise representations somewhat more intuitive
template<class T>
struct binrep {
  const T val;
  binrep(const T val0) : val(val0) {}
};

// Display the bitwise representation
template<class T>
std::ostream& operator<<(std::ostream& out, const binrep<T> a){
  const char* beg = reinterpret_cast<const char*>(&a.val);
  const char *const end = beg + sizeof(a.val);
  while(beg != end){
    out << std::bitset<CHAR_BIT>(*beg++);
    if(beg < end)
      out << ' ';
  }
  return out;
}



//Simple serial summation algorithm with an accumulation type we can specify
//to more fully explore its behaviour
template<class FloatType, class SimpleAccumType>
FloatType serial_simple_summation(const std::vector<FloatType> &vec){
  SimpleAccumType sum = 0;
  for(const auto &x: vec){
    sum += x;
  }
  return sum;
}

//Parallel variant of the simple summation algorithm above
template<class FloatType, class SimpleAccumType>
FloatType parallel_simple_summation(const std::vector<FloatType> &vec){
  SimpleAccumType sum = 0;
  #pragma omp parallel for default(none) reduction(+:sum) shared(vec)
  for(size_t i=0;i<vec.size();i++){
    sum += vec[i];
  }
  return sum;
}



//Kahan's compensated summation algorithm for accurately calculating sums of
//many numbers with O(1) error
template<class FloatType>
FloatType serial_kahan_summation(const std::vector<FloatType> &vec){
  FloatType sum = 0.0f;
  FloatType c = 0.0f;
  for (const auto &num: vec) {
    const auto y = num - c;
    const auto t = sum + y;
    c = (t - sum) - y;
    sum = t;
  }
  return sum;
}

//Parallel version of Kahan's compensated summation algorithm (could be improved
//by better accounting for the compsenation during the reduction phase)
template<class FloatType>
FloatType parallel_kahan_summation(const std::vector<FloatType> &vec){
  //Parallel phase
  std::vector<FloatType> sum(omp_get_max_threads(), 0);
  FloatType c = 0;
  #pragma omp parallel for default(none) firstprivate(c) shared(sum,vec)
  for (size_t i=0;i<vec.size();i++) {
    const auto tid = omp_get_thread_num();
    const auto y = vec[i] - c;
    const auto t = sum.at(tid) + y;
    c = (t - sum[tid]) - y;
    sum[tid] = t;
  }

  //Serial reduction phase

  //This could be more accurate if it took the remaining compensation values
  //from above into account
  FloatType total_sum = 0.0f;
  FloatType total_c = 0.0f;
  for(const auto &num: sum){
    const auto y = num - total_c;
    const auto t = total_sum + y;
    total_c = (t - total_sum) - y;
    total_sum = t;
  }

  return total_sum;
}



// Error-free vector transformation. Algorithm 4 from Demmel and Nguyen (2013)
template<class FloatType>
FloatType ExtractVectorNew2(
  const FloatType M,
  const typename std::vector<FloatType>::iterator &begin,
  const typename std::vector<FloatType>::iterator &end
){
  // Should use the directed rounding mode of the parent thread

  auto Mold = M;
  for(auto v=begin;v!=end;v++){
    auto Mnew = Mold + (*v);
    auto q = Mnew - Mold;
    (*v) -= q;
    Mold = Mnew;
  }

  //This is the exact sum of high order parts q_i
  //v is now the vector of low order parts r_i
  return Mold - M;
}

template<class FloatType>
FloatType mf_from_deltaf(const FloatType delta_f){
  const int power = std::ceil(std::log2(delta_f));
  return static_cast<FloatType>(3.0) * std::pow(2, power);
}

//Implements the error bound discussed near Equation 6 of
//Demmel and Nguyen (2013).
template<class FloatType>
bool is_error_bound_appropriate(const size_t N, const int k){
  const auto eps = std::numeric_limits<FloatType>::epsilon();
  const auto ratio = std::pow(N, k) * std::pow(eps, k-1);
  //If ratio << 1, then the conventional non-reproducible sum and the
  //deterministic sum will have error bounds of the same order. We arbitrarily
  //choose 1e-4 to represent this
  return ratio < 1e-3;
}

//Serial bitwise deterministic summation.
//Algorithm 8 from Demmel and Nguyen (2013).
template<class FloatType>
FloatType serial_bitwise_deterministic_summation(
  std::vector<FloatType> vec,  // Note that we're making a copy!
  const int k
){
  constexpr FloatType eps = std::numeric_limits<FloatType>::epsilon();
  const auto n = vec.size();
  const auto adr = SetRoundingMode(ROUNDING_MODE);

  if(n==0){
    return 0;
  }

  if(!is_error_bound_appropriate<FloatType>(vec.size(), k)){
    std::cout<<"WARNING! Error bounds of deterministic sum are large relative to conventional summation!"<<std::endl;
  }

  FloatType m = std::abs(vec.front());
  for(const auto &x: vec){
    m = std::max(m, std::abs(x));
  }

  FloatType delta_f = n * m / (1 - 4 * (n + 1) * eps);
  FloatType Mf = mf_from_deltaf(delta_f);

  std::vector<FloatType> Tf(k);
  for(int f=0;f<k-1;f++){
    Tf[f] = ExtractVectorNew2<FloatType>(Mf, vec.begin(), vec.end());
    delta_f = n * (4 * eps * Mf / 3) / (1 - 4 * (n + 1) * eps);
    Mf = mf_from_deltaf(delta_f);
  }

  FloatType M = Mf;
  for(const FloatType &v: vec){
    M += v;
  }
  Tf[k-1] = M - Mf;

  FloatType T = 0;
  for(const FloatType &tf: Tf){
    T += tf;
  }

  return T;
}

//Parallel bitwise deterministic summation.
//Algorithm 9 from Demmel and Nguyen (2013).
template<class FloatType>
FloatType parallel_bitwise_deterministic_summation(
  std::vector<FloatType> vec,  // Note that we're making a copy!
  const int k
){
  constexpr FloatType eps = std::numeric_limits<FloatType>::epsilon();
  const auto n = vec.size();
  const auto adr = SetRoundingMode(ROUNDING_MODE);

  if(n==0){
    return 0;
  }

  if(!is_error_bound_appropriate<FloatType>(vec.size(), k)){
    std::cout<<"WARNING! Error bounds of deterministic sum are large relative to conventional summation!"<<std::endl;
  }

  std::vector<FloatType> Tf(k);

  // Note that this reduction would require communication if we had
  // implemented this to work on multiple nodes
  FloatType m = std::abs(vec.front());
  #pragma omp parallel for default(none) reduction(max:m) shared(vec)
  for(size_t i=0;i<vec.size();i++){
    m = std::max(m, std::abs(vec[i]));
  }

  // Note that this reduction would require communication if we had
  // implemented this to work on multiple nodes
  #pragma omp declare reduction(vec_plus : std::vector<FloatType> : \
    std::transform(omp_out.begin(), omp_out.end(), omp_in.begin(), omp_out.begin(), std::plus<FloatType>())) \
    initializer(omp_priv = decltype(omp_orig)(omp_orig.size()))

  #pragma omp parallel default(none) reduction(vec_plus:Tf) shared(k,m,n,vec,std::cout)
  {
    const auto adr = SetRoundingMode(ROUNDING_MODE);
    const auto threads = omp_get_num_threads();
    const auto tid = omp_get_thread_num();
    const auto values_per_thread = n / threads;
    const auto nlow = tid * values_per_thread;
    const auto nhigh = (tid<threads-1) ? ((tid+1) * values_per_thread) : n;

    FloatType delta_f = n * m / (1 - 4 * (n + 1) * eps);
    FloatType Mf = mf_from_deltaf(delta_f);

    for(int f=0;f<k-1;f++){
      Tf[f] = ExtractVectorNew2<FloatType>(Mf, vec.begin() + nlow, vec.begin() + nhigh);
      delta_f = n * (4 * eps * Mf / 3) / (1 - 4 * (n + 1) * eps);
      Mf = mf_from_deltaf(delta_f);
    }

    FloatType M = Mf;
    for(size_t i=nlow;i<nhigh;i++){
      M += vec[i];
    }
    Tf[k-1] = M - Mf;
  }

  FloatType T = 0;
  for(const FloatType &tf: Tf){
    T += tf;
  }

  return T;
}



//Convenience wrappers
template<bool Parallel, class FloatType>
FloatType bitwise_deterministic_summation(
  const std::vector<FloatType> &vec,  // Note that we're making a copy!
  const int k
){
  if(Parallel){
    return parallel_bitwise_deterministic_summation<FloatType>(vec, k);
  } else {
    return serial_bitwise_deterministic_summation<FloatType>(vec, k);
  }
}

template<bool Parallel, class FloatType, class SimpleAccumType>
FloatType simple_summation(const std::vector<FloatType> &vec){
  if(Parallel){
    return parallel_simple_summation<FloatType, SimpleAccumType>(vec);
  } else {
    return serial_simple_summation<FloatType, SimpleAccumType>(vec);
  }
}

template<bool Parallel, class FloatType>
FloatType kahan_summation(const std::vector<FloatType> &vec){
  if(Parallel){
    return serial_kahan_summation<FloatType>(vec);
  } else {
    return parallel_kahan_summation<FloatType>(vec);
  }
}



// Timing tests for the summation algorithms
template<bool Parallel, class FloatType, class SimpleAccumType>
FloatType PerformTestsOnData(
  const int TESTS,
  std::vector<FloatType> floats, //Make a copy so we use the same data for each test
  std::mt19937 gen               //Make a copy so we use the same data for each test
){
  Timer time_deterministic;
  Timer time_kahan;
  Timer time_simple;

  //Very precise output
  std::cout.precision(std::numeric_limits<FloatType>::max_digits10);
  std::cout<<std::fixed;

  std::cout<<"Parallel? "<<Parallel<<std::endl;
  if(Parallel){
    std::cout<<"Max threads = "<<omp_get_max_threads()<<std::endl;
  }
  std::cout<<"Floating type                        = "<<typeid(FloatType).name()<<std::endl;
  std::cout<<"Floating type epsilon                = "<<std::numeric_limits<FloatType>::epsilon()<<std::endl;
  std::cout<<"Simple summation accumulation type   = "<<typeid(SimpleAccumType).name()<<std::endl;
  std::cout<<"Number of tests                      = "<<TESTS<<std::endl;
  std::cout<<"Input sample = "<<std::endl;
  for(size_t i=0;i<10;i++){
    std::cout<<"\t"<<floats[i]<<std::endl;
  }

  //Get a reference value
  std::unordered_map<FloatType, uint32_t> simple_sums;
  std::unordered_map<FloatType, uint32_t> kahan_sums;
  const auto ref_val = bitwise_deterministic_summation<Parallel, FloatType>(floats, 2);
  for(int test=0;test<TESTS;test++){
    std::shuffle(floats.begin(), floats.end(), gen);

    time_deterministic.start();
    const auto my_val = bitwise_deterministic_summation<Parallel, FloatType>(floats, 2);
    time_deterministic.stop();
    if(ref_val!=my_val){
      std::cout<<"ERROR: UNEQUAL VALUES ON TEST #"<<test<<"!"<<std::endl;
      std::cout<<"Reference      = "<<ref_val                   <<std::endl;
      std::cout<<"Current        = "<<my_val                    <<std::endl;
      std::cout<<"Reference bits = "<<binrep<FloatType>(ref_val)<<std::endl;
      std::cout<<"Current   bits = "<<binrep<FloatType>(my_val) <<std::endl;
      throw std::runtime_error("Values were not equal!");
    }

    time_kahan.start();
    const auto kahan_sum = kahan_summation<Parallel, FloatType>(floats);
    kahan_sums[kahan_sum]++;
    time_kahan.stop();

    time_simple.start();
    const auto simple_sum = simple_summation<Parallel, FloatType, SimpleAccumType>(floats);
    simple_sums[simple_sum]++;
    time_simple.stop();
  }

  std::cout<<"Average deterministic summation time = "<<(time_deterministic.total/TESTS)<<std::endl;
  std::cout<<"Average simple summation time        = "<<(time_simple.total/TESTS)<<std::endl;
  std::cout<<"Average Kahan summation time         = "<<(time_kahan.total/TESTS)<<std::endl;
  std::cout<<"Ratio Deterministic to Simple        = "<<(time_deterministic.total/time_simple.total)<<std::endl;
  std::cout<<"Ratio Deterministic to Kahan         = "<<(time_deterministic.total/time_kahan.total)<<std::endl;

  std::cout<<"Reference value                      = "<<std::fixed<<ref_val<<std::endl;
  std::cout<<"Reference bits                       = "<<binrep<FloatType>(ref_val)<<std::endl;

  std::cout<<"Distinct Kahan values                = "<<kahan_sums.size()<<std::endl;
  std::cout<<"Distinct Simple values               = "<<simple_sums.size()<<std::endl;

  int count = 0;
  for(const auto &kv: kahan_sums){
    std::cout<<"\tKahan sum values (N="<<std::fixed<<kv.second<<") "<<kv.first<<" ("<<binrep<FloatType>(kv.first)<<")"<<std::endl;
    if(count++==10){
      break;
    }
  }

  count = 0;
  for(const auto &kv: simple_sums){
    std::cout<<"\tSimple sum values (N="<<std::fixed<<kv.second<<") "<<kv.first<<" ("<<binrep<FloatType>(kv.first)<<")"<<std::endl;
    if(count++==10){
      break;
    }
  }

  std::cout<<std::endl;

  return ref_val;
}

// Use this to make sure the tests are reproducible
template<class FloatType, class SimpleAccumType>
void PerformTests(
  const int TESTS,
  const std::vector<long double> &long_floats,
  std::mt19937 &gen
){
  std::vector<FloatType> floats(long_floats.begin(), long_floats.end());

  const auto serial_val = PerformTestsOnData<false, FloatType, SimpleAccumType>(TESTS, floats, gen);
  const auto parallel_val = PerformTestsOnData<true, FloatType, SimpleAccumType>(TESTS, floats, gen);

  //Note that the `long double` type may only use 12-16 bytes (to maintain
  //alignment), but only 80 bits, resulting in bitwise indeterminism in the last
  //few bits; however, the floating-point values themselves will be equal.
  std::cout<<"########################################"<<std::endl;
  std::cout<<"### Serial and Parallel values match for "
           <<typeid(FloatType).name()
           <<"? "
           <<(serial_val==parallel_val)
           <<std::endl;
  std::cout<<"########################################\n"<<std::endl;
}



int main(){
  std::random_device rd;
  // std::mt19937 gen(rd());   //Enable for randomness
  std::mt19937 gen(123456789); //Enable for reproducibility
  std::uniform_real_distribution<long double> distr(-1000, 1000);
  std::vector<long double> long_floats;
  for(int i=0;i<N;i++){
    long_floats.push_back(distr(gen));
  }

  PerformTests<double, double>(TESTS, long_floats, gen);
  PerformTests<long double, long double>(TESTS, long_floats, gen);
  PerformTests<float, float>(TESTS, long_floats, gen);
  PerformTests<float, double>(TESTS, long_floats, gen);

  return 0;
}

为 Eric Postpischil 编辑

埃里克说:

A generator such as I described will produce numbers with largely the same quantum—all near multiples of the divisor. This may not include a wide variety of exponents in the samples, so it may not well model a distribution where the numbers are being rounded in the interior of the significand when being added. E.g., if we add many numbers of the form 123.45, they will add fine for a while, although rounding errors will grow as the partial sums accumulate. But if we add numbers of forms 12345, 123.45, and 1.2345, different errors occur earlier

将 123.45 的 1M 值相加得到 123450000.000000002837623469532 的单个 long double Kahan 值。确定性 long double 值与此不同 -0.00000000000727595761,而确定性 double 值与 -0.00000001206353772047 不同,而确定性 float 与 -3.68% 不同(正如预期的那样,给定其较大的 epsilon)。

从集合 {1.2345, 12.345, 123.45, 1234.5, 12345} 中选择 1M 个值 运行domly 给出两个 long double Kahan 值:A=2749592287.563000000780448317528 (N=54) 和 B=2749592287.563000000547617673874 (N=46) .确定性 long double 值与 (A) 完全匹配;确定性双精度值与 (A) 的不同之处在于 -0.00000020139850676247;确定性浮点值与 (A) 的不同之处在于 -257%(同样是预期的)。

TL/DR:有一种简单而系统的方法,尽管它不是一种有效的方法:将所有内容转换为定点并获得精确值。由于只有最后一步转回float包含四舍五入,所以整个过程与被加数的顺序无关

这个想法取自The Exact Dot Product as Basic 长区间运算工具.

观察到所有浮点数也可以表示为适当宽度的定点数。如果添加一些额外的整数位来解决溢出问题,则可以精确表示定点数的乘法和加法。

例如,一个 64 位二进制 IEE754 浮点数可以表示为 2131 位定点数,小数点前有 1056 位,小数点后有 1075 位(参见 Table 1 in link 其他 FP格式,但请注意,引用的数字是两个浮点数的乘积的 2 倍太高了)。

为溢出添加 64 位(假设一次最多可以添加 2^64 个数字)我们得到 2195 位或 275 字节的定点宽度。就 64 位无符号整数而言,这是小数点前 18 位和小数点后 17 位。

float32只需要小数点前3位和小数点后3位,或者,如果一个肢体之间允许一个小数点,总共只需要5位。

这种算法在许多语言中都可以直接实现。通常不需要动态内存管理,并且当可以可靠地检测到无符号整数溢出时(例如在 C 和 C++ 中),不需要汇编语言来高效实现。

该原理可以应用于需要获得正确舍入的更复杂的浮点数问题。它不是很有效但易于实现并且相当稳健(在可测试性的意义上)。

编辑:更正了数字,因为它们是 2 的因数太高了。还添加了 float32 大小写(来自 njuffa 的建议)

概述:单程数据保存方法

是的!

Ahrens、Demmel 和 Nguyen(2020 年)的论文“Algorithms for Efficient Reproducible Floating Point Summation”中描述了执行此操作的方法。我在下面包含了它的一个实现。

这里我们将比较三种算法:

  1. 传统的从左到右求和:相对于输入顺序的排列,这是快速、不准确且不可重现的。
  2. Kahan summation:速度较慢,非常准确(输入大小基本上为 O(1)),并且,虽然在排列方面不可重现输入顺序,更接近狭义的再现性。
  3. 可重现求和:这比 Kahan 求和慢一些,可以非常准确,并且按位可重现。

传统的求和是不准确的,因为随着总和的增长,我们添加到它的数字的最低有效位会被悄无声息地丢弃。 Kahan 求和通过保持 运行ning 的“补偿”总和来克服这个问题,该总和保留在最低有效数字上。可重现的求和使用类似的策略来保持准确性,但保持与不同补偿水平相对应的多个箱。这些 bin 还适当地 运行 区分输入数据的重要性以获得准确、可重复的总和。

第一次测试

为了研究算法,我们将 运行 对 100 万个数字进行测试,每个数字均从 运行ge [-1000, 1000] 中统一抽取。为了证明算法的 运行ge 答案及其可重复性,输入将被打乱并求和 100 次。

我要断言的结果(下面的代码严格证明了这一点)是,对于所研究的每种数据类型,确定性算法对 100 种数据排序中的每一种都产生相同的结果。

我的实现包括一个 class,其中传递给它的可重现累加器值,但是有几种方法可以传递输入数据。我们将对其中三个进行试验:

  • 1ata 测试一种模式,在这种模式下,每次将一个数字传递给累加器,而无需事先了解输入的最大幅度
  • many 测试将许多数字传递给累加器但事先不知道输入的最大幅度的模式
  • manyc 测试一种模式,在这种模式下,许多数字被传递到累加器,并且我们事先知道输入的最大幅度

各种算法的结果如下:

Single-Precision Floats
==================================================================
Average Reproducible sum 1ata time   = 0.0115s (11.5x simple; 2.6x Kahan)
Average Reproducible sum many time   = 0.0047s (4.7x simple; 1.1x Kahan)
Average Reproducible sum manyc time  = 0.0036s (3.6x simple; 0.8x Kahan)
Average Kahan summation time         = 0.0044s
Average simple summation time        = 0.0010s
Reproducible Value                   = 767376.500000000 (0% diff vs Kahan)
Kahan long double accumulator value  = 767376.500000000
Error bound on RV                    = 15.542840004
Distinct Kahan (single accum) values = 1
Distinct Simple values               = 92


Double-Precision Floats
==================================================================
Average Reproducible sum 1ata time   = 0.0112s (10.5x simple; 2.6x Kahan)
Average Reproducible sum many time   = 0.0051s (4.7x simple; 1.2x Kahan)
Average Reproducible sum manyc time  = 0.0037s (3.4x simple; 0.8x Kahan)
Average simple summation time        = 0.0011s
Average Kahan summation time         = 0.0044s
Reproducible Value                   = 310844.70069914369378239 (0% diff vs Kahan)
Kahan long double accumulator value  = 310844.70069914369378239
Error bound on RV                    = 0.00000000048315059
Distinct Kahan (double accum) values = 2
Distinct Simple values               = 96

第二次测试

让我们尝试比制服更具挑战性的测试 运行dom!相反,让我们在 运行ge [0, 2π) 中生成 100 万个数据点,并使用它们生成正弦波的 y 值。和以前一样,然后我们将这些数据点打乱 100 次,看看我们得到什么样的输出。时序值与上面的非常相似,所以我们将它们排除在外。这给了我们:

Single-Precision Floats
==================================================================
Reproducible Value                   = 0.000000000
Kahan long double accumulator value  = -0.000000000
Error bound                          = 14.901161194
Distinct Kahan (single) values       = 96
Distinct Simple values               = 100
Kahan lower value                    = -0.000024229
Kahan upper value                    = 0.000025988
Simple lower value                   = -0.017674208
Simple upper value                   = 0.018800795
Kahan range                          = 0.000050217
Simple range                         = 0.036475003
Simple range / Kahan range           = 726

Double-Precision Floats
==================================================================
Reproducible Value                   = 0.00000000000002185
Kahan long double accumulator value  = 0.00000000000002185
Error bound                          = 0.00000000000000083
Distinct Kahan (double) values       = 93
Distinct Simple values               = 100
Kahan lower value                    = 0.00000000000000017
Kahan upper value                    = 0.00000000000006306
Simple lower value                   = -0.00000000004814216
Simple upper value                   = 0.00000000002462563
Kahan range                          = 0.00000000000006289
Simple range                         = 0.00000000007276779
Simple range / Kahan range           = 1157

在这里我们看到可重现值与我们用作真实来源的 long-double Kahan 求和值完全匹配。

与我们之前的结果相比,更简单的测试有许多不同的 Kahan 求和值(其中 Kahan 求和使用与数据相同的浮点类型);使用更困难的数据集已经破坏了我们在 Kahan 求和中观察到的更简单测试中的“部分确定性”,即使 Kahan 求和产生的 运行ge 值比通过简单求和获得的值小几个数量级。

因此,在这种情况下,可重现求和比 Kahan 求和具有更高的准确性,并产生单个按位可重现的结果,而不是大约 100 个不同的结果。

费用

时间。 可重现的求和比简单求和花费的时间长约 3.5 倍,与 Kahan 求和的时间大致相同。

内存。 使用 3 倍合并需要累加器 6*sizeof(floating_type) space。即单精度24字节,双精度48字节

此外,需要静态查找 table,它可以在给定数据类型和折叠的所有累加器之间共享。对于 3 倍分箱,这个 table 是 41 个浮点数(164 字节)或 103 个双精度数(824 字节)。

讨论

那么,我们学到了什么?

标准求和给出了 运行ge 个结果。我们 运行 进行了 100 次测试,并获得了 96 个双精度类型的独特结果。这显然是不可重现的,并且很好地证明了我们正在努力解决的问题。

另一方面,Kahan 求和产生的不同结果非常少(因此它“更具”确定性)并且大约是传统求和的 4 倍。

如果我们对要添加的数字一无所知,则可重现算法比简单求和的时间长约 10 倍;但是,如果我们知道被加数的最大幅度,那么可重现算法只需要比简单求和长约 3.5 倍的时间。

与 Kahan 求和相比,可重现算法有时需要 更少 时间 (!),同时确保我们得到按位相同的答案。此外,与 中详述的方法不同,可重现的结果与我们简单测试中单精度和双精度的 Kahan 求和结果相匹配。我们还得到了强有力的保证运行证明可重现和 Kahan 结果的相对误差非常低。

什么折叠级别最好? 3,根据下图。

Fold-1 非常糟糕。 对于此测试,2 为双精度提供了良好的精度,但为浮点提供了 10% 的相对误差。 Fold-3 为 double 和 float 提供了良好的准确性,但时间增加很小。因此,我们只希望 fold-2 一次一次可重复地对双打求和。高于 Fold-3 只会带来边际收益。

代码

完整代码太长,无法包含在此处;但是,您可以找到 C++ 版本 here; and ReproBLAS here.