前循环迭代对当前迭代执行时间的影响

Question

我正在尝试测量 folly hashmap 中并发插入的性能。这里提供了用于此类插入的程序的简化版本：

#include <folly/concurrency/ConcurrentHashMap.h>
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

const int kNumMutexLocks = 2003;
std::unique_ptr<std::mutex[]> mutices(new std::mutex[kNumMutexLocks]);
__inline__ void
concurrentInsertion(unsigned int threadId, unsigned int numInsertionsPerThread,
                    unsigned int numInsertions, unsigned int numUniqueKeys,
                    folly::ConcurrentHashMap<int, int> &follyMap) {
  int base = threadId * numInsertionsPerThread;
  for (int i = 0; i < numInsertionsPerThread; i++) {
    int idx = base + i;
    if (idx >= numInsertions)
      break;
    int val = idx;
    int key = val % numUniqueKeys;
    mutices[key % kNumMutexLocks].lock();
    auto found = follyMap.find(key);
    if (found != follyMap.end()) {
      int oldVal = found->second;
      if (oldVal < val) {
        follyMap.assign(key, val);
      }
    } else {
      follyMap.insert(key, val);
    }
    mutices[key % kNumMutexLocks].unlock();
  }
}

void func(unsigned int numInsertions, float keyValRatio) {
  const unsigned int numThreads = 12; // Simplified just for this post
  unsigned int numUniqueKeys = numInsertions * keyValRatio;
  unsigned int numInsertionsPerThread = ceil(numInsertions * 1.0 / numThreads);
  std::vector<std::thread> insertionThreads;
  insertionThreads.reserve(numThreads);
  folly::ConcurrentHashMap<int, int> follyMap;

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < numThreads; i++) {
    insertionThreads.emplace_back(std::thread([&, i] {
      concurrentInsertion(i, numInsertionsPerThread, numInsertions,
                          numUniqueKeys, follyMap);
    }));
  }
  for (int i = 0; i < numThreads; i++) {
    insertionThreads[i].join();
  }
  auto end = std::chrono::steady_clock::now();

  auto diff = end - start;
  float insertionTimeMs =
      std::chrono::duration<double, std::milli>(diff).count();
  std::cout << "i: " << numInsertions << "\tj: " << keyValRatio
            << "\ttime: " << insertionTimeMs << std::endl;
}

int main() {
  std::vector<float> js = {0.5, 0.25};
  for (auto j : js) {
    std::cout << "-------------" << std::endl;
    for (int i = 2048; i < 4194304 * 8; i *= 2) {
      func(i, j);
    }
  }  
}

问题是在 main 中使用这个循环，突然增加了 func 函数中的测量时间。也就是说，如果我在没有任何循环的情况下直接从 main 调用该函数（如下所示），某些情况下的测量时间会突然减少 100 倍以上。

int main() {
  func(2048, 0.25); // ~ 100X faster now that the loop is gone.
}

可能的原因

我在构建 hasmap 时分配了大量内存。我相信当我运行循环中的代码时，当正在执行循环的第二次迭代时，计算机正忙于为第一次迭代释放内存。因此，程序变得更慢。如果是这种情况，如果有人可以建议我可以通过循环获得相同结果的更改，我将不胜感激。

更多详情

请注意，如果我在 main 中展开循环，我也会遇到同样的问题。即下面的程序有同样的问题：

int main() {
  performComputation(input A);
  ...
  performComputation(input Z);
}

示例输出

第一个程序的输出如下所示：

i: 2048     j: 0.5  time: 1.39932
...
i: 16777216 j: 0.5  time: 3704.33
------------- 
i: 2048     j: 0.25 time: 277.427 <= sudden increase in execution time
i: 4096     j: 0.25 time: 157.236
i: 8192     j: 0.25 time: 50.7963
i: 16384    j: 0.25 time: 133.151
i: 32768    j: 0.25 time: 8.75953
...
i: 2048     j: 0.25 time: 162.663

运行 func 单独在 main 中与 i=2048 和 j=0.25 产生：

i: 2048     j: 0.25 time: 1.01

任何 comment/insight 非常感谢。

Answer 1

如果是内存分配导致速度变慢，并且 performComputation(input) 之前的内存内容无关紧要，您可以 re-use 分配的内存块。

int performComputation(input, std::vector<char>& memory) { 

  /* Note: memory will need to be passed by reference*/

  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < numThreads; i++) {
    t.emplace_back(std::thread([&, i] {
      func(...); // Random access to memory
    }));
  }

  for (int i = 0; i < numThreads; i++) {
    t[i].join();
  }

  auto end = std::chrono::steady_clock::now();
  float time = std::chrono::duration<double, std::milli>(end - start).count();

}

int main() {

  // A. Allocate ~1GB memory here
  std::vector<char> memory(1028 * 1028 * 1028) //is that 1 gig?

  for (input: inputs)
    performComputation(input, memory);
}

Answer 2

测量挂钟时间时使用平均值！

您正在测量挂钟时间。在这方面，实际看到的 jumps 时间有点小，理论上可能会导致 OS 延迟或其他处理，或者可能由于线程管理而变得更糟（例如清理）由您的程序引起（请注意，这可能会因 platform/system 而有很大差异，并且请记住，上下文切换很容易花费 ~10-15 毫秒）有太多参数在起作用，无法确定。使用挂钟进行测量时，通常的做法是对数百或数千次的循环进行平均，以考虑 spikes/etc...

使用分析器

学习使用探查器 - 探查器可以帮助您快速了解您的程序实际花费时间的地方，并一次又一次地节省宝贵的时间。

Answer 3

我对具体细节不太确定，但在我看来这是构建地图时分配内存的结果。我使用普通 unordered_map 和单个 mutex 复制了您看到的行为，并使 func static 中的地图对象完全修复了它。（实际上现在第一次稍微慢一些，因为还没有为地图分配内存，然后在随后的每个运行中更快且时间一致。）

我不确定为什么这会有所不同，因为地图已被破坏并且内存应该已被释放。出于某种原因，地图释放的内存似乎没有在后续调用 func 时重新使用。也许其他人比我更了解这一点。

编辑：减少最小的、可重现的示例和输出

void func(int num_insertions)
{
    const auto start = std::chrono::steady_clock::now();

    std::unordered_map<int, int> map;
    for (int i = 0; i < num_insertions; ++i)
    {
        map.emplace(i, i);
    }

    const auto end = std::chrono::steady_clock::now();
    const auto diff = end - start;

    const auto time = std::chrono::duration<double, std::milli>(diff).count();
    std::cout << "i: " << num_insertions << "\ttime: " << time << "\n";
}

int main()
{
    func(2048);
    func(16777216);
    func(2048);
}

使用非静态地图：

i: 2048 time: 0.6035
i: 16777216     time: 4629.03
i: 2048 time: 124.44

有静态地图：

i: 2048 time: 0.6524
i: 16777216     time: 4828.6
i: 2048 time: 0.3802

另一个编辑：还应该提到静态版本还需要在最后调用 map.clear()，尽管这与插入的性能问题并不真正相关。

前循环迭代对当前迭代执行时间的影响

Impact of the prior loop iteration on the execution time of the current iteration

c++

time

stl

c++11

folly

测量挂钟时间时使用平均值！

使用分析器