为什么这个 C++ 双端队列代码在我解模时是 'less efficient'？

Question

这是我在研究this problem on HackerRank的解决方案时遇到的问题。它基本上可以归结为以下内容：给定一个数组 A 和一个整数 K，该问题要求您找到大小为 K 的 A 的每个连续子数组的最大值。还有一些额外的东西（对于每个测试用例，这个问题会出现 T 次，并且必须从输入中读取才能获得 A），但这就是它的要点。

本站会检查您的回答是否足够有效，即使您的代码最终产生了正确的输出，也可能会因超时而导致提交失败。现在，当您第一次进入代码区时，有一些预定义代码供您使用：

#include <iostream>
#include <deque> 
using namespace std;

void printKMax(int arr[], int n, int k){
    //Write your code here.
}

int main(){

    int t;
    cin >> t;
    while(t>0) {
        int n,k;
        cin >> n >> k;
        int i;
        int arr[n];
        for(i=0;i<n;i++)
            cin >> arr[i];
        printKMax(arr, n, k);
        t--;
    }
    return 0;
}

一切都很好，该站点已经为您的利益处理了输入，并为您指明了模块化方法。您所要做的就是实际解决连续子数组问题；我用以下函数完成了它，它通过了所有测试用例：

void printKMax(int arr[], int n, int k){
    // Queue will store up to k elements (moving window)
    // Elements stored will be indices of the array. Indices allow to compare with loop iterator for removal of oldest element as window moves
    // Most recent elements are pushed to the back (so they take longer to be popped)
    std::deque<int> Q(k); 

    // Iterate over the first window
    int i; 
    for (i = 0; i < k; ++i) { 
        // A new element (arr[i]) to be added at the back invalidates all elements added before it that are no greater (if they are smaller, they cannot be a maximum; if they are equal, the new element is more recent). Those elements are popped
        // This guarantees that the values corresponding to Q's indices will be increasing (strictly) from back to front. In particular, the front will be the maximum
        while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
            Q.pop_back();
        Q.push_back(i); 
    }

    // Iterate, letting the window move
    for (; i < n; ++i) { 
        // Print the largest element (window has not moved yet), followed by a space
        cout << arr[Q.front()] << " "; 

        // Use indices to determine if oldest element has gone out of the window
        if (Q.front() <= i - k) Q.pop_front();

        // Add new element, like before
        while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
            Q.pop_back(); 
        Q.push_back(i); 
    } 

    // Print the maximum element of last window, and end the line 
    cout << arr[Q.front()] << endl; 
}

然而，回顾过去，我认为我可以（稍微？）做得更好。对于每个测试用例，主循环中的输入处理循环 n 次以填充数组 arr，然后 printKMax 依次循环 n 次以创建和滑动移动 window。我想我可以将它们组合成一个循环，我想这应该更有效率。我用下面的代码做到了，它与以前基本相同，但是 printKMax 将 合并到 中。这次我会省略评论。

int main(){

    int t;
    cin >> t;
    while(t>0) {
        int i = 0, n, k;
        cin >> n >> k;
        int arr[n];

        std::deque<int> Q(k);

        for (; i < k; ++i) {
            cin >> arr[i];
            while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
                Q.pop_back();
            Q.push_back(i); 
        }

        for (; i < n; ++i) {
            cout << arr[Q.front()] << " "; 

            if (Q.front() <= i - k) Q.pop_front();

            cin >> arr[i];
            while ((!Q.empty()) && arr[i] >= arr[Q.back()]) 
                Q.pop_back(); 
            Q.push_back(i); 
        } 

        cout << arr[Q.front()] << endl;

        t--;
        }
    return 0;
}

然而，这一次，代码 失败了 除了最简单的测试用例之外，所有测试用例都由于时间限制而失败。

我知道在实践中模块化很好，但在这一点上我的兴趣比其他任何东西都重要 'academic'。关于实际程序我是否遗漏了什么，或者这是否与我不知道的运行幕后情况有关？

编辑： 正如 PaulMcKenzie 所建议的，我使用 g++ 运行程序的两个版本，使用失败的测试用例之一作为输入。运行都很好并产生了相同的输出，但是带有 printKMax 运行的输出为 3.97 秒，而没有运行的输出为 5.44 秒。大约需要 37% 的时间。

Answer 1

"de-modularized" 版本运行缓慢的一个原因是循环中有更多 I/O。来自 C++ 标准：

Reading an object designated by a volatile glvalue (6.10), modifying an object, calling a library I/O function, or calling a function that does any of those operations are all side effects, which are changes in the state of the execution environment.

调用 I/O 函数后，代码必须从内存中重新加载对象（除非对象根本不需要存储在内存中，例如地址没有泄露的局部变量）。请注意，任何在编译时定义不可用的函数都被视为调用 I/O 函数。

为了提高反模块化版本的性能，请不要在其中做任何I/O。而是将结果存储到另一个数组 int results[k]; 并稍后打印该数组。

另一个问题是默认情况下 std::cin 和 std::cout 在 sync_with_stdio:

In practice, this means that the synchronized C++ streams are unbuffered, and each I/O operation on a C++ stream is immediately applied to the corresponding C stream's buffer.

尝试调用 std::cin.sync_with_stdio(false); 和 std::cout.sync_with_stdio(false); - 这会使 C++ I/O 更快。

Answer 2

你的第二个版本比较慢，因为你交错输入和输出。

从 cin 读取会导致首先刷新为 cout 缓冲的任何输出。这个想法是有一个用户需要看到你的提示。

由于每次测试在读和写之间切换多次次，因此会导致许多I/O次操作写入控制台，这不是那么快。

参见：https://www.geeksforgeeks.org/fast-io-for-competitive-programming/ and https://en.cppreference.com/w/cpp/io/basic_ios/tie

你可以用cin.tie(NULL);

修复它

为什么这个 C++ 双端队列代码在我解模时是 'less efficient'？

Why is this C++ deque code 'less efficient' when I demodularize it?

c++

algorithm

performance

deque

stddeque