Find number of occurrences of substring

Question

I have a small problem. I am working on a programming task and got stuck. It is a simple one, but the time limit makes it a bit harder.

Find number of occurrences of substring. You will be given M - length of substring; substring to find, N - length of base string; base string.
M <= 100 000
N <= 200 000

Input

10
budsvabbud
79
uaahskuskamikrofonubudsvabbudnebudlabutkspkspkspmusimriesitbudsvabbudsvabbudnel

Output
3

I tried the built-in find function, but it is not fast enough:

#include<iostream>
#include<string>

using namespace std;

int main()
{
    int n;
    int occurrences = 0;
    string::size_type start = 0;
    string base_string, to_find;
    cin >> n >> to_find >> n >> base_string;
    while ((start = base_string.find(to_find, start)) != string::npos) {
        ++occurrences;
        ++start; // step past this match start so overlapping occurrences are counted
    }
    cout << occurrences << endl;
}

So I tried writing my own function, but it is even slower:

#include<iostream>
#include<cstdio>
#include<string>
#include<queue>

using namespace std;

int main()
{
    int n, m;
    string to_find;
    queue<int> rada;  
    int occurrences = 0;
    cin >> m >> to_find >> n;
    for (int i = 0; i < n; i++)
    {
        char c;
        scanf(" %c", &c);
        int max = rada.size();
        for (int j = 0; j < max; j++)
        {
            int index = rada.front();
            rada.pop();
            if (c == to_find[index])  
            {
                if (++index == m) {
                    occurrences++;
                }
                else
                    rada.push(index);
            }
        }
        if (c == to_find[0])
        {
            if (1 == m)
                occurrences++; // a one-character pattern is fully matched here
            else
                rada.push(1);
        }
    }
    cout << occurrences << endl;

}

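I know some people have solved this in 0 ms, but my first code takes more than 2000 ms and the second one takes even longer. Do you have any idea how to solve this? Thanks.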

Edit: length limits:

M <= 100 000 - length of the substring

N <= 200 000 - length of the base string

Answer 1

I tried this code in debug mode without any optimizations and it took 11 ms. VS.NET 2013, Intel Core i7:

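#include <cstdio>
#include <ctime>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    int occurrences = 0;
    string::size_type start = 0;
    string base_string, to_find;
    base_string.reserve(200000);
    to_find.reserve(100000);
    // Base string: 100 000 'a' characters followed by 100 000 'b' characters.
    for (size_t i = 0; i < 100000; i++){
        base_string.push_back('a');
    }
    for (size_t i = 0; i < 100000; i++){
        base_string.push_back('b');
    }
    // Substring to find: 100 000 'b' characters.
    for (size_t i = 0; i < 100000; i++){
        to_find.push_back('b');
    }
    auto start_s = clock();
    while ((start = base_string.find(to_find, start)) != string::npos) {
        ++occurrences;
        ++start; // step past this match start so overlapping occurrences are counted
    }
    auto stop_s = clock();
    // Elapsed time in milliseconds, then the number of occurrences.
    cout << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000 << " ms" << endl;
    cout << occurrences << endl;
    getchar(); // keep the console window open
}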

The problem is with your compiler, your configuration, or your machine, but not in your code.

Answer 2

Safe version

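#include <cstddef>
#include <cstring>
#include <stdexcept>

// Single-pass scan. After a mismatch it restarts at the next character without
// re-examining the current one, and it resets after each full match, so
// overlapping occurrences are not counted and some matches can be missed
// (for example, "ab" in "aab").
static size_t findOccurences(const char * const aInput, const char * const aDelim)
{
    if (aInput == nullptr || aDelim == nullptr)
    {
        throw std::runtime_error("Argument(s) null");
    }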

    const size_t inputLength = strlen(aInput);
    const size_t delimLength = strlen(aDelim);

    size_t result = 0;

    if (delimLength <= inputLength && delimLength > 0)
    {
        size_t delimIndex = 0;

        for (size_t inputIndex = 0; inputIndex < inputLength; inputIndex++)
        {
            if (aInput[inputIndex] != aDelim[delimIndex])
            {
                delimIndex = 0;
            }
            else
            {
                delimIndex++;

                if (delimIndex == delimLength)
                {
                    delimIndex = 0;
                    result++;
                }
            }
        }
    }

    return result;
}

Unsafe version

// Same scan without argument checks: both pointers must be non-null.
static size_t unsafeFindOccurences(const char * const aInput, const char * const aDelim)
{
    const size_t inputLength = strlen(aInput);
    const size_t delimLength = strlen(aDelim);

    size_t result = 0;
    size_t delimIndex = 0;

    for (size_t inputIndex = 0; inputIndex < inputLength; inputIndex++)
    {
        if (aInput[inputIndex] != aDelim[delimIndex])
        {
            delimIndex = 0;
        }
        else
        {
            delimIndex++;

            if (delimIndex == delimLength)
            {
                delimIndex = 0;
                result++;
            }
        }
    }

    return result;
}

Results (safe)

          x86        x64
Debug     5501ms     5813ms
Release   3889ms     3998ms

Results (unsafe)

          x86        x64
Debug     5442ms     5564ms
Release   3074ms     3139ms

Compiled with Visual Studio 2015, using the Visual Studio 2015 (v140) toolset, on Windows 10 x64 Pro.

Used this input, searching for 'ad' over 1,000,000 iterations.
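One thing worth checking against the question's sample: the expected output of 3 only works out if overlapping occurrences are counted ("budsvabbudsvabbud" contains the substring twice). Below is a minimal sketch of a reference counter that can be used to cross-check a faster implementation on small inputs. It assumes overlapping matches should count, and `countOverlapping` is a hypothetical helper name, not something from the posts above.

```cpp
#include <cstddef>
#include <string>

// Hypothetical reference helper: counts every occurrence of pattern in text,
// including overlapping ones, by restarting the search one character after
// the start of each match. Intended for checking results, not for speed.
std::size_t countOverlapping(const std::string& text, const std::string& pattern)
{
    if (pattern.empty()) {
        return 0; // assumption: an empty pattern counts as zero occurrences
    }
    std::size_t count = 0;
    std::size_t pos = text.find(pattern);
    while (pos != std::string::npos) {
        ++count;
        pos = text.find(pattern, pos + 1);
    }
    return count;
}
```

Called with the base string and substring from the question, this returns 3, while a counter that skips overlaps would report 2 for the same input.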

Answer 3

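The algorithm you provided is O(M*N), where N is the length of the text and M is the length of the searched word. Library implementations often use the naive algorithm as well. However, there is an algorithm by Knuth, Morris and Pratt that runs in O(M+N) time. See, for example, the Wikipedia article on the Knuth-Morris-Pratt algorithm. There are also alternatives that might be easier to implement, such as Boyer-Moore-Horspool.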
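To make the O(M+N) idea concrete, here is a minimal sketch of counting occurrences with the Knuth-Morris-Pratt failure table. It assumes overlapping matches should be counted, as the question's sample implies. The names `buildFailure` and `countOccurrencesKmp` are hypothetical, chosen only for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: failure[i] is the length of the longest proper prefix
// of pattern[0..i] that is also a suffix of pattern[0..i].
std::vector<std::size_t> buildFailure(const std::string& pattern)
{
    std::vector<std::size_t> failure(pattern.size(), 0);
    std::size_t k = 0; // length of the currently matched prefix
    for (std::size_t i = 1; i < pattern.size(); ++i) {
        while (k > 0 && pattern[i] != pattern[k]) {
            k = failure[k - 1]; // fall back to the next shorter border
        }
        if (pattern[i] == pattern[k]) {
            ++k;
        }
        failure[i] = k;
    }
    return failure;
}

// Counts all (possibly overlapping) occurrences of pattern in text.
// Each text character is read once; fallbacks are amortized, giving O(M+N).
std::size_t countOccurrencesKmp(const std::string& text, const std::string& pattern)
{
    if (pattern.empty() || pattern.size() > text.size()) {
        return 0; // assumption: an empty pattern counts as zero occurrences
    }
    const std::vector<std::size_t> failure = buildFailure(pattern);
    std::size_t count = 0;
    std::size_t k = 0; // number of pattern characters matched so far
    for (char c : text) {
        while (k > 0 && c != pattern[k]) {
            k = failure[k - 1];
        }
        if (c == pattern[k]) {
            ++k;
        }
        if (k == pattern.size()) {
            ++count;
            k = failure[k - 1]; // keep the longest border so overlaps are found
        }
    }
    return count;
}
```

With the question's input, `countOccurrencesKmp(base_string, to_find)` can stand in for the `find` loop. If C++17 is available, `std::search` with `std::boyer_moore_horspool_searcher` from `<functional>` is a ready-made alternative to writing the search by hand.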
