Faster way to split a string into substrings of equal length

Question

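I want to write a function that splits a string into substrings of equal length n, advancing one character at a time, and returns them as a character vector.

For example, f('atgctgttg', n = 5) should return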

'atgct','tgctg','gctgt','ctgtt','tgttg'

I have tried two different functions:

// [[Rcpp::export]]
CharacterVector f( const std::string str, const int n ) {
    int lim = str.length() - n + 1;
    CharacterVector result( lim );
    for ( int j = 0; j < lim; j++ )
    { 
        result[j] = str.substr( j, n );
    }
    return result;
}

and

// [[Rcpp::export]]
CharacterVector f1( const std::string str, const int n ) {
    const int lim = str.length();
    const int n1 = n - 1;
    CharacterVector result( lim - n1 );
    int j = 1;
    std::string tmp = str.substr( 0, n );
    result[0] = tmp;

    for ( int i = n; i < lim; i++ )
    {
        tmp.erase( 0, 1 );
        tmp.push_back( str[i] );
        result[j] = tmp;
        j++;
    }
    return result;
}

I also tried using iterators, but that was no faster than f1. Note that Rcpp converts the input into reference variables. Is there a faster way to do this?

Answer 1

First, there is a problem with your function signature:

CharacterVector f( const std::string str, const int n )

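You are passing the string by value, so every call to the function copies the string (unless you pass a movable string with C++11). It is better to pass the string by const reference: const std::string& str.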

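Two possible answers to the question come to mind.

1. The return value is an actual copy of the input string's characters. In that case, iterating over the string by index and inserting the new strings into the structure, as in your code example 1, should be fast (the fastest option is probably to make just one copy: the substring copied into the structure).
2. The return value is a structure of pointers into the real string, for example proxy objects that hold the (start, end) of each substring within the string. The advantage is that nothing is copied. For example:

Code (tested with GCC 4.9.2 and C++11)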

#include <iostream>
#include <string>
#include <vector>

struct string_ref {
    const char* start;
    const char* end;
};

// [[Rcpp::export]]
std::vector<string_ref> f(std::string&&, const int) = delete; // disallow calls with temporaries
// [[Rcpp::export]]
std::vector<string_ref> f(const std::string& str, const int n) {
    int lim = str.length() - n + 1;
    std::vector<string_ref> result(lim);
    for (int j = 0; j < lim; j++) {
        result[j] = { &str[j], &str[j + n] };
    }
    return result;
}

int main() {
    std::string input{"atgctgttg"};
    auto result = f(input, 5);
    for (const auto& r : result) {
        std::cout << std::string(r.start, r.end) << std::endl;
    }
    return 0;
}

Many libraries that parse text (lexers, regex engines, and so on) use this approach. C++17 provides the std::string_view type for referring to part or all of a string's characters.

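Judging by the comments in the code, you are implementing a function to be called from R (I am not certain of this). In that case the second solution may cause memory-access problems: the input string's memory must remain accessible and valid for as long as the substring pointers are in use. If the input string is created in R and f is then called, returning pointers may work, but the best proof is to test it.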

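As for the two examples in the question: the first will be faster, because the second performs an erase and a push_back of one character on every loop iteration. Erasing the first character most likely requires moving all the remaining characters of the string in most STL implementations, and push_back may in some cases need to grow the string's memory.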
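To make the copying approach concrete, here is a minimal sketch that takes the string by const reference and makes exactly one copy per substring. It returns a plain std::vector<std::string> instead of an Rcpp CharacterVector so that it compiles outside R. The name sliding_copies and the empty result for n == 0 or n larger than the input are assumptions made for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Copies every length-n window of `str` into a vector, one copy per window.
std::vector<std::string> sliding_copies(const std::string& str, std::size_t n) {
    std::vector<std::string> result;
    if (n == 0 || n > str.size()) {
        return result;  // assumption: such inputs simply yield no windows
    }
    const std::size_t count = str.size() - n + 1;
    result.reserve(count);  // one allocation for the vector's own storage
    for (std::size_t j = 0; j < count; ++j) {
        // Constructs the substring in place via string(const string&, pos, len).
        result.emplace_back(str, j, n);
    }
    return result;
}
```

Using std::size_t for the window count and checking n up front also avoids the mixed signed/unsigned arithmetic in str.length() - n + 1 when n exceeds the string length.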

Answer 2

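The approach I would use is to create one iterator to the beginning of the string and another to one past the end of the first substring. Then use emplace_back() on a std::vector to construct each substring in place at the end of the vector, and increment both iterators until the end of the string is reached.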

std::vector<std::string> splitString(const std::string& str, std::size_t len)
{
    if (len >= str.size())
        return { str };
    auto it = str.begin();
    auto end = it + len;
    std::vector<std::string> strings;
    while (end != str.end())
    {
        strings.emplace_back(it, end);
        ++end;
        ++it;
    }
    // have to do this to get the last string since end == str.end()
    strings.emplace_back(it, end);
    return strings;
}


Answer 3

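The compiler will turn your f function into the fastest code if you change it to take the string by reference: CharacterVector f(const std::string& str, const int n)

Although you won't see a speed improvement, you can definitely simplify things by dropping CharacterVector and just using vector<string>: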

const string str("atgctgttg");
const int n = 5; // Assumed positive number smaller than str.size()
const int n1 = n - 1;
vector<string> result(str.size() - n1);

transform(str.cbegin(), str.cend() - n1, result.begin(), [n](const auto& i) {return string(&i, n);});


One way you could see a speed improvement is if you can use array instead of string:

constexpr int N = 5; // Assumed: array needs the substring length as a compile-time constant
const string str("atgctgttg");
const int n1 = N - 1;
vector<array<char, N>> result(str.size() - n1);

transform(str.cbegin(), str.cend() - n1, result.begin(), [](const auto& i) {
    array<char, N> result;

    copy_n(&i, N, result.begin());
    return result;
});


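But by far the fastest (and best) way to do this is to work on the original string rather than break it into an array of strings. This takes a bit more work on the back end, because you need to work with C-strings rather than std::strings. For example, I have been printing all my vectors with for (auto& i : result) cout << string(i.data(), N) << endl; but if you are not using a vector, you can print like this: for (auto i = str.cbegin(); i != str.cend() - n1; ++i) printf("%.*s\n", n, &*i); Obviously that is a bit more work, but if your str is large you will find it faster.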



c++

r

rcpp