字符串匹配性能:gcc 与 CPython
String matching performance: gcc versus CPython
在研究 Python 和 C++ 之间的性能权衡时,我设计了一个小示例,主要关注哑子字符串匹配。
这里是相关的C++:
using std::string;
std::vector<string> matches;
std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
[&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );
以上是用-O3构建的。
这里是 Python:
def getMatchingPatterns(patterns, text):
return filter(text.__contains__, patterns)
他们都采用大量模式和输入文件,并使用哑子字符串搜索将模式列表过滤到文件中找到的模式。
版本是:
- gcc - 4.8.2 (Ubuntu) 和 4.9.2 (cygwin)
- python - 2.7.6 (Ubuntu) 和 2.7.8 (cygwin)
令我惊讶的是表演。我 运行 在低规格 Ubuntu 和 Python 上都快了大约 20%。在带有 cygwin 的中等规格 PC 上也是如此 - Python 快两倍。
Profiler 显示 99+% 的周期花在字符串匹配上(字符串复制和列表推导无关紧要)。
显然,Python 实现是原生 C,我预计它与 C++ 大致相同,但没想到它这么快。
任何有关 CPython 与 gcc 相比的优化的见解都是非常受欢迎的。
作为参考,这里有完整的例子。输入仅采用一组 50K HTLM(每次测试均从磁盘读取,无特殊缓存):
Python:
import sys
def getMatchingPatterns(patterns, text):
return filter(text.__contains__, patterns)
def serialScan(filenames, patterns):
return zip(filenames, [getMatchingPatterns(patterns, open(filename).read()) for filename in filenames])
if __name__ == "__main__":
with open(sys.argv[1]) as filenamesListFile:
filenames = filenamesListFile.read().split()
with open(sys.argv[2]) as patternsFile:
patterns = patternsFile.read().split()
resultTuple = serialScan(filenames, patterns)
for filename, patterns in resultTuple:
print ': '.join([filename, ','.join(patterns)])
C++:
#include <iostream>
#include <iterator>
#include <fstream>
#include <string>
#include <vector>
#include <unordered_map>
#include <algorithm>
using namespace std;
using MatchResult = unordered_map<string, vector<string>>;
static const size_t PATTERN_RESERVE_DEFAULT_SIZE = 5000;
MatchResult serialMatch(const vector<string> &filenames, const vector<string> &patterns)
{
MatchResult res;
for (auto &filename : filenames)
{
ifstream file(filename);
const string fileContents((istreambuf_iterator<char>(file)),
istreambuf_iterator<char>());
vector<string> matches;
std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
[&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );
res.insert(make_pair(filename, std::move(matches)));
}
return res;
}
int main(int argc, char **argv)
{
vector<string> filenames;
ifstream filenamesListFile(argv[1]);
std::copy(istream_iterator<string>(filenamesListFile), istream_iterator<string>(),
back_inserter(filenames));
vector<string> patterns;
patterns.reserve(PATTERN_RESERVE_DEFAULT_SIZE);
ifstream patternsFile(argv[2]);
std::copy(istream_iterator<string>(patternsFile), istream_iterator<string>(),
back_inserter(patterns));
auto matchResult = serialMatch(filenames, patterns);
for (const auto &matchItem : matchResult)
{
cout << matchItem.first << ": ";
for (const auto &matchString : matchItem.second)
cout << matchString << ",";
cout << endl;
}
}
python 3.4 代码 b'abc' in b'abcabc'
(或您示例中的 b'abcabc'.__contains__(b'abc')
)执行 bytes_contains
method, which in turn calls the inlined function stringlib_find
; which delegates the search to FASTSEARCH
.
FASTSEARCH
函数然后使用简化的 Boyer-Moore search algorithm (Boyer-Moore-Horspool):
fast search/count implementation, based on a mix between boyer-
moore and horspool, with a few more bells and whistles on the top.
for some more background, see: http://effbot.org/zone/stringlib.htm
也有一些修改,如评论所述:
note: fastsearch may access s[n]
, which isn't a problem when using
Python's ordinary string types, but may cause problems if you're
using this code in other contexts. also, the count mode returns -1
if there cannot possible be a match in the target string, and 0
if
it has actually checked for matches, but didn't find any. callers
beware!
GNU C++ Standard Library basic_string<T>::find()
实现尽可能通用(和愚蠢);它只是尝试在每个连续的字符位置愚蠢地匹配模式,直到找到匹配项。
TL;DR:C++ 标准库与 Python 相比如此慢的原因是因为它试图在 [= 之上做一个泛型算法20=],但对于更有趣的情况未能有效地做到这一点;而在 Python 中,程序员可以根据具体情况免费获得最有效的算法。
在研究 Python 和 C++ 之间的性能权衡时,我设计了一个小示例,主要关注哑子字符串匹配。
这里是相关的C++:
using std::string;
std::vector<string> matches;
std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
[&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );
以上是用-O3构建的。
这里是 Python:
def getMatchingPatterns(patterns, text):
return filter(text.__contains__, patterns)
他们都采用大量模式和输入文件,并使用哑子字符串搜索将模式列表过滤到文件中找到的模式。
版本是:
- gcc - 4.8.2 (Ubuntu) 和 4.9.2 (cygwin)
- python - 2.7.6 (Ubuntu) 和 2.7.8 (cygwin)
令我惊讶的是表演。我 运行 在低规格 Ubuntu 和 Python 上都快了大约 20%。在带有 cygwin 的中等规格 PC 上也是如此 - Python 快两倍。 Profiler 显示 99+% 的周期花在字符串匹配上(字符串复制和列表推导无关紧要)。
显然,Python 实现是原生 C,我预计它与 C++ 大致相同,但没想到它这么快。
任何有关 CPython 与 gcc 相比的优化的见解都是非常受欢迎的。
作为参考,这里有完整的例子。输入仅采用一组 50K HTLM(每次测试均从磁盘读取,无特殊缓存):
Python:
import sys
def getMatchingPatterns(patterns, text):
return filter(text.__contains__, patterns)
def serialScan(filenames, patterns):
return zip(filenames, [getMatchingPatterns(patterns, open(filename).read()) for filename in filenames])
if __name__ == "__main__":
with open(sys.argv[1]) as filenamesListFile:
filenames = filenamesListFile.read().split()
with open(sys.argv[2]) as patternsFile:
patterns = patternsFile.read().split()
resultTuple = serialScan(filenames, patterns)
for filename, patterns in resultTuple:
print ': '.join([filename, ','.join(patterns)])
C++:
#include <iostream>
#include <iterator>
#include <fstream>
#include <string>
#include <vector>
#include <unordered_map>
#include <algorithm>
using namespace std;
using MatchResult = unordered_map<string, vector<string>>;
static const size_t PATTERN_RESERVE_DEFAULT_SIZE = 5000;
MatchResult serialMatch(const vector<string> &filenames, const vector<string> &patterns)
{
MatchResult res;
for (auto &filename : filenames)
{
ifstream file(filename);
const string fileContents((istreambuf_iterator<char>(file)),
istreambuf_iterator<char>());
vector<string> matches;
std::copy_if(patterns.cbegin(), patterns.cend(), back_inserter(matches),
[&fileContents] (const string &pattern) { return fileContents.find(pattern) != string::npos; } );
res.insert(make_pair(filename, std::move(matches)));
}
return res;
}
int main(int argc, char **argv)
{
vector<string> filenames;
ifstream filenamesListFile(argv[1]);
std::copy(istream_iterator<string>(filenamesListFile), istream_iterator<string>(),
back_inserter(filenames));
vector<string> patterns;
patterns.reserve(PATTERN_RESERVE_DEFAULT_SIZE);
ifstream patternsFile(argv[2]);
std::copy(istream_iterator<string>(patternsFile), istream_iterator<string>(),
back_inserter(patterns));
auto matchResult = serialMatch(filenames, patterns);
for (const auto &matchItem : matchResult)
{
cout << matchItem.first << ": ";
for (const auto &matchString : matchItem.second)
cout << matchString << ",";
cout << endl;
}
}
python 3.4 代码 b'abc' in b'abcabc'
(或您示例中的 b'abcabc'.__contains__(b'abc')
)执行 bytes_contains
method, which in turn calls the inlined function stringlib_find
; which delegates the search to FASTSEARCH
.
FASTSEARCH
函数然后使用简化的 Boyer-Moore search algorithm (Boyer-Moore-Horspool):
fast search/count implementation, based on a mix between boyer- moore and horspool, with a few more bells and whistles on the top. for some more background, see: http://effbot.org/zone/stringlib.htm
也有一些修改,如评论所述:
note: fastsearch may access
s[n]
, which isn't a problem when using Python's ordinary string types, but may cause problems if you're using this code in other contexts. also, the count mode returns-1
if there cannot possible be a match in the target string, and0
if it has actually checked for matches, but didn't find any. callers beware!
GNU C++ Standard Library basic_string<T>::find()
实现尽可能通用(和愚蠢);它只是尝试在每个连续的字符位置愚蠢地匹配模式,直到找到匹配项。
TL;DR:C++ 标准库与 Python 相比如此慢的原因是因为它试图在 [= 之上做一个泛型算法20=],但对于更有趣的情况未能有效地做到这一点;而在 Python 中,程序员可以根据具体情况免费获得最有效的算法。