Splitting string using boost spirit
Is this a good idea? For some reason I thought it should be faster than boost's tokenizer or split. However, most of the time I find it stuck in boost::spirit::compile.
template <typename Iterator>
struct ValueList : bsq::grammar<Iterator, std::vector<std::string>()>
{
    ValueList(const std::string& sep, bool isCaseSensitive) : ValueList::base_type(query)
    {
        if(isCaseSensitive)
        {
            query = value >> *(sep >> value);
            value = *(bsq::char_ - sep);
        }
        else
        {
            auto separator = bsq::no_case[sep];
            query = value >> *(separator >> value);
            value = *(bsq::char_ - separator);
        }
    }

    bsq::rule<Iterator, std::vector<std::string>()> query;
    bsq::rule<Iterator, std::string()> value;
};

inline bool Split(std::vector<std::string>& result, const std::string& buffer, const std::string& separator,
                  bool isCaseSensitive)
{
    result.clear();
    ValueList<std::string::const_iterator> parser(separator, isCaseSensitive);
    auto itBeg = buffer.begin();
    auto itEnd = buffer.end();
    if(!(bsq::parse(itBeg, itEnd, parser, result) && (itBeg == itEnd)))
        result.push_back(buffer);
    return true;
}
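I've implemented it as shown above. What is wrong with my code? Or is the recompilation unavoidable simply because the separator is defined at runtime?
EDIT001:
An example of a possible implementation with boost::split, and a comparison with the original implementation using the tokenizer, is on CoLiRu.
It looks like coliru is down now. In any case, these are the results of 1M runs on the string "2lkhj309|ioperwkl|20sdf39i|rjjdsf|klsdjf230o|kx23904iep2|xp39f4p2|xlmq2i3219" with the separator "|":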
8000000 splits in 1081ms.
8000000 splits in 1169ms.
8000000 splits in 2663ms.
The first is for the tokenizer, the second for boost::split, and the third for boost::spirit.
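First off, the different versions don't do the same thing:
- in particular, token compression behaves differently (I fixed it for boost::split, but it appears not to be a feature of boost::tokenizer)
- the separator was treated as a string literal instead of a character set in the Spirit version (fixed)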
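To make the first difference concrete, here is a minimal standard-library sketch (not Boost's implementation) that splits on a set of delimiter characters and either keeps or drops empty tokens. The name `split_on_chars` and the `keep_empty` flag are made up for illustration. Keeping empty tokens is meant to mirror `boost::split` with `token_compress_off`; dropping them mirrors, as far as I know, the default policy of `boost::char_separator`.

```cpp
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not part of Boost.
// Splits `input` at every character that occurs in `delims`.
std::vector<std::string> split_on_chars(const std::string& input,
                                        const std::string& delims,
                                        bool keep_empty)
{
    std::vector<std::string> tokens;
    std::string::size_type start = 0;
    for (;;) {
        // Position of the next delimiter character, or npos if none is left.
        const std::string::size_type pos = input.find_first_of(delims, start);
        const std::string::size_type len =
            (pos == std::string::npos ? input.size() : pos) - start;
        if (keep_empty || len > 0)
            tokens.push_back(input.substr(start, len));
        if (pos == std::string::npos)
            break;
        start = pos + 1; // continue right after the delimiter
    }
    return tokens;
}
```

The benchmark input in the full listing ends in "||", so keeping empty tokens yields 10 tokens per call and dropping them yields 8. Over 1M iterations that accounts for the 10000000 versus 8000000 counts in the timings.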
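Yes, recompilation is unavoidable with a dynamic separator. But no, that is not the bottleneck (the other approaches have dynamic separators too):
I did some optimizations. The timings: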
-
8000000 original (boost::tokenizer) rate: 2.84257μs
10000000 possible (boost::split) rate: 3.09941μs
10000000 spirit (dynamic version) rate: 1.45456μs
10000000 spirit (direct version, avoid type erasure) rate: 1.25588μs
next step:
10000000 spirit (precompiled sep) rate: 1.18059μs
-
8000000 original (boost::tokenizer) rate: 2.92805μs
10000000 possible (boost::split) rate: 2.75442μs
10000000 spirit (dynamic version) rate: 1.32821μs
10000000 spirit (direct version, avoid type erasure) rate: 1.10712μs
next step:
10000000 spirit (precompiled sep) rate: 1.0791μs
Local system g++:
sehe@desktop:/tmp$ time ./test
8000000 original (boost::tokenizer) rate: 1.80061μs
10000000 possible (boost::split) rate: 1.29754μs
10000000 spirit (dynamic version) rate: 0.607789μs
10000000 spirit (direct version, avoid type erasure) rate: 0.488087μs
next step:
10000000 spirit (precompiled sep) rate: 0.498769μs
As you can see, the Spirit approach doesn't have to be slower. What steps did I take? http://paste.ubuntu.com/11001344/
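- refactored the benchmark to display rates (no change): 3.523μs
- left the choice of case insensitivity to the caller (just use no_case[char_(delimiter)] if required): 2.742μs
- eliminated the value subrule (this reduces copying and the dynamic dispatch caused by type-erased non-terminal rules): 2.579μs
- made the delimiter a character set instead of a string literal: 2.693μs
- see intermediate version on coliru. (I recommend the code below, it's much more cleaned up)
- used qi::raw[] instead of std::string synthesized attributes (avoids copying!): 0.624072μs
- eliminated all non-terminals (i.e. type erasure; see the spirit_direct implementation): 0.491011μs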
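The big jump at the qi::raw[] step comes from no longer building each token one character at a time. As a rough standard-library analogy (this is not how Spirit is implemented, and both function names are hypothetical), compare appending characters to a `std::string` with remembering the token boundaries and constructing each string once from the iterator range:

```cpp
#include <string>
#include <vector>

// Hypothetical illustration: grow every token character by character,
// roughly what filling a std::string attribute element-wise amounts to.
std::vector<std::string> split_char_by_char(const std::string& input, char delim)
{
    std::vector<std::string> tokens(1);
    for (char c : input) {
        if (c == delim)
            tokens.emplace_back();      // start a new, empty token
        else
            tokens.back().push_back(c); // append to the current token
    }
    return tokens;
}

// Hypothetical illustration: remember where the token starts and construct
// the string once from [first, it), similar in spirit to qi::raw[].
std::vector<std::string> split_by_range(const std::string& input, char delim)
{
    std::vector<std::string> tokens;
    auto first = input.begin();
    for (auto it = input.begin(); ; ++it) {
        if (it == input.end() || *it == delim) {
            tokens.emplace_back(first, it); // one construction per token
            if (it == input.end())
                break;
            first = it + 1;
        }
    }
    return tokens;
}
```

Both functions return the same tokens, empty ones included. Only the way each string gets filled differs.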
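By now it seems obvious that all implementations would benefit from not "compiling" the separator each time. I didn't do that for all the approaches, but for fun let's do it for the Spirit version:
- used a hardcoded '|' delimiter: 0.455269μs (fixed delimiter)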
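The same idea applies outside Spirit: whatever gets derived from the separator can be built once and reused. A minimal sketch, assuming a hypothetical `CharSetSplitter` class (not a Boost type) that turns the delimiter string into a lookup table in its constructor:

```cpp
#include <array>
#include <string>
#include <vector>

// Hypothetical class for illustration; not part of Boost.
class CharSetSplitter {
public:
    // "Compile" the delimiter set once into a 256-entry lookup table.
    explicit CharSetSplitter(const std::string& delims) : is_delim_{} {
        for (unsigned char c : delims)
            is_delim_[c] = true;
    }

    // Reuse the table on every call; empty tokens are kept.
    void operator()(std::vector<std::string>& result, const std::string& input) const {
        result.clear();
        auto first = input.begin();
        for (auto it = input.begin(); ; ++it) {
            if (it == input.end() || is_delim_[static_cast<unsigned char>(*it)]) {
                result.emplace_back(first, it);
                if (it == input.end())
                    break;
                first = it + 1;
            }
        }
    }

private:
    std::array<bool, 256> is_delim_; // true for every delimiter character
};
```

A caller would construct it once, for example `static const CharSetSplitter split_pipes("|");`, and then call `split_pipes(result, input)` inside the hot loop.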
Full listing:
#include <boost/algorithm/string.hpp>
#include <boost/tokenizer.hpp>
#include <boost/spirit/include/qi.hpp>
#include <vector>
#include <string>
#include <chrono>
#include <iostream>

// boost::tokenizer with char_separator (drops empty tokens by default)
void original(std::vector<std::string>& result, const std::string& input, const std::string& delimiter)
{
    result.clear();
    boost::char_separator<char> sep(delimiter.c_str());
    boost::tokenizer<boost::char_separator<char>, std::string::const_iterator, std::string> tok(input, sep);

    for (const auto& token : tok)
    {
        result.push_back(token);
    }
}

// boost::split, keeping empty tokens
void possible(std::vector<std::string>& result, const std::string& input, const std::string& delimiter)
{
    result.clear();
    boost::split(result, input, boost::is_any_of(delimiter), boost::algorithm::token_compress_off);
}

namespace bsq = boost::spirit::qi;

// Parser expression used inline: no rule, hence no type erasure
void spirit_direct(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
    result.clear();
    using namespace bsq;
    if (!parse(input.begin(), input.end(), raw[*(char_ - char_(delimiter))] % char_(delimiter), result))
        result.push_back(input);
}

namespace detail {
    // Wraps the parser expression in a (type-erased) qi::rule
    template <typename Sep> bsq::rule<std::string::const_iterator, std::vector<std::string>()>
    make_spirit_parser(Sep const& sep)
    {
        using namespace bsq;
        return raw[*(char_ - sep)] % sep;
    }

    // Built once, reused on every call
    static const auto precompiled_pipes = make_spirit_parser('|');
}

// Builds the rule from the runtime delimiter on every call
void spirit(std::vector<std::string>& result, const std::string& input, char const* delimiter)
{
    result.clear();
    if (!bsq::parse(input.begin(), input.end(), detail::make_spirit_parser(bsq::char_(delimiter)), result))
        result.push_back(input);
}

void spirit_pipes(std::vector<std::string>& result, const std::string& input)
{
    result.clear();
    if (!bsq::parse(input.begin(), input.end(), detail::precompiled_pipes, result))
        result.push_back(input);
}

// Runs `approach` 1M times; prints the summed token count and the average time per call
template <typename F> void bench(std::string const& caption, F approach) {
    size_t const iterations = 1000000;

    using namespace std::chrono;
    using C = high_resolution_clock;

    auto start = C::now();

    size_t count = 0;
    for (auto i = 0U; i < iterations; ++i) {
        count += approach();
    }

    auto us = duration_cast<std::chrono::microseconds>(C::now() - start).count();
    std::cout << count << " " << caption << " rate: " << (1.*us/iterations) << "μs\n";
}

int main()
{
    std::string const input = "2309|ioperwkl|2039i|rjjdsf|klsdjf230o|kx23904iep2|xp,39,4p2|xlmq2i3219||";
    auto separator = "|";

    std::vector<std::string> result;

    bench("original (boost::tokenizer)", [&] {
        original(result, input, separator);
        return result.size();
    });
    bench("possible (boost::split)", [&] {
        possible(result, input, separator);
        return result.size();
    });
    bench("spirit (dynamic version)", [&] {
        spirit(result, input, separator);
        return result.size();
    });
    bench("spirit (direct version, avoid type erasure)", [&] {
        spirit_direct(result, input, separator);
        return result.size();
    });

    std::cout << "\nnext step:\n";
    bench("spirit (precompiled sep)", [&] {
        spirit_pipes(result, input);
        return result.size();
    });
}