Iterate over a string, but keep the delimiters in substrings, including other rules
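I'm trying to iterate over a string so that it is broken into substrings, which are appended to the end of a vector. I'm also trying to apply a few other rules. (Apostrophes count as alphanumeric; a ',' is OK if it appears between digits; a '.' is OK if it appears before a digit/whitespace or between digits.)
For example: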
This'.isatest!!!!andsuch .1,00,0.011#$%@
The result would be:
myvector[This'][.][isatest][!!!!][andsuch][.1,00,0.011][#$%@]
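I have no trouble with the if statements that split on non-alphanumeric characters (and apostrophes) or that handle ',' and '.', but I'm running into trouble keeping the delimiters together. Currently, I get something more like: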
myvector[This'][.][isatest][!][!][!][!][andsuch][.1,00,0.011][#][$][%][@]
Any helpful tips?
You want to tokenize with some domain-specific productions (e.g. "comma delimited numbers"). My weapon of choice is a parser generator from Boost: Boost Spirit.
Note I've added a
Here you go:
#include <boost/spirit/home/x3.hpp>
#include <cassert>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

Tokens smart_split(std::string const& s) {
    Tokens tokens;
    using namespace boost::spirit::x3;

    // Word characters: ASCII letters plus the apostrophe.
    auto wordc = char_("a-zA-Z'");

    // Each token is the raw text of the first alternative that matches:
    // a comma-separated list of numbers, a run of word characters,
    // or a run of anything else.
    parse(s.begin(), s.end(), *raw[double_ % ',' | +wordc | +~wordc], tokens);
    return tokens;
}

#include <iostream>
#include <iomanip>

int main()
{
    Tokens const expected { "This'",".","isatest","!!!!","andsuch",".1,00,0.11","#$%@" };
    Tokens const actual = smart_split("This'.isatest!!!!andsuch.1,00,0.11#$%@");

    for (auto const& t : actual)
        std::cout << std::quoted(t) << ",";

    assert(actual == expected);
}
Prints
"This'",".","isatest","!!!!","andsuch",".1,00,0.11","#$%@",
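Because I might be slightly cuckoo, I also took the time to write a second, manual parser.
As you can see, it's not exactly simple. It's tedious, error-prone, hard to maintain and far less generic. Your pick!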
Pro Tip: Write code you understand. That gives you the fleeting chance to maintain it.
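#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <string>

// [first, last) must lie inside a null-terminated buffer: std::strtod
// keeps reading until it meets a character it cannot parse.
template <typename Out>
Out smart_split(char const* first, char const* last, Out out) {
    auto it = first;
    std::string token;

    // Flush the pending token, if any, to the output iterator.
    auto emit = [&] {
        if (!token.empty())
            *out++ = token;
        token.clear();
        return out;
    };

    // <cctype> functions need a value representable as unsigned char.
    auto is_digit = [](char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; };
    auto is_wordc = [](char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0 || c == '\''; };

    enum { NUMBER_LIST, OTHER } state = OTHER;

    while (it != last) {
#ifndef NDEBUG
        std::cout << std::string(it - first, ' ') << std::string(it, last) << " (token: '" << token << "')\n";
#endif
        if (is_digit(*it) || *it == '-' || *it == '+' || *it == '.') {
            // Possible start of a number: flush the pending token unless
            // we are in the middle of a number list.
            if (state != NUMBER_LIST)
                emit();

            char* e;
            std::strtod(it, &e);

            if (it < e) {
                // strtod consumed [it, e): append the number's text.
                token.append(it, static_cast<char const*>(e));
                it = e;

                // A comma right after the number keeps the list going.
                // A number not followed by ',' leaves the state unchanged.
                if (it != last && *it == ',') {
                    token += *it++;
                    state = NUMBER_LIST;
                }
            }
            else {
                // Not a number after all (e.g. a lone '.'): keep the character.
                token += *it++;
            }
        }
        else if (is_wordc(*it)) {
            // A run of letters and apostrophes forms one token.
            state = OTHER;
            emit();
            while (it != last && is_wordc(*it)) {
                token += *it++;
            }
            emit();
        }
        else {
            // Any other character extends the pending token.
            if (state == NUMBER_LIST)
                emit();
            state = OTHER;
            token += *it++;
        }
    }

    return emit();
}

#include <vector>

typedef std::vector<std::string> Tokens;

int main()
{
    std::string const input = "This'.isatest!!!!andsuch.1,00,0.11#$%@";
    Tokens actual;
    smart_split(input.data(), input.data() + input.size(), std::back_inserter(actual));

    for (auto& token : actual)
        std::cout << token << "\n";
}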
Prints:
This'
.
isatest
!!!!
andsuch
.1,00,0.11
#$%@
In a DEBUG build, the manual parser also traces its progress through the loop:
This'.isatest!!!!andsuch.1,00,0.11#$%@ (token: '')
     .isatest!!!!andsuch.1,00,0.11#$%@ (token: '')
      isatest!!!!andsuch.1,00,0.11#$%@ (token: '.')
             !!!!andsuch.1,00,0.11#$%@ (token: '')
              !!!andsuch.1,00,0.11#$%@ (token: '!')
               !!andsuch.1,00,0.11#$%@ (token: '!!')
                !andsuch.1,00,0.11#$%@ (token: '!!!')
                 andsuch.1,00,0.11#$%@ (token: '!!!!')
                        .1,00,0.11#$%@ (token: '')
                           00,0.11#$%@ (token: '.1,')
                              0.11#$%@ (token: '.1,00,')
                                  #$%@ (token: '.1,00,0.11')
                                   $%@ (token: '#')
                                    %@ (token: '#$')
                                     @ (token: '#$%')