Boost regex iterator returning empty string
I am a beginner with C++ regular expressions, and I would like to know why this code:
#include <iostream>
#include <string>
#include <boost/regex.hpp>
int main() {
    std::string s = "? 8==2 : true ! false";
    boost::regex re("\\?\\s+(.*)\\s*:\\s*(.*)\\s*\\!\\s*(.*)");
    boost::sregex_token_iterator p(s.begin(), s.end(), re, -1); // iterator over the sequence and that reg exp
    boost::sregex_token_iterator end;                           // end-of-sequence marker
    while (p != end)
        std::cout << *p++ << '\n';
}
prints an empty string. I put the regex into regexTester and it matches the string correctly, but here, when I try to iterate over the matches, it returns nothing.
I think the tokenizer is really meant to split text on some delimiter, with the delimiter itself not included. Compare with std::regex_token_iterator:
"std::regex_token_iterator is a read-only LegacyForwardIterator that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer)."
Indeed, that is exactly the mode you invoke, as per the docs:
"if submatch is -1, then enumerates all the text sequences that did not match the expression re (that is to performs field splitting)."
(emphasis mine).
So, to fix it:
// `It` is std::string::const_iterator, as declared in the full listing further down.
for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e; ++p) {
    boost::sub_match<It> const& current = *p;
    if (current.matched) {
        std::cout << std::quoted(current.str()) << '\n';
    } else {
        std::cout << "non matching" << '\n';
    }
}
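To see the submatch indices side by side, here is a minimal sketch. It uses std::sregex_token_iterator rather than boost::sregex_token_iterator so that it builds with the standard library alone; I am assuming the two behave alike for these indices, as the two quoted descriptions suggest. The helper name tokens and the comma pattern in the note are made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collect what a regex_token_iterator yields for one submatch index.
//   submatch == -1 : the text between matches (field splitting)
//   submatch ==  0 : each whole match
//   submatch ==  n : capture group n of each match
std::vector<std::string> tokens(std::string const& text, std::regex const& re, int submatch) {
    std::vector<std::string> out;
    std::sregex_token_iterator it(text.begin(), text.end(), re, submatch), end;
    for (; it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

With the pattern \s*,\s* and the input "a, b ,c", index -1 gives the fields a, b and c, while index 0 gives the delimiters themselves. With the question's pattern and input, the single match covers the whole string, so index -1 has nothing to report except one empty piece in front of it.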
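Other observations
All the greedy Kleene stars are a recipe for trouble. You will never find a second match, because the final .* will, by definition, gobble up all the remaining input.
Instead, make them non-greedy (.*?) and/or more precise (e.g. isolating some character set, or requiring non-space characters?).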
boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");
// Or, if you don't want raw string literals:
boost::regex re("\\?\\s+(.*?)\\s*:\\s*(.*?)\\s*\\!\\s*(.*?)");
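The effect of greediness on the number of matches can be checked with a small sketch. It uses std::regex (ECMAScript syntax) instead of Boost.Regex, which assumes equivalent behaviour for these patterns; count_matches is a hypothetical helper, and in the note below the '!' is left unescaped since it is not a special character.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Count the non-overlapping matches of `pattern` found in `text`.
std::size_t count_matches(std::string const& text, std::string const& pattern) {
    std::regex const re(pattern);
    std::sregex_iterator first(text.begin(), text.end(), re), last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

For an input holding two ternaries, such as "? 8==2 : true ! false;? 9==3 : 'book' ! 'library';", the greedy pattern \?\s+(.*)\s*:\s*(.*)\s*!\s*(.*) gives one match that spans the whole input, whereas the same pattern with .*? gives two.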
#include <boost/regex.hpp>
#include <iomanip>
#include <iostream>
#include <string>
int main() {
    using It = std::string::const_iterator;
    std::string const s =
        "? 8==2 : true ! false;"
        "? 9==3 : 'book' ! 'library';";
    boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");
    {
        std::cout << "=== regex_search:\n";
        boost::smatch results;
        for (It b = s.begin(); boost::regex_search(b, s.end(), results, re); b = results[0].end()) {
            std::cout << results.str() << "\n";
            std::cout << "remain: " << std::quoted(std::string(results[0].second, s.end())) << "\n";
        }
    }
    std::cout << "=== token iteration:\n";
    for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e; ++p) {
        boost::sub_match<It> const& current = *p;
        if (current.matched) {
            std::cout << std::quoted(current.str()) << '\n';
        } else {
            std::cout << "non matching" << '\n';
        }
    }
}
Prints
=== regex_search:
? 8==2 : true !
remain: "false;? 9==3 : 'book' ! 'library';"
? 9==3 : 'book' !
remain: "'library';"
=== token iteration:
"? 8==2 : true ! "
"? 9==3 : 'book' ! "
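Note that in this output the third operand (false, 'library') is not part of the match: the trailing lazy (.*?) is already satisfied by the empty string. One way to be "more precise" is sketched below, under the assumption that ';' ends each expression and that ':' and '!' never occur inside an operand. It uses std::regex so it builds with the standard library alone; the function name and the pattern are illustrative, not taken from the listing above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Return the capture groups of every ternary as a flat list:
// condition, true-branch, else-branch, condition, true-branch, ...
std::vector<std::string> ternary_parts(std::string const& text) {
    // Assumption: operands contain no ':' or '!', and ';' terminates each expression.
    static std::regex const re(R"(\?\s*([^:;]+?)\s*:\s*([^!;]+?)\s*!\s*([^;]+))");
    std::vector<std::string> out;
    // A list of submatch indices makes the iterator yield groups 1, 2, 3 per match.
    std::sregex_token_iterator it(text.begin(), text.end(), re, {1, 2, 3}), end;
    for (; it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

Character classes such as [^;]+ stop at the delimiter on their own, so the last group no longer needs to be lazy and actually captures the else-branch.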
Bonus: parser expressions
Instead of abusing regular expressions to do parsing, you could generate a parser, e.g. using Boost Spirit:
#include <boost/spirit/home/x3.hpp>
#include <boost/fusion/adapted.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>
namespace x3 = boost::spirit::x3;
int main() {
    std::string const s =
        "? 8==2 : true ! false;"
        "? 9==3 : 'book' ! 'library';";
    using expression = std::string;
    using ternary = std::tuple<expression, expression, expression>;
    std::vector<ternary> parsed;
    // An expression is a run of printable, non-space characters other than ';'
    auto expr_ = x3::lexeme [+(x3::graph - ';')];
    auto ternary_ = "?" >> expr_ >> ":" >> expr_ >> "!" >> expr_;
    std::cout << "=== parser approach:\n";
    if (x3::phrase_parse(begin(s), end(s), *x3::seek[ ternary_ ], x3::space, parsed)) {
        for (auto [cond, e1, e2] : parsed) {
            std::cout
                << " condition " << std::quoted(cond) << "\n"
                << " true expression " << std::quoted(e1) << "\n"
                << " else expression " << std::quoted(e2) << "\n"
                << "\n";
        }
    } else {
        std::cout << "non matching" << '\n';
    }
}
Prints
=== parser approach:
condition "8==2"
true expression "true"
else expression "false"
condition "9==3"
true expression "'book'"
else expression "'library'"
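This is much more extensible, will easily support recursive grammars, and can synthesize a typed representation of the syntax tree instead of just leaving you with scattered bits of string.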