Spirit X3: Custom number parser yields unexpected leading zero in the result
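I am writing a parser for long numbers. It recognizes a valid number (one that may not be representable by any built-in integer type) and stores the string as-is. But the result contains an unexpected leading "0".
The parser simply recognizes numbers of the form 0xHHHHHH, 0bBBBBBBB, 0OOOOOOO, or DDDDDDDDD.
To keep the numeric prefix in the result, I use x3::string instead of x3::lit: the former has an attribute of type String, while the latter's attribute is unused.
Here is a link to the code: https://wandbox.org/permlink/E8mOpCcH3Svqb3FJ
In case the link has expired, here is the same code: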
#include <boost/spirit/home/x3.hpp>
#include <iostream>

namespace x3 = boost::spirit::x3;
namespace fusion = boost::fusion;

using x3::_val;
using x3::_attr;
using x3::_where;
using fusion::at_c;

x3::rule<struct LongHexInt, std::string> const long_hex_int = "long_hex_int";

auto const long_hex_int_def = x3::lexeme[
        (x3::string("0") >> +x3::char_('0', '7'))
        | ((x3::digit - '0') >> *x3::digit >> 'u')
        | ((x3::string("0x") | x3::string("0X")) >> +x3::xdigit)
        | ((x3::string("0b") | x3::string("0B")) >> +x3::char_('0', '1'))
];

BOOST_SPIRIT_DEFINE(long_hex_int);

int main() {
    std::string input = R"__(0x12345678ABCDEF)__";
    std::string output;
    if (x3::parse(input.begin(), input.end(), long_hex_int, output)) {
        std::cout << output;
    }
}
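The output is 00x12345678ABCDEF instead of 0x12345678ABCDEF, and I don't know where the extra '0' comes from.
After removing the first alternative on line 15, (x3::string("0") >> +x3::char_('0', '7')), the code produces the expected output. I don't know why, though. Is this a bug, or my mistake?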
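This happens because the failure of one alternative does not roll back the attribute. To force a rollback, wrap each alternative in its own rule, like this: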
#include <boost/spirit/home/x3.hpp>
#include <iostream>

namespace x3 = boost::spirit::x3;
namespace fusion = boost::fusion;

using x3::_val;
using x3::_attr;
using x3::_where;
using x3::rule;
using fusion::at_c;

x3::rule<struct LongHexInt, std::string> const long_hex_int = "long_hex_int";

template <typename T>
auto as = [](auto p) {
    return rule<struct _, T> {} = p;
};

auto const long_hex_int_def =
    x3::lexeme[as<std::string>(x3::string("0") >> +x3::char_('0', '7'))
               | as<std::string>((x3::digit - '0') >> *x3::digit >> 'u')
               | as<std::string>((x3::string("0x") | x3::string("0X")) >> +x3::xdigit)
               | as<std::string>((x3::string("0b") | x3::string("0B")) >> +x3::char_('0', '1'))]
    ;

BOOST_SPIRIT_DEFINE(long_hex_int);

int main() {
    std::string input = R"__(0x12345678ABCDEF)__";
    std::string output;
    if (x3::parse(input.begin(), input.end(), long_hex_int, output)) {
        std::cout << output;
    }
}
(It doesn't seem to work with Boost 1.70, though! Maybe a bug?)
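To see the mechanism in isolation, here is a minimal sketch in plain C++ with no Spirit involved. The names octal_seq, hex_seq, parse_shared and parse_isolated are made up for this illustration; they only mimic the way a sequence appends to a container attribute before it knows whether it will succeed, and they cover just two of the four alternatives.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Toy model, NOT Spirit X3: each "parser" appends what it matches to `attr`
// and returns false on failure without undoing its earlier appends.

// '0' followed by one or more octal digits (mirrors the first alternative).
inline bool octal_seq(const std::string& in, std::string& attr) {
    if (in.empty() || in[0] != '0') return false;
    attr += '0';  // side effect happens before the outcome is known
    std::size_t i = 1;
    while (i < in.size() && in[i] >= '0' && in[i] <= '7') attr += in[i++];
    return i > 1;  // at least one digit must follow the '0'
}

// "0x" or "0X" followed by one or more hex digits.
inline bool hex_seq(const std::string& in, std::string& attr) {
    if (in.size() < 2 || in[0] != '0' || (in[1] != 'x' && in[1] != 'X')) return false;
    attr.append(in, 0, 2);
    std::size_t i = 2;
    while (i < in.size() && std::isxdigit(static_cast<unsigned char>(in[i]))) attr += in[i++];
    return i > 2;
}

// Both branches share one attribute: a failed branch leaves its residue behind.
inline std::string parse_shared(const std::string& in) {
    std::string attr;
    if (octal_seq(in, attr) || hex_seq(in, attr)) return attr;
    return std::string();
}

// Each branch gets a fresh attribute that is handed back only on success.
inline std::string parse_isolated(const std::string& in) {
    std::string first;
    if (octal_seq(in, first)) return first;
    std::string second;
    if (hex_seq(in, second)) return second;
    return std::string();
}
```

With the question's input, parse_shared("0x12345678ABCDEF") yields 00x12345678ABCDEF, while parse_isolated yields 0x12345678ABCDEF. Giving each branch its own attribute is the effect that wrapping every alternative in its own rule is meant to achieve.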
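After trying @IgorR.'s code and doing some debugging, I found that Spirit X3 removed some attribute copies starting with Boost 1.70. As a result, when parsing into a mutable container object, automatic rollback is no longer available, because rolling back requires a copy of the attribute.
Boost Spirit does have an attribute transformation mechanism that is invoked before and after a rule is parsed. The default implementation just returns a reference, so customizing the attribute transformation for std::string to work on a separate object solves the problem. Basically you need a struct like this:
@user1681377 pointed out that the code below contained an unnecessary attribute copy. It has now been edited to use only move operations; there is still some overhead, but much less.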
template<>
struct boost::spirit::x3::default_transform_attribute<std::string, std::string> {
    typedef std::string type;
    static std::string pre(std::string& val) { return std::move(val); }
    static void post(std::string& old_val, std::string&& new_val) {
        old_val = std::move(new_val);
    }
};
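With that, the problem is solved. See https://wandbox.org/permlink/MLYLbSeXBBjDqATN
By the way, @sehe thinks that implementing this as a hack is not a good idea. I agree, but for the current situation it may be the simplest way. I wonder whether transform_attribute is meant to be a customization point.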
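The pre/post idea can be illustrated without Spirit. The sketch below is a toy model under stated assumptions: pass_reference, work_on_copy, match_ab, run_rule and attribute_after are hypothetical names, not X3 types, and the hooks only imitate the shape of the struct above. It copies in pre rather than moving, because the contents of a moved-from std::string are unspecified and would make the failure case hard to demonstrate.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-ins for the two behaviours; these are NOT Spirit X3 types.

// Like the default described above: the rule body writes into the caller's attribute.
struct pass_reference {
    using type = std::string&;
    static type pre(std::string& val) { return val; }
    static void post(std::string&, std::string&) {}
};

// The rule body writes into a private copy that is committed only on success.
struct work_on_copy {
    using type = std::string;
    static type pre(std::string& val) { return val; }
    static void post(std::string& old_val, std::string& new_val) {
        old_val = std::move(new_val);
    }
};

// Toy rule body: matches "ab", appending each character as soon as it is seen.
inline bool match_ab(const std::string& in, std::string& attr) {
    if (in.size() < 1 || in[0] != 'a') return false;
    attr += 'a';  // appended before the outcome is known
    if (in.size() < 2 || in[1] != 'b') return false;
    attr += 'b';
    return true;
}

// Runs the toy rule body between the pre and post hooks of Transform.
template <class Transform>
bool run_rule(const std::string& in, std::string& attr) {
    typename Transform::type local = Transform::pre(attr);
    if (!match_ab(in, local)) return false;  // post is skipped on failure
    Transform::post(attr, local);
    return true;
}

// Returns the attribute after parsing `in`, starting from the content "kept:".
template <class Transform>
std::string attribute_after(const std::string& in) {
    std::string attr = "kept:";
    run_rule<Transform>(in, attr);
    return attr;
}
```

For the failing input "ac", attribute_after<pass_reference> returns kept:a (the partial match leaks), while attribute_after<work_on_copy> returns kept: unchanged. Both return kept:ab for the matching input "ab".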
I would personally simplify. The common part of the number formats can be written as:
auto const common
    = x3::no_case["0x"] >> x3::hex
    | x3::no_case["0b"] >> x3::bin
    | &x3::lit('0') >> x3::oct
    | x3::uint_ >> 'u'
    ;
This uses the built-in unsigned integer parsers from https://www.boost.org/doc/libs/1_71_0/libs/spirit/doc/html/spirit/qi/reference/numeric/uint.html
Now you can parse it into a string representation:
auto const long_hex_int
    = x3::rule<struct long_hex_int_, std::string> {"long_hex_int"}
    = x3::lexeme [ x3::raw [ common ] ];
But you could also parse directly into an integer type:
auto const unsigned_literal
    = x3::rule<struct unsigned_literal_, uint32_t> {"unsigned_literal"}
    = x3::lexeme [ common ];
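To make the ordering of the alternatives and the overflow behaviour concrete, here is a hand-rolled approximation in plain C++. It is a sketch, not how X3 is implemented: digit_value, digits, has_prefix, common and parse_literal are names chosen for this illustration, and the assumption that a numeric parser fails on overflow is taken from the test results reported in this answer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// Value of a digit character, or -1 if it is not a hexadecimal digit.
inline int digit_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Consumes one or more digits of `base` starting at `pos`.
// Fails, leaving pos and out untouched, on no digits or on overflow of T.
template <typename T>
bool digits(const std::string& in, std::size_t& pos, unsigned base, T& out) {
    assert(base >= 2 && base <= 16);
    T value = 0;
    std::size_t i = pos;
    for (; i < in.size(); ++i) {
        const int d = digit_value(in[i]);
        if (d < 0 || static_cast<unsigned>(d) >= base) break;
        const T digit = static_cast<T>(d);
        if (value > (std::numeric_limits<T>::max() - digit) / base) return false;
        value = static_cast<T>(value * base + digit);
    }
    if (i == pos) return false;
    pos = i;
    out = value;
    return true;
}

// True if in[pos..] starts with '0' followed by `lower` in either case.
inline bool has_prefix(const std::string& in, std::size_t pos, char lower) {
    return pos + 1 < in.size() && in[pos] == '0' &&
           (in[pos + 1] == lower || in[pos + 1] == lower - 'a' + 'A');
}

// Ordered choice over the four formats; the first branch that matches wins.
template <typename T>
bool common(const std::string& in, std::size_t& pos, T& out) {
    std::size_t p = pos + 2;
    if (has_prefix(in, pos, 'x') && digits(in, p, 16, out)) { pos = p; return true; }
    p = pos + 2;
    if (has_prefix(in, pos, 'b') && digits(in, p, 2, out)) { pos = p; return true; }
    p = pos;  // the lookahead for '0' consumes nothing, so the '0' is an octal digit
    if (p < in.size() && in[p] == '0' && digits(in, p, 8, out)) { pos = p; return true; }
    p = pos;
    T value = 0;
    if (digits(in, p, 10, value) && p < in.size() && in[p] == 'u') {
        out = value;
        pos = p + 1;
        return true;
    }
    return false;
}

// Succeeds only if the whole input is one literal (like `common >> x3::eoi`).
// `out` is meaningful only when true is returned.
template <typename T>
bool parse_literal(const std::string& in, T& out) {
    std::size_t pos = 0;
    return common(in, pos, out) && pos == in.size();
}
```

Note how "0x" is rejected: the hex branch finds no digits, the octal branch then matches just the "0", and the end-of-input check fails on the remaining "x". With a 32-bit target, "0x12345678ABCDEF" is rejected the same way after the hex branch overflows.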
In fact, here is a live demo with test cases:
for (std::string const input : {
        "0",
        "00",
        "010",
        "0x0", "0b0", "0x10", "0b10", "0x010", "0b010",
        "0X0", "0B0", "0X10", "0B10", "0X010", "0B010",
        // fails:
        "", "0x", "0b", "0x12345678ABCDEF" })
{
    std::string str;
    uint32_t num;
    if (x3::parse(input.begin(), input.end(), long_hex_int >> x3::eoi, str)) {
        std::cout << std::quoted(input) << " -> " << std::quoted(str) << "\n";
        if (x3::parse(input.begin(), input.end(), unsigned_literal, num)) {
            std::cout << " numerical: " << std::hex << "0x" << num << " (" << std::dec << num << ")\n";
        }
    } else {
        std::cout << std::quoted(input) << " -> FAILED\n";
    }
}
Prints:
"0" -> "0"
numerical: 0x0 (0)
"00" -> "00"
numerical: 0x0 (0)
"010" -> "010"
numerical: 0x8 (8)
"0x0" -> "0x0"
numerical: 0x0 (0)
"0b0" -> "0b0"
numerical: 0x0 (0)
"0x10" -> "0x10"
numerical: 0x10 (16)
"0b10" -> "0b10"
numerical: 0x2 (2)
"0x010" -> "0x010"
numerical: 0x10 (16)
"0b010" -> "0b010"
numerical: 0x2 (2)
"0X0" -> "0X0"
numerical: 0x0 (0)
"0B0" -> "0B0"
numerical: 0x0 (0)
"0X10" -> "0X10"
numerical: 0x10 (16)
"0B10" -> "0B10"
numerical: 0x2 (2)
"0X010" -> "0X010"
numerical: 0x10 (16)
"0B010" -> "0B010"
numerical: 0x2 (2)
"" -> FAILED
"0x" -> FAILED
"0b" -> FAILED
"0x12345678ABCDEF" -> FAILED
Extending to 64 bits
More precision should lead to more successes, right?
It is only slightly more annoying to write:
template <typename T = uint64_t>
auto const common
    = x3::no_case["0x"] >> x3::uint_parser<T, 16>{}
    | x3::no_case["0b"] >> x3::uint_parser<T, 2>{}
    | &x3::lit('0') >> x3::uint_parser<T, 8>{}
    | x3::uint_parser<T, 10>{} >> 'u'
    ;
But the rest stays the same, and your 64-bit example passes:
"0x12345678ABCDEF" -> 0x12345678abcdef (5124095577148911)
But 131! fails to parse, for obvious reasons:
"847158069087882051098456875815279568163352087665474498775849754305766436915303927682164623187034167333264599970492141556534816949699515865660644961729169613882287309922474300878212776434073600000000000000000000000000000000" -> FAILED
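Bonus: arbitrary precision
131! requires roughly log2(131!) ≅ 737 bits... but you don't need to fall back to dragging strings around. Just drop in uint1024_t (or checked_uint1024_t) from Boost Multiprecision and you're done:
using Number = boost::multiprecision::/*checked_*/uint1024_t;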
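As a quick sanity check on the bit count quoted above, the size of n! can be estimated with the standard library alone. This is a floating-point sketch, and factorial_bits is a name invented for the illustration; it relies on the identity ln(n!) = lgamma(n + 1).

```cpp
#include <cassert>
#include <cmath>

// Estimated number of bits needed to store n! in an unsigned integer:
// floor(log2(n!)) + 1, with ln(n!) obtained as lgamma(n + 1).
// n <= 2 is excluded because n! is then an exact power of two, where
// floating-point rounding could tip the floor either way.
inline int factorial_bits(unsigned n) {
    assert(n >= 3);
    const double log2_factorial = std::lgamma(n + 1.0) / std::log(2.0);
    return static_cast<int>(std::floor(log2_factorial)) + 1;
}
```

factorial_bits(131) gives 738 (log2(131!) is about 737.2), which is too wide for 64 bits but fits comfortably in 1024. For comparison, 20! needs 62 bits and 21! needs 66, so 20! is the largest factorial a uint64_t can hold.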
Then:
Number num;
if (x3::parse(input.begin(), input.end(), unsigned_literal<Number> >> x3::eoi, num)) {
    std::cout << std::quoted(input) << " -> " << std::hex << "0x" << num << " (" << std::dec << num << ")\n";
} else {
    std::cout << std::quoted(input) << " -> FAILED\n";
}
Note that nothing changed other than uint64_t -> Number. Output:
"0" -> 0x0 (0)
"00" -> 0x0 (0)
"010" -> 0x8 (8)
"0x0" -> 0x0 (0)
"0b0" -> 0x0 (0)
"0x10" -> 0x10 (16)
"0b10" -> 0x2 (2)
"0x010" -> 0x10 (16)
"0b010" -> 0x2 (2)
"0X0" -> 0x0 (0)
"0B0" -> 0x0 (0)
"0X10" -> 0x10 (16)
"0B10" -> 0x2 (2)
"0X010" -> 0x10 (16)
"0B010" -> 0x2 (2)
"0x12345678ABCDEF" -> 0x12345678ABCDEF (5124095577148911)
"847158069087882051098456875815279568163352087665474498775849754305766436915303927682164623187034167333264599970492141556534816949699515865660644961729169613882287309922474300878212776434073600000000000000000000000000000000u" -> 0x257F7A37BE2FBDD9980A97214F27DDC1E2FFA53ABBA836FFBE8AD1B9792E5D47A3C573A1B9C81D264662E41005A5D7432ADDBE44E3DDF12142D2B845FC9B184288345AD466B86A6685FE87AE100000000000000000000000000000000 (847158069087882051098456875815279568163352087665474498775849754305766436915303927682164623187034167333264599970492141556534816949699515865660644961729169613882287309922474300878212776434073600000000000000000000000000000000)