Parsing fixed width numbers with boost spirit

Question

I'm using spirit to parse a fortran-like text file filled with fixed-width numbers:

1234 0.000000000000D+001234
1234 7.654321000000D+001234
1234                   1234
1234-7.654321000000D+001234

There are parsers for signed and unsigned integers, but I cannot find a parser for fixed-width real numbers. Can someone help?

Here is what I have (Live On Coliru):

#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <string>
namespace qi = boost::spirit::qi;

struct RECORD {
    uint16_t a{};
    double   b{};
    uint16_t c{};
};

BOOST_FUSION_ADAPT_STRUCT(RECORD, a,b,c)

int main() {
    using It = std::string::const_iterator;
    using namespace qi::labels;

    qi::uint_parser<uint16_t, 10, 4, 4> i4;

    qi::rule<It, double()> X19 = qi::double_ //
        | qi::repeat(19)[' '] >> qi::attr(0.0);

    for (std::string const str : {
             "1234 0.000000000000D+001234",
             "1234 7.654321000000D+001234",
             "1234                   1234",
             "1234-7.654321000000D+001234",
         }) {

        It f = str.cbegin(), l = str.cend();

        RECORD rec;
        if (qi::parse(f, l, (i4 >> X19 >> i4), rec)) {
            std::cout << "{a:" << rec.a << ", b:" << rec.b << ", c:" << rec.c
                      << "}\n";
        } else {
            std::cout << "Parse fail (" << std::quoted(str) << ")\n";
        }
    }
}

This obviously fails to parse most of the records:

Parse fail ("1234 0.000000000000D+001234")
Parse fail ("1234 7.654321000000D+001234")
{a:1234, b:0, c:1234}
Parse fail ("1234-7.654321000000D+001234")

Answer 1

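The mechanism exists, but it is hidden more deeply, because parsing floating-point numbers involves many more details than parsing integers.

qi::double_ (and float_) are actually instances of qi::real_parser<double, qi::real_policies<double> >.

The policies are the key. They govern all the details of which format is accepted.

Here are the RealPolicies Expression Requirements: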

Expression	Semantics
`RP::allow_leading_dot`	Allow leading dot.
`RP::allow_trailing_dot`	Allow trailing dot.
`RP::expect_dot`	Require a dot.
`RP::parse_sign(f, l)`	Parse the prefix sign (e.g. '-'). Return `true` if successful, otherwise `false`.
`RP::parse_n(f, l, n)`	Parse the integer at the left of the decimal point. Return `true` if successful, otherwise `false`. If successful, place the result into n.
`RP::parse_dot(f, l)`	Parse the decimal point. Return `true` if successful, otherwise `false`.
`RP::parse_frac_n(f, l, n, d)`	Parse the fraction after the decimal point. Return `true` if successful, otherwise `false`. If successful, place the result into n and the number of digits into d
`RP::parse_exp(f, l)`	Parse the exponent prefix (e.g. 'e'). Return `true` if successful, otherwise `false`.
`RP::parse_exp_n(f, l, n)`	Parse the actual exponent. Return `true` if successful, otherwise `false`. If successful, place the result into n.
`RP::parse_nan(f, l, n)`	Parse a NaN. Return `true` if successful, otherwise `false`. If successful, place the result into n.
`RP::parse_inf(f, l, n)`	Parse an Inf. Return `true` if successful, otherwise `false`. If successful, place the result into n.

Let's implement your policy:

namespace policies {
    /* mandatory sign (or space) fixed widths, 'D+' or 'D-' exponent leader */
    template <typename T, int IDigits, int FDigits, int EDigits = 2>
    struct fixed_widths_D : qi::strict_ureal_policies<T> {
        template <typename It> static bool parse_sign(It& f, It const& l);

        template <typename It, typename Attr>
        static bool parse_n(It& f, It const& l, Attr& a);

        template <typename It> static bool parse_exp(It& f, It const& l);

        template <typename It>
        static bool parse_exp_n(It& f, It const& l, int& a);

        template <typename It, typename Attr>
        static bool parse_frac_n(It& f, It const& l, Attr& a, int& n);
    };
} // namespace policies

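Notes:

I kept the attribute type generic.
I also base the implementation on the existing strict_ureal_policies to reduce effort. The base class does not support signs and requires a mandatory decimal separator ('.'), which is what makes it "strict": it rejects plain integral numbers.
Your question's format expects 1 digit for the integral part, 12 digits for the fraction and 2 for the exponent, but I don't hardcode these, so the policy can be reused for other fixed-width formats (IDigits, FDigits, EDigits).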
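To make the width arithmetic in the last note concrete, here is a small sketch that adds up the columns one field occupies under this policy. The helper name field_width is hypothetical and not part of Spirit. It assumes one sign column, one '.', one 'D' and one exponent-sign column, which is what the overrides that follow implement:

```cpp
// Hypothetical helper: total number of columns occupied by one field parsed
// with fixed_widths_D<T, IDigits, FDigits, EDigits>.
constexpr int field_width(int idigits, int fdigits, int edigits = 2)
{
    return 1          // sign column: ' ', '+' or '-'
         + idigits    // integer digits
         + 1          // decimal point '.'
         + fdigits    // fraction digits
         + 1          // exponent leader 'D'
         + 1          // exponent sign: '+' or '-'
         + edigits;   // exponent digits
}

static_assert(field_width(1, 12, 2) == 19, "the 19-column field in the question");
static_assert(field_width(1, 64, 2) == 71, "a 71-column field with 64 fraction digits");
```

The same number is the count of blanks that the "all spaces" alternative of the rule has to match.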

Let's go through our overrides one by one:

`bool parse_sign(f, l)`

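The format is fixed-width, so we want to accept

a leading space or '+' for positive numbers
a leading '-' for negative numbers

That way the sign always takes one input character: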

template <typename It> static bool parse_sign(It& f, It const&l)
{
    if (f != l) {
        switch (*f) {
        case '+':
        case ' ': ++f; break;
        case '-': ++f; return true;
        }
    }
    return false;
}
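To see what this override does to the iterator, the sketch below repeats the same logic as a free function and reports both the return value and the number of characters consumed. The names parse_sign_sketch, SignProbe and probe_sign are hypothetical. It uses only the standard library, so it can be tried without Spirit:

```cpp
#include <string>

// Same logic as the parse_sign override, written as a free function so it
// can be tried without Spirit (the name parse_sign_sketch is hypothetical).
template <typename It> bool parse_sign_sketch(It& f, It const& l)
{
    if (f != l) {
        switch (*f) {
        case '+':
        case ' ': ++f; break;        // positive: consume, report "not negative"
        case '-': ++f; return true;  // negative: consume, report "negative"
        }
    }
    return false;                    // anything else: nothing is consumed
}

struct SignProbe {
    bool negative; // return value of the sign parser
    int  consumed; // number of characters it advanced over
};

SignProbe probe_sign(std::string const& field)
{
    auto       f = field.cbegin();
    auto const l = field.cend();
    bool const negative = parse_sign_sketch(f, l);
    return {negative, static_cast<int>(f - field.cbegin())};
}
```

For " 7.0", "+7.0" and "-7.0" exactly one character is consumed. For a field that starts with a digit, such as "7.0", the function returns false and consumes nothing, so whether such input should be rejected is worth a dedicated test case.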

`bool parse_n(f, l, Attr& a)`

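The simplest part: we allow only a single-digit (IDigits) unsigned integer part before the separator. Luckily, integer parsing is relatively common and trivial: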

template <typename It, typename Attr>
static bool parse_n(It& f, It const& l, Attr& a)
{
    return qi::extract_uint<Attr, 10, IDigits, IDigits, false, true>::call(f, l, a);
}

`bool parse_exp(f, l)`

Equally trivial: we always require the 'D':

template <typename It> static bool parse_exp(It& f, It const& l)
{
    if (f == l || *f != 'D')
        return false;
    ++f;
    return true;
}

`bool parse_exp_n(f, l, int& a)`

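As for the exponent, we want it to be fixed-width, which means the sign is mandatory. So, before extracting the signed integer of width 2 (EDigits), we make sure a sign is present: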

template <typename It>
static bool parse_exp_n(It& f, It const& l, int& a)
{
    if (f == l || !(*f == '+' || *f == '-'))
        return false;
    return qi::extract_int<int, 10, EDigits, EDigits>::call(f, l, a);
}
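The rule "mandatory sign, then exactly EDigits digits" can also be sketched with the standard library alone. The function parse_fixed_exponent below is a hypothetical stand-in that works on a std::string and a position instead of Spirit iterators. It is not how qi::extract_int is implemented:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: reads a mandatory sign followed by exactly `edigits`
// decimal digits from `s`, starting at `pos`. On success, stores the value
// in `out` and advances `pos`; on failure, leaves `pos` untouched.
bool parse_fixed_exponent(std::string const& s, std::size_t& pos, int edigits, int& out)
{
    std::size_t const need = static_cast<std::size_t>(edigits) + 1; // sign + digits
    if (pos > s.size() || s.size() - pos < need)
        return false;                       // not enough input left
    if (s[pos] != '+' && s[pos] != '-')
        return false;                       // the sign column is mandatory

    int value = 0;
    for (std::size_t i = 1; i < need; ++i) {
        unsigned char const c = static_cast<unsigned char>(s[pos + i]);
        if (!std::isdigit(c))
            return false;                   // fewer than `edigits` digits
        value = value * 10 + (c - '0');
    }
    out = (s[pos] == '-') ? -value : value;
    pos += need;                            // consume only on success
    return true;
}
```

With the input "+031234" and edigits = 2, this yields 3 and leaves the position at the start of "1234", which is what lets the next fixed-width column follow without a separator.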

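`bool parse_frac_n(f, l, Attr& a, int& n)`

This is the meat of the problem, and also the reason to build on the existing parsers. The fractional digits could be treated as an integer, but there are two issues: leading zeroes are significant, and the total number of digits might exceed the capacity of any integral type we choose.

So we use a "trick": we parse an unsigned integer, but ignore any excess precision that doesn't fit. In fact we only care about the number of digits. We then check that this number is the expected one: FDigits.

Then, we hand off to the base class implementation to actually compute the resulting value for any generic number type T (that satisfies the minimum requirements).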

template <typename It, typename Attr>
static bool parse_frac_n(It& f, It const& l, Attr& a, int& n)
{
    It savef = f;

    if (qi::extract_uint<Attr, 10, FDigits, FDigits, true, true>::call(f, l, a)) {
        n = static_cast<int>(std::distance(savef, f));
        return n == FDigits;
    }
    return false;
}
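Why the digit count matters can be shown with a small standard-library sketch. FractionDigits, read_fraction and fraction_value are hypothetical names. The sketch uses a 64-bit accumulator, so it only illustrates fractions of up to 19 digits and does not reproduce Spirit's handling of overflow digits:

```cpp
#include <cctype>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <string>

struct FractionDigits {
    std::uint64_t value; // the digits read as an integer (leading zeros vanish here)
    int           count; // how many digits were read (leading zeros still counted)
};

// Hypothetical helper: reads exactly `fdigits` decimal digits from the start of `s`.
bool read_fraction(std::string const& s, int fdigits, FractionDigits& out)
{
    if (fdigits < 0 || s.size() < static_cast<std::size_t>(fdigits))
        return false;
    std::uint64_t value = 0;
    for (int i = 0; i < fdigits; ++i) {
        unsigned char const c = static_cast<unsigned char>(s[i]);
        if (!std::isdigit(c))
            return false;                 // fewer digits than the fixed width
        value = value * 10 + (c - '0');   // wraps around beyond the 64-bit range
    }
    out = {value, fdigits};
    return true;
}

// The fraction is the integer scaled down by 10^count, not by its own length.
double fraction_value(FractionDigits const& f)
{
    return static_cast<double>(f.value) / std::pow(10.0, f.count);
}
```

The digit strings "065432100000" and "654321000000" both contain 12 digits, but the first stands for 0.0654321 and the second for 0.654321. Only the pair of integer value and digit count tells them apart.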

Summarizing

As you can see, by standing on the shoulders of existing, tested code, we are already done and can parse our numbers just fine:

template <typename T>
using X19_type = qi::real_parser<T, policies::fixed_widths_D<T, 1, 12, 2>>;

Now your code runs as expected (Live On Coliru):

template <typename T>
using X19_type = qi::real_parser<T, policies::fixed_widths_D<T, 1, 12, 2>>;

int main() {
    using It = std::string::const_iterator;
    using namespace qi::labels;

    qi::uint_parser<uint16_t, 10, 4, 4> i4;
    X19_type<double>                    x19;

    qi::rule<It, double()> X19 = x19 //
        | qi::repeat(19)[' '] >> qi::attr(0.0);

    for (std::string const str : {
             "1234                   1234",
             "1234 0.000000000000D+001234",
             "1234 7.065432100000D+001234",
             "1234-7.006543210000D+001234",
             "1234 0.065432100000D+031234",
             "1234 0.065432100000D-301234",
         }) {

        It f = str.cbegin(), l = str.cend();

        RECORD rec;
        if (qi::parse(f, l, (i4 >> X19 >> i4), rec)) {
            std::cout << "{a:" << rec.a << ", b:" << std::setprecision(12)
                      << rec.b << ", c:" << rec.c << "}\n";
        } else {
            std::cout << "Parse fail (" << std::quoted(str) << ")\n";
        }
    }
}

Prints

{a:1234, b:0, c:1234}
{a:1234, b:0, c:1234}
{a:1234, b:7.0654321, c:1234}
{a:1234, b:-7.00654321, c:1234}
{a:1234, b:65.4321, c:1234}
{a:1234, b:6.54321e-32, c:1234}

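Decimals

Now, it is possible to instantiate this parser with precisions that exceed the precision of double. And converting decimal numbers to an inexact binary representation always causes issues. To showcase how the choice of a generic T already caters for this, let's instantiate with a decimal type that allows 64 significant decimal fractional digits:

Live On Coliru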

using Decimal = boost::multiprecision::cpp_dec_float_100;

struct RECORD {
    uint16_t a{};
    Decimal  b{};
    uint16_t c{};
};

template <typename T>
using X71_type = qi::real_parser<T, policies::fixed_widths_D<T, 1, 64, 2>>;

int main() {
    using It = std::string::const_iterator;
    using namespace qi::labels;

    qi::uint_parser<uint16_t, 10, 4, 4> i4;
    X71_type<Decimal>                   x71;

    qi::rule<It, Decimal()> X71 = x71 //
        | qi::repeat(71)[' '] >> qi::attr(0.0);

    for (std::string const str : {
             "1234                                                                       6789",
             "2345 0.0000000000000000000000000000000000000000000000000000000000000000D+006789",
             "3456 7.0000000000000000000000000000000000000000000000000000000000654321D+006789",
             "4567-7.0000000000000000000000000000000000000000000000000000000000654321D+006789",
             "5678 0.0000000000000000000000000000000000000000000000000000000000654321D+036789",
             "6789 0.0000000000000000000000000000000000000000000000000000000000654321D-306789",
         }) {

        It f = str.cbegin(), l = str.cend();

        RECORD rec;
        if (qi::parse(f, l, (i4 >> X71 >> i4), rec)) {
            std::cout << "{a:" << rec.a << ", b:" << std::setprecision(65)
                      << rec.b << ", c:" << rec.c << "}\n";
        } else {
            std::cout << "Parse fail (" << std::quoted(str) << ")\n";
        }
    }
}

Prints

{a:2345, b:0, c:6789}
{a:3456, b:7.0000000000000000000000000000000000000000000000000000000000654321, c:6789}
{a:4567, b:-7.0000000000000000000000000000000000000000000000000000000000654321, c:6789}
{a:5678, b:6.54321e-56, c:6789}
{a:6789, b:6.54321e-89, c:6789}

Compare how using a binary long double representation would have lost accuracy here:

{a:2345, b:0, c:6789}
{a:3456, b:7, c:6789}
{a:4567, b:-7, c:6789}
{a:5678, b:6.5432100000000000002913506043764438647482181234694313277925965188e-56, c:6789}
{a:6789, b:6.5432100000000000000601529073044049029207066886931600941449474131e-89, c:6789}
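The loss is easy to reproduce with the built-in types alone. The sketch below (the function names are hypothetical) adds the tail of the test value, 654321 in fraction digits 59 to 64 and therefore 6.54321e-59, to 7 and checks whether the sum is still distinguishable from 7:

```cpp
#include <limits>

// Returns true when adding the tail 6.54321e-59 to 7 leaves exactly 7,
// i.e. when type T cannot hold all 65 significant digits of the test value.
template <typename T> bool tail_is_absorbed()
{
    T const seven = static_cast<T>(7);
    T const tail  = static_cast<T>(6.54321e-59L); // 654321 in fraction digits 59..64
    T const sum   = seven + tail;
    return sum == seven;
}

// Number of decimal digits T can represent without change.
template <typename T> constexpr int decimal_digits()
{
    return std::numeric_limits<T>::digits10;
}
```

On common platforms decimal_digits<double>() is 15, and the value for long double is platform dependent. Both are far below the 65 digits needed, which matches the b:7 lines in the long double output above.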

Bonus take: Optionals

In the current RECORD, missing doubles are silently taken to be 0.0. That may not be the best choice:

struct RECORD {
    uint16_t          a{};
    optional<Decimal> b{};
    uint16_t          c{};
};

// ...

qi::rule<It, optional<Decimal>()> X71 = x71 //
    | qi::repeat(71)[' '];

Now the output is (Live On Coliru):

{a:1234, b:--, c:6789}
{a:2345, b: 0, c:6789}
{a:3456, b: 7.0000000000000000000000000000000000000000000000000000000000654321, c:6789}
{a:4567, b: -7.0000000000000000000000000000000000000000000000000000000000654321, c:6789}
{a:5678, b: 6.54321e-56, c:6789}
{a:6789, b: 6.54321e-89, c:6789}
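The difference between "missing" and "zero" can be sketched without Spirit as well. The example below assumes C++17 std::optional<double>, not whichever optional type the answer's code uses. Both function names are hypothetical, and the numeric conversion is the lax std::stod approach, not the fixed-width policy. It mimics the "--" shown above for an empty value:

```cpp
#include <algorithm>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: an all-blank field means "no value", anything else is
// converted by rewriting the 'D' exponent leader to 'e' for std::stod.
std::optional<double> read_optional_field(std::string field)
{
    bool const blank = std::all_of(field.begin(), field.end(),
                                   [](char c) { return c == ' '; });
    if (blank)
        return std::nullopt;
    std::replace(field.begin(), field.end(), 'D', 'e');
    return std::stod(field);
}

// Formats like the output above: "--" when empty, otherwise a blank and the value.
std::string show(std::optional<double> const& v)
{
    if (!v)
        return "--";
    std::ostringstream os;
    os << ' ' << *v;
    return os.str();
}
```

A blank 19-column field then prints as "--", while " 0.000000000000D+00" prints as " 0", so the two cases stay distinguishable downstream.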

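Summary / Add Unit Tests!

That's a lot, but possibly not all you need.

Keep in mind that you still need proper unit tests for e.g. X19_type. Think of all the edge cases you may encounter and want to accept or reject:

I have not changed any of the base policies dealing with Inf or NaN, so you might want to close those gaps
You might actually want to accept " 3.141 ", " .999999999999D+0 " etc.?

All of these are pretty simple changes to the policies, but, as you know, code without tests is broken.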
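One possible shape for such tests is a table of inputs with the expected verdict. The sketch below is an assumption-laden stand-in: looks_like_x19 is a hand-written, deliberately strict validator of the 19-column layout, not the Spirit parser, and the expectations encode one possible choice of what to accept. Real tests would run qi::parse with X19_type on the same table and may need different expectations:

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical strict validator: sign column, 1 digit, '.', 12 digits,
// 'D', exponent sign, 2 digits (19 columns in total).
bool looks_like_x19(std::string const& s)
{
    if (s.size() != 19) return false;
    auto digit = [&s](std::size_t i) {
        return std::isdigit(static_cast<unsigned char>(s[i])) != 0;
    };
    if (s[0] != ' ' && s[0] != '+' && s[0] != '-') return false;
    if (!digit(1) || s[2] != '.') return false;
    for (std::size_t i = 3; i < 15; ++i)
        if (!digit(i)) return false;
    if (s[15] != 'D' || (s[16] != '+' && s[16] != '-')) return false;
    return digit(17) && digit(18);
}

struct Case {
    std::string input;
    bool        expected;
};

std::vector<Case> edge_cases()
{
    return {
        {" 7.654321000000D+00", true},
        {"-7.654321000000D+00", true},
        {"+7.654321000000D-30", true},
        {" 7.654321000000E+00", false}, // 'E' instead of 'D'
        {" 7.654321000000D 00", false}, // missing exponent sign
        {"7.654321000000D+000", false}, // no sign column
        {" 3.141             ", false}, // no exponent, padded with blanks
        {" .9999999999999D+00", false}, // no integer digit
        {" 7.65432100000D+00", false},  // one fraction digit short
    };
}

// Returns how many cases disagree with their expected verdict.
int count_mismatches()
{
    int mismatches = 0;
    for (Case const& c : edge_cases())
        if (looks_like_x19(c.input) != c.expected)
            ++mismatches;
    return mismatches;
}
```

Keeping the cases in a table makes it cheap to add a row whenever a new edge case turns up in real data.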
