Boost Spirit X3 tokenizer with annotation does not work
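I recently tried to implement the simplest possible tokenizer using Boost Spirit X3. The challenge I am facing now is retrieving the position of each token in the input stream.
There is a good annotation tutorial on the official site: https://www.boost.org/doc/libs/develop/libs/spirit/doc/x3/html/spirit_x3/tutorials/annotation.html. However, it has a limitation: it basically parses a list of identical (homogeneous) entities, which is often not the case in real life.
So I am trying to create a tokenizer with 2 elements: whitespace (a sequence of spaces) and single-line comments (starting with // and continuing to the end of the line).
See the minimal example code at the end of the question.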
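However, I get an error when I try to retrieve the position of any token. After some debugging, I found that the annotate_position::on_success handler deduces the type T as boost::spirit::x3::unused_type, but I don't know why.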
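So, I have several questions:
- What am I doing wrong? (I know this is not the Stack Overflow style, but I have been struggling with this for several days and have no one to consult.) I have been trying to store the actual comment text as a string in the SingleLineComment and Whitespace classes, but without success. I suspect this is because the comment and whitespace strings are omitted in the parser; is there a way around this?
- What is the best-practice approach for parsing heterogeneous structures?
- Should I use some specialized facility for this task (i.e. should I use grammar classes or spirit::lex, but in an X3 version)?
- Are there any tokenizer examples (I was looking at "Getting started guide for Boost.Spirit?", but it is somewhat outdated)? As it stands, I think the documentation is not extensive enough to start writing something right away, and I am considering writing the tokenizer by hand. What is advertised as a simple "get started" library turns out to be a pile of complex, barely documented templates that I don't fully understand.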
Here is a minimal example code snippet:
#include <string>
#include <iostream>
#include <functional>
#include <vector>
#include <optional>
#include <variant>
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/x3/support/ast/position_tagged.hpp>

using namespace std;
namespace x3 = boost::spirit::x3;

struct position_cache_tag;

// copy paste from boost documentation example
struct annotate_position
{
    template <typename T, typename Iterator, typename Context>
    inline void on_success(Iterator const &first, Iterator const &last, T &ast, Context const &context)
    {
        auto &position_cache = x3::get<position_cache_tag>(context).get();
        position_cache.annotate(ast, first, last);
    }
};

struct SingleLineComment : public x3::position_tagged
{
    // no need to store actual comment string,
    // since it is position tagged and
    // we can then find the corresponding
    // iterators afterwards, is this right?
};

struct Whitespace : public x3::position_tagged
{
    // same reasoning
};

// here can be another token types (e.g. MultilineComment, integer, identifier etc.)
struct Token : public x3::position_tagged
{
    // unites SingleLineComment and Whitespace
    // into a single Token class
    enum class Type
    {
        SingleLineComment,
        Whitespace
    };
    std::optional<Type> type; // type field should be set by semantic action
    // std::optional is a kind of insurance that type will be set
    std::optional<std::variant<SingleLineComment, Whitespace>> data;
    // same reasoning for std::optional
    // this field might be needed for more complex
    // tokens, which hold additional data
};

// unique on success hook classes
struct SingleLineCommentHook : public annotate_position
{
};
struct WhitespaceHook : public annotate_position
{
};
struct TokenHook : public annotate_position
{
};

// rules
const x3::rule<SingleLineCommentHook, SingleLineComment> singleLineComment = "single line comment";
const x3::rule<WhitespaceHook, Whitespace> whitespace = "whitespace";
const x3::rule<TokenHook, Token> token = "token";

// rule definitions
const auto singleLineComment_def = x3::lit("//") >> x3::omit[*(x3::char_ - '\n')];
const auto whitespace_def = x3::omit[+x3::ascii::space];
BOOST_SPIRIT_DEFINE(singleLineComment, whitespace);

auto _setSingleLineComment = [](const auto &context) {
    x3::_val(context).type = Token::Type::SingleLineComment;
    x3::_val(context).data = x3::_attr(context);
};
auto _setWhitespace = [](const auto &context) {
    x3::_val(context).type = Token::Type::Whitespace;
    x3::_val(context).data = x3::_attr(context);
};

const auto token_def = (singleLineComment[_setSingleLineComment] | whitespace[_setWhitespace]);
BOOST_SPIRIT_DEFINE(token);

int main()
{
    // copy paste from boost documentation example
    using iterator_type = std::string::const_iterator;
    using position_cache = boost::spirit::x3::position_cache<std::vector<iterator_type>>;

    std::string content = R"(// first single line comment
// second single line comment
)";
    // expect 4 tokens: comment -> whitespace -> comment -> whitespace
    position_cache positions{content.cbegin(), content.cend()};
    std::vector<Token> tokens;
    const auto parser = x3::with<position_cache_tag>(std::ref(positions))[*token];

    auto start = content.cbegin();
    auto success = x3::phrase_parse(start, content.cend(), parser, x3::eps(false), tokens);
    success &= (start == content.cend());

    cout << boolalpha << success << endl;
    cout << "Found " << tokens.size() << " tokens" << endl;
    for (auto &token : tokens)
        cout << (token.type.value() == Token::Type::SingleLineComment ? "comment" : "space") << endl;

    // all good till this point
    // now I want to get a position
    // the following throws
    auto pos = positions.position_of(tokens.front());
}
Thanks for reading, looking forward to any replies!
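It seems that on_success handling does not happen when semantic actions are involved.
In fact, you are redundantly tagging both the AST nodes and the variant.
You could already get the correct result for the first token with, e.g.
auto pos = positions.position_of(
    std::get<SingleLineComment>(*tokens.front().data));
This is obviously not very convenient, because it requires a static type switch: you have to name the alternative that is currently stored.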
Here is a simpler approach:
#include <iostream>
#include <iomanip>
#include <string>
#include <string_view>
#include <typeinfo>
#include <variant>
#include <vector>
#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/x3/support/ast/position_tagged.hpp>

namespace x3 = boost::spirit::x3;

struct SingleLineComment{};
struct Whitespace       {};

using Variant = std::variant<SingleLineComment, Whitespace>;

// only the token itself is position tagged
struct Token : Variant, x3::position_tagged {
    using Variant::Variant;
};

namespace {
    struct position_cache_tag;

    namespace Parser {
        struct annotate_position {
            template <typename T, typename Iterator, typename Context>
            inline void on_success(Iterator first, Iterator last, T &ast, Context const &context) const {
                auto &position_cache = x3::get<position_cache_tag>(context);
                position_cache.annotate(ast, first, last);
            }
        };

        // unique on success hook classes
        template <typename> struct Hook {}; // no annotate_position mix-in
        template <> struct Hook<Token> : annotate_position {};

        // helper that wraps a parser expression in a rule with attribute type T
        template <typename T>
        static auto constexpr as = [](auto p, char const* name = typeid(decltype(p)).name()) {
            return x3::rule<Hook<T>, T> {name} = p;
        };

        // rule definitions
        auto singleLineComment = as<SingleLineComment>("//" >> x3::omit[*(x3::char_ - x3::eol)]);
        auto whitespace        = as<Whitespace>       (x3::omit[+x3::ascii::space]);
        auto token             = as<Token>            (singleLineComment | whitespace, "token");
    }
}

int main() {
    using It             = std::string::const_iterator;
    using position_cache = x3::position_cache<std::vector<It>>;

    std::string const content = R"(// first single line comment
// second single line comment
)";
    position_cache positions{content.begin(), content.end()};

    auto parser = x3::with<position_cache_tag>(positions)[*Parser::token];

    std::vector<Token> tokens;
    if (parse(begin(content), end(content), parser >> x3::eoi, tokens)) {
        std::cout << "Found " << tokens.size() << " tokens" << std::endl;

        for (auto& token : tokens) {
            auto pos = positions.position_of(token);
            std::cout
                << (token.index() ? "space" : "comment") << "\t"
                << std::quoted(std::string_view(&*pos.begin(), pos.size()))
                << std::endl;
        }
    }
}
Prints:
Found 4 tokens
comment "// first single line comment"
space "
"
comment "// second single line comment"
space "
"