Boost regex is not matching the same as several regex websites

Question

I'm trying to parse a string using a regex so that, when I iterate over its matches, I get only the results. My goal is to find all

#include <stuff.h>
#include "stuff.h"

but ignore them if they are part of a comment block, for example

/*
     #include "stuff.h"
*/

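Here is my function. It reads a file, converts it to a string, parses the string into tokens, and then iterates over those tokens to print them all. Given the lines above, the tokens should contain stuff.h, stuff.h.

The problem I'm running into is with this regex: https://regex101.com/r/tQFDr4/2

The question is: is my regex wrong, or is it something in the function?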

void header_check::filename(const boost::filesystem::directory_iterator& itr)  //function takes directory path                     
{                                                                                                   
    std::string delimeter ("#include.+(?:<|\\")(.+)(?:>|\\")(?![^*\/]* (?:\*+(?!\/)[^*\/]*|\/+(?!\*)[^*\/]*)*\*\/)");//regex storage                                                                      
    boost::regex regx(delimeter,boost::regex::perl);//set up regex                                                  
    boost::smatch match;                                                                              
    std::ifstream file (itr->path().string().c_str());//stream to transfer to stream
    std::string content((std::istreambuf_iterator<char>(file)),    
    std::istreambuf_iterator<char>());//string to be parsed
    boost::sregex_token_iterator iter (content.begin(),content.end(), regx, 0);    //creates a match for each search
    boost::sregex_token_iterator end;                                                                 
    for (int attempt =1; iter != end; ++iter) {                                                       
        std::cout<< *iter<<" include #"<<attempt++<<"\n";  //prints results                                             
    }                                                       
}

Answer 1

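First, you have an extra space character in the regex.

But the real problem is that you treat the whole input as a single line. If you set that flag on regex101, you'll see that it shows the same results.

In regex, all open-ended quantifiers are greedy by default, so you have to be more specific. You start with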

#include.+

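That is already the end of it, because .+ simply matches everything (up to and including the last line). Your only reprieve is that backtracking will happen so that at least one "tail" of the regex matches, but everything in between gets swallowed up, because .+ literally asks for one or as many as possible of any character!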
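
To see the greedy behaviour in isolation, here is a minimal sketch. It uses std::regex from the standard library as a stand-in for Boost.Regex so that it has no dependencies (that substitution is an assumption of this sketch, not what the question uses). It writes [\s\S] where the question's pattern has a dot, to mimic a dot that also matches newlines. The names count_matches, first_match, two_includes and greedy are made up for the illustration.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts non-overlapping matches of `pattern` in `input`.
std::size_t count_matches(const std::string& input, const std::string& pattern) {
    const std::regex re(pattern);
    auto first = std::sregex_iterator(input.begin(), input.end(), re);
    return static_cast<std::size_t>(std::distance(first, std::sregex_iterator()));
}

// Hypothetical helper: text of the first whole match, or "" if there is none.
std::string first_match(const std::string& input, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(input, m, re) ? m.str(0) : std::string();
}

// Two include directives on two separate lines.
const std::string two_includes = "#include <a.h>\n#include \"b.h\"\n";

// Simplified greedy pattern: [\s\S]+ plays the role of a dot that crosses lines.
const std::string greedy = R"re(#include[\s\S]+[<"]([\s\S]+)[>"])re";
```

Here count_matches(two_includes, greedy) returns 1, and first_match(two_includes, greedy) returns everything from the first #include through the closing quote of "b.h": the two directives have been merged into a single match.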

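Attempting to fix...

Make the .+ a \s+ or so. In fact, it needs to be \s*, because #include<iostream> is perfectly valid C++.
Next, you cannot match the delimiters the way you did, because you would happily match #include <iostream" or #include "iostream>. Again, the .* needs to be restricted. In this case the closing delimiter is fully determined (because the opening delimiter completely predicts it), so you can use a non-greedy Kleene star: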
```
#include\s*("(.*?)"|<(.*?)>)
```
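
A minimal sketch of how that pattern could be used to pull out the header names. As an assumption to keep the example dependency-free, std::regex stands in for Boost.Regex, and extract_includes is a hypothetical helper name. Note that the pattern on its own still matches includes that sit inside comments.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the header name of every #include directive
// found by the restricted pattern. Comments are NOT taken into account.
std::vector<std::string> extract_includes(const std::string& content) {
    // Group 2 captures a "quoted" name, group 3 an <angled> name.
    static const std::regex re(R"re(#include\s*("(.*?)"|<(.*?)>))re");
    std::vector<std::string> names;
    for (std::sregex_iterator it(content.begin(), content.end(), re), end;
         it != end; ++it) {
        const std::smatch& m = *it;
        names.push_back(m[2].matched ? m[2].str() : m[3].str());
    }
    return names;
}
```

Because the closing delimiter is tied to the opening one, a line such as #include <iostream" yields no match at all instead of a bogus name.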

However

The real problem is that you're trying to parse a full grammar with... regexen¹.

All I can say is

Could you not?!

Here's a suggestion using Boost Spirit:

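auto comment_ = space                               // whitespace
              | "//" >> *(char_ - eol)              // line comment, up to end of line
              | "/*" >> *(char_ - "*/") >> "*/"     // block comment, including its terminator
              ;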
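
For readers who do not know Spirit: a skipper is simply something the parser runs between tokens to discard input. A rough, dependency-free sketch of the same idea follows. The function name skip_blanks_and_comments is hypothetical, and this version consumes a block comment together with its closing */.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the first position at or after `pos` that is
// neither whitespace nor part of a // or /* */ comment.
std::size_t skip_blanks_and_comments(const std::string& s, std::size_t pos) {
    while (pos < s.size()) {
        if (std::isspace(static_cast<unsigned char>(s[pos]))) {
            ++pos;                                    // whitespace
        } else if (s.compare(pos, 2, "//") == 0) {
            pos = s.find('\n', pos);                  // line comment: stop at end of line
            if (pos == std::string::npos) pos = s.size();
        } else if (s.compare(pos, 2, "/*") == 0) {
            const std::size_t close = s.find("*/", pos + 2);
            // Unterminated block comment: treat the rest of the input as comment.
            pos = (close == std::string::npos) ? s.size() : close + 2;
        } else {
            break;                                    // real content starts here
        }
    }
    return pos;
}
```

Everything the function steps over is invisible to the rest of the parser, which is why an #include inside a comment is never seen.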

Whoa. That's a breath of fresh air. It's almost like programming, instead of magic and praying!

Now for the real meat:

auto include_ = "#include" >> (
        '<' >> *~char_('>') >> '>'
      | '"' >> *~char_('"') >> '"'
      );

Of course, you'll want the proof of the pudding too:

std::string header;
bool ok = phrase_parse(content.begin(), content.end(), seek[include_], comment_, header);

std::cout << "matched: " << std::boolalpha << ok << ": " << header << "\n";

This parses a single header and prints (Live On Coliru):

matched: true: iostream

Extending this to all non-commented includes is a breeze, right?

std::vector<std::string> headers;
bool ok = phrase_parse(content.begin(), content.end(), *seek[include_], comment_, headers);

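Oops. Two bugs. First, we should not be matching our own grammar (the demo parses its own source file, which contains the "#include" string literal). The best way would be to make sure we are at the start of a line, but that would complicate the grammar. For now, let's forbid names from spanning multiple lines: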

auto name_ = rule<struct _, std::string> {} = lexeme[
      '<' >> *(char_ - '>' - eol) >> '>'
    | '"' >> *(char_ - '"' - eol) >> '"'
];

auto include_ = "#include" >> name_;

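That helps a bit. The other bug is actually trickier, and I think it is a library bug. The symptom is that all includes are reported as active. It turns out that seek does not correctly use the skipper after the first match.² For now, let's work around it: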

bool ok = phrase_parse(content.begin(), content.end(), *(omit[*(char_ - include_)] >> include_) , comment_, headers);

It is a bit less elegant, but it does work:

Full Monty

Full demo (Live On Coliru):

// #include <boost/graph/adjacency_list.hpp>

#include "iostream"

#include<fstream> /*
#include <boost/filesystem.hpp>
#include <boost/regex.hpp> */ //
#include <boost/spirit/home/x3.hpp>


void filename(std::string const& fname)  // takes the path of the file to scan
{
    using namespace boost::spirit::x3;

    auto comment_ = space 
          | "//" >> *(char_ - eol) 
          | "/*" >> *(char_ - "*/") >> "*/"
          ;

    auto name_ = rule<struct _, std::string> {} = lexeme[
          '<' >> *(char_ - '>' - eol) >> '>'
        | '"' >> *(char_ - '"' - eol) >> '"'
    ];

    auto include_ = "#include" >> name_;

    auto const content = [&]() -> std::string {
        std::ifstream file(fname);
        return { std::istreambuf_iterator<char>{file}, {} };//string to be parsed
    }();

    std::vector<std::string> headers;
    /*bool ok = */phrase_parse(content.begin(), content.end(), *(omit[*(char_ - include_)] >> include_) , comment_, headers);

    std::cout << "matched: " << headers.size() << " active includes:\n";
    for (auto& header : headers)
        std::cout << " - " << header << "\n";
}

int main() {
    filename("main.cpp");
}

Prints

matched: 3 active includes:
 - iostream
 - fstream
 - boost/spirit/home/x3.hpp
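
For comparison, the same strategy (at every position discard blanks and comments, try to match an include, otherwise drop one character) can be sketched with the standard library only. This is a hand-written stand-in and not the Spirit parser above. The names skip_ignored and active_includes are hypothetical, and like the grammar it does not understand string literals beyond refusing names that span a line break.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Skips whitespace, // comments and /* */ comments starting at `pos`.
std::size_t skip_ignored(const std::string& s, std::size_t pos) {
    while (pos < s.size()) {
        if (std::isspace(static_cast<unsigned char>(s[pos]))) {
            ++pos;
        } else if (s.compare(pos, 2, "//") == 0) {
            pos = s.find('\n', pos);
            if (pos == std::string::npos) pos = s.size();
        } else if (s.compare(pos, 2, "/*") == 0) {
            const std::size_t close = s.find("*/", pos + 2);
            pos = (close == std::string::npos) ? s.size() : close + 2;
        } else {
            break;
        }
    }
    return pos;
}

// Collects the names of all #include directives that are not commented out.
std::vector<std::string> active_includes(const std::string& s) {
    static const std::string keyword = "#include";
    std::vector<std::string> headers;
    std::size_t pos = skip_ignored(s, 0);
    while (pos < s.size()) {
        bool matched = false;
        if (s.compare(pos, keyword.size(), keyword) == 0) {
            const std::size_t open = skip_ignored(s, pos + keyword.size());
            if (open < s.size() && (s[open] == '<' || s[open] == '"')) {
                const char close = (s[open] == '<') ? '>' : '"';
                // The name must end on the same line it started on.
                const char stops[] = {close, '\n', '\r', '\0'};
                const std::size_t end = s.find_first_of(stops, open + 1);
                if (end != std::string::npos && s[end] == close) {
                    headers.push_back(s.substr(open + 1, end - open - 1));
                    pos = end + 1;
                    matched = true;
                }
            }
        }
        if (!matched) ++pos;           // no include here: drop one character
        pos = skip_ignored(s, pos);    // then discard blanks and comments again
    }
    return headers;
}
```

Fed the include lines at the top of the demo, active_includes returns the same three names: iostream, fstream and boost/spirit/home/x3.hpp.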

¹ And it's not even in Perl6, in which case you could be forgiven.

² I'll try to fix/report this tomorrow.
