How to parse the escape element '\' and the unicode character '\u' using boost regex in C++

Question

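I am parsing a text file with boost regex in C++ and looking for '\' characters in it. The file also contains some unicode '\u' escapes. Is there a way to tell '\' and '\u' apart? Here is the content of the test.txt file I am parsing: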

"ID": "\u01FE234DA - this is id ",
"speed": "96\/78",
"avg": "\u01FE234DA avg"

Here is what I tried:

#include <boost/regex.hpp>
#include <string>
#include <iostream>
#include <fstream>

using namespace std;
const int BUFSIZE = 500;

int main(int argc, char** argv) {

    if (argc < 2) {
        cout << "Pass the input file" << endl;
        exit(0);
    }

   boost::regex re("\\\\+");
   string file(argv[1]);
   char buf[BUFSIZE];

   boost::regex uni("\\\\u+");


   ifstream in(file.c_str());
   while (!in.eof())
   {
      in.getline(buf, BUFSIZE-1);
      if (boost::regex_search(buf, re))
      {
          cout << buf << endl;
          cout << "(\\) found" << endl;
          if (boost::regex_search(buf, uni)) {
              cout << buf << endl;
              cout << "unicode found" << endl;

          }

      }

   }
}

When I run the code above, it prints the following:

"ID": "\u01FE234DA - this is id ",
 (\) found
"ID": "\u01FE234DA - this is id ",
 unicode found
"speed": "96\/78",
 (\) found
"avg": "\u01FE234DA avg"
 (\) found
 "avg": "\u01FE234DA avg"
 unicode found

What I want instead is the following:

 "ID": "\u01FE234DA - this is id ",
 unicode found
"speed": "96\/78",
 (\) found
 "avg": "\u01FE234DA avg"
 (\) and unicode found

I think the code cannot tell '\' and '\u' apart, but I'm not sure what to change or where.

Answer 1

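Try using [^u] in your first regex to match any character that is not u. Note that each literal backslash needs four backslashes in the C++ string literal: two for the C++ escape, and two again for the regex escape.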

boost::regex re("\\\\[^u]");  // matches \ followed by any character other than u
boost::regex uni("\\\\u");    // matches \u
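
As a minimal sketch of the two-regex approach, the helpers below classify a single line. The function names are hypothetical and not part of the original answer. The sketch uses std::regex from the standard library so that it builds without Boost; boost::regex and boost::regex_search should accept the same patterns. Raw string literals are used so that each regex backslash is written only once.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true if the line contains a backslash followed by 'u'.
bool has_unicode_escape(const std::string& line) {
    static const std::regex uni(R"(\\u)");       // regex: \\u
    return std::regex_search(line, uni);
}

// Hypothetical helper: true if the line contains a backslash followed by
// any character other than 'u'.
bool has_plain_backslash(const std::string& line) {
    static const std::regex plain(R"(\\[^u])");  // regex: \\[^u]
    return std::regex_search(line, plain);
}
```

One limitation of [^u] is that it requires some character after the backslash, so a backslash at the very end of a line is not matched. A negative lookahead such as \\(?!u) is one way around that, if it matters for your input.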

Better still, use a single regex.

boost::regex re("\\\\(u)?"); // matches \ with or without a following u

Then check whether the sub-match m[1] captured the 'u':

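boost::cmatch m;  // buf is a char array, so the match type is cmatch
if (boost::regex_search(buf, m, re)) {
    if (m[1].matched) {
        // unicode: the backslash is followed by u
    }
    else {
        // not unicode: a plain backslash
    }
}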
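
Note that regex_search only reports the first backslash on a line. The sketch below shows one way to visit every backslash with the single-regex pattern and count both kinds. The EscapeCounts struct and the count_escapes function are hypothetical names introduced for illustration. As an assumption for portability, it uses std::regex and std::sregex_iterator; boost::regex and boost::sregex_iterator are intended to be used the same way.

```cpp
#include <regex>
#include <string>

// Hypothetical result type, introduced for illustration.
struct EscapeCounts {
    int unicode = 0;  // backslashes followed by 'u'
    int plain = 0;    // all other backslashes
};

// Hypothetical helper: walks every match of \\(u)? in the line and
// classifies it by whether capture group 1 took part in the match.
EscapeCounts count_escapes(const std::string& line) {
    static const std::regex re(R"(\\(u)?)");
    EscapeCounts counts;
    for (std::sregex_iterator it(line.begin(), line.end(), re), end;
         it != end; ++it) {
        if ((*it)[1].matched) {
            ++counts.unicode;
        } else {
            ++counts.plain;
        }
    }
    return counts;
}
```

With these counts, a caller can print "unicode found", "(\) found", or both for each line, depending on which counters are non-zero.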

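Regular expressions are a good fit for this kind of pattern matching. They look more complicated at first, but once you are used to them they are easier to maintain and less error-prone than iterating over a string one character at a time.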

c++

boost-regex