Regex character class subtraction in C++

Question

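I'm writing a C++ program that needs to take regular expressions defined in an XML Schema file and use them to validate XML data. The problem is that the regex flavor used by XML Schemas does not seem to be directly supported in C++.

For example, there are a couple of special character classes, \i and \c, that are not defined by default, and the XML Schema regex language also supports something called "character class subtraction", which does not seem to be supported in C++ either.

Allowing the \i and \c special character classes is fairly simple: I can look for "\i" or "\c" in the regular expression and replace them with their expanded versions. Getting character class subtraction to work is a much more daunting problem...

For example, this regular expression, which is valid in an XML schema definition, throws an exception in C++ saying it has unbalanced square brackets.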

#include <iostream>
#include <regex>

int main()
{
    try
    {
        // Match any lowercase letter that is not a vowel
        std::regex rx("[a-z-[aeiuo]]");
    }
    catch (const std::regex_error& ex)
    {
        std::cout << ex.what() << std::endl;
    }
}

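How can I get C++ to recognize character class subtraction within a regex? Or better still, is there a way to use the XML Schema flavor of regular expressions directly in C++?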

Answer 1

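Character range subtraction or intersection is not available in any of the grammars supported by std::regex, so you will have to rewrite the expression in one of the supported grammars.

The easiest way is to perform the subtraction yourself and pass the resulting set to std::regex, for instance [bcdfghjklmnpqrstvwxyz] for your example.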

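Another solution is to find either a more featureful regular expression engine or a dedicated XML library that supports XML Schema and its regular expression language.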

Answer 2

Starting from the cppreference examples:

#include <iostream>
#include <regex>
#include <string>
 
void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}
 
int main()
{
    // greedy match: 2 to 4 consecutive lowercase letters that are not vowels
    show_matches("abcdefghi", "(?:(?![aeiou])[a-z]){2,4}");
}

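You can test the regular expression and see its details here.

The non-capturing group (?: ...) is used so that it does not change your group numbering if you use it inside a bigger regular expression.

(?![aeiou]) is a negative lookahead: it succeeds, without consuming input, only if the next character is not in [aeiou]; [a-z] then matches and consumes that letter. Combining the two conditions is equivalent to your character class subtraction.

{2,4} is a quantifier meaning from 2 to 4 repetitions; it could also be + for one or more, or * for zero or more.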

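Edit

After reading the comments on the other answer, I understand that you want to support XML Schema.

The following program shows how to use an ECMAScript regular expression to translate "character class subtraction" into an ECMAScript-compatible form.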

#include <iostream>
#include <regex>
#include <string>
#include <vector>

std::string translated_regex(const std::string &pattern){
    // pattern to identify character class subtraction: [base-[removed]]
    // (raw string literal, so the regex backslashes need no C++ escaping)
    std::regex class_subtraction_re(
       R"(\[((?:\\[\[\]]|[^[\]])*)-\[((?:\\[\[\]]|[^[\]])*)\]\])"
    );
    // translate to ECMAScript: $1 is the base class, $2 the subtracted class
    std::string translated = std::regex_replace(pattern, 
       class_subtraction_re, "(?:(?![$2])[$1])");
    return translated;
}
void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}



int main()
{
    std::vector<std::string> tests = {
        "Some text [0-9-[4]] suffix", 
        "([abcde-[ae]])",
        "[a-z-[aei]]|[A-Z-[OU]] "
    };
    std::string re = translated_regex("[a-z-[aeiou]]{2,4}");
    show_matches("abcdefghi", re);
    
    for(std::string test : tests){
       std::cout << " " << test << '\n' 
        << "   -- " << translated_regex(test) << '\n'; 
    }
    
    return 0;
}

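Edit: recursive and named character classes

The approach above does not work with nested (recursive) character class subtraction, and there is no way to handle recursive substitutions using regular expressions alone. This makes the solution far less straightforward.

The solution has the following levels:

One function scans the regular expression for a [.
When a [ is found, a second function handles the character class, recursing whenever it finds '-['.
The pattern \p{xxxxx} is handled separately to identify named character classes. The named classes are defined in the specialCharClass map; I filled in two examples.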

#include <iostream>
#include <regex>
#include <string>
#include <vector>
#include <map>

std::map<std::string, std::string> specialCharClass = {
    {"IsDigit", "0-9"},
    {"IsBasicLatin", "a-zA-Z"}
    // Feel free to add the character classes you want
};

const std::string getCharClassByName(const std::string &pattern, size_t &pos){
    std::string key;
    while(++pos < pattern.size() && pattern[pos] != '}'){
        key += pattern[pos];
    }
    ++pos;
    return specialCharClass[key];
}

std::string translate_char_class(const std::string &pattern, size_t &pos){
    
    std::string positive;
    std::string negative;
    if(pattern[pos] != '['){
        return "";
    }
    ++pos;
    
    while(pos < pattern.size()){
        if(pattern[pos] == ']'){
            ++pos;
            if(negative.size() != 0){
                return "(?:(?!" + negative + ")[" + positive + "])";
            }else{
                return "[" + positive + "]";
            }
        }else if(pattern[pos] == '\\'){
            if(pos + 3 < pattern.size() && pattern[pos+1] == 'p'){
                positive += getCharClassByName(pattern, pos += 2);
            }else{
                positive += pattern[pos++];
                positive += pattern[pos++];
            }
        }else if(pattern[pos] == '-' && pos + 1 < pattern.size() && pattern[pos+1] == '['){
            if(negative.size() == 0){
                negative = translate_char_class(pattern, ++pos);
            }else{
                // a further subtraction is an alternative inside the lookahead
                negative += '|';
                negative += translate_char_class(pattern, ++pos);
            }
        }else{
            positive += pattern[pos++];
        }
    }
    return '[' + positive; // there is an error pass, forward it
}

std::string translate_regex(const std::string &pattern, size_t pos = 0){
    std::string r;
    while(pos < pattern.size()){
        if(pattern[pos] == '\\'){
            r += pattern[pos++];
            r += pattern[pos++];
        }else if(pattern[pos] == '['){
            r += translate_char_class(pattern, pos);
        }else{
            r += pattern[pos++];
        }
    }
    return r;
}

void show_matches(const std::string& in, const std::string& re)
{
    std::smatch m;
    std::regex_search(in, m, std::regex(re));
    if(m.empty()) {
        std::cout << "input=[" << in << "], regex=[" << re << "]: NO MATCH\n";
    } else {
        std::cout << "input=[" << in << "], regex=[" << re << "]: ";
        std::cout << "prefix=[" << m.prefix() << "] ";
        for(std::size_t n = 0; n < m.size(); ++n)
            std::cout << " m[" << n << "]=[" << m[n] << "] ";
        std::cout << "suffix=[" << m.suffix() << "]\n";
    }
}



int main()
{
    std::vector<std::string> tests = {
        "[a]",
        "[a-z]d",
        "[\\p{IsBasicLatin}-[\\p{IsDigit}-[89]]]",
        "[a-z-[aeiou]]{2,4}",
        "[a-z-[aeiou-[e]]]",
        "Some text [0-9-[4]] suffix", 
        "([abcde-[ae]])",
        "[a-z-[aei]]|[A-Z-[OU]] "
    };
    
    for(std::string test : tests){
       std::cout << " " << test << '\n' 
        << "   -- " << translate_regex(test) << '\n'; 
        // Construct a regex (validates the translated syntax)
        std::regex(translate_regex(test)); 
    }
    std::string re = translate_regex("[a-z-[aeiou-[e]]]{2,10}");
    show_matches("abcdefghi", re);
    
    return 0;
}

Answer 3

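Try using a library function from a library with XPath support, such as xmlregexp in libxml (a C library); it can handle XML Schema regular expressions and apply them to the XML directly.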

http://www.xmlsoft.org/html/libxml-xmlregexp.html#xmlRegexp

----> http://web.mit.edu/outland/share/doc/libxml2-2.4.30/html/libxml-xmlregexp.html <----

An alternative could have been PugiXML (a C++ library; see "What XML parser should I use in C++?"), but I think it does not implement the XML regex functionality...

Answer 4

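OK, after going through the other answers I tried out a few different things and ended up using the xmlRegexp functionality from libxml2.

The xmlRegexp-related functions are very poorly documented, so I figured I would post an example here because others may find it useful: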

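#include <iostream>
#include <libxml/xmlmemory.h>
#include <libxml/xmlregexp.h>
#include <libxml/xmlstring.h>

int main()
{
    LIBXML_TEST_VERSION;

    xmlChar* str = xmlCharStrdup("bcdfg");
    xmlChar* pattern = xmlCharStrdup("[a-z-[aeiou]]+");
    xmlRegexpPtr regex = xmlRegexpCompile(pattern);

    // xmlRegexpExec returns 1 on a match, 0 on no match, negative on error
    if (regex != nullptr && xmlRegexpExec(regex, str) == 1)
    {
        std::cout << "Match!" << std::endl;
    }

    // Release libxml2 objects with libxml2's own deallocators, not free()
    if (regex != nullptr)
        xmlRegFreeRegexp(regex);
    xmlFree(pattern);
    xmlFree(str);
}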

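Output:

Match!

I also tried using XMLString::patternMatch from the Xerces-C++ library, but it did not seem to use an XML Schema compliant regex engine underneath. (Honestly, I have no clue what regex engine it uses underneath; the documentation for it was pretty abysmal and I could not find any examples online, so I just gave up.)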
