Nom parser that skips escaped terminator characters

Question

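I have checked the other SO answers about nom parser combinators, but this question does not seem to have been asked yet.

I am trying to parse delimited regular expressions. They will always be delimited as /...../, possibly with modifiers at the end (which are out of scope for all the data I need to parse right now). However, if there is an escaped \/ in the middle of the string, my parser stops early at the first /, even when it is preceded by a \.

I have this parser: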

use nom::bytes::complete::{tag, take_until};
use nom::{combinator::map_res, sequence::tuple, IResult};
use regex::Regex;

pub fn regex(input: &str) -> IResult<&str, Regex> {
    map_res(
        tuple((tag("/"), take_until("/"), tag("/"))),
        |(_, re, _)| Regex::new(re),
    )(input)
}

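Naturally, take_until stops at the first / without noticing that the preceding character is a \. I have looked at peek, recognize, map, and a whole bunch of other things, but I keep coming up short. I feel like I really want take_until("/") to have some kind of escape awareness, or something along those lines. In any case, I am using map_res to hand the result off to Rust's regex crate to do the parsing.

I also tried something similar with the escaped combinator, but the examples were a bit unclear and I could not get it to work: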

pub fn regex(input: &str) -> IResult<&str, Regex> {
    map_res(
        tuple((
            tag("/"),
            escaped(many1(anychar), '\\', one_of(r"/")),
            tag("/"),
        )),
        |(_, re, _)| {
            println!("mapres {}", re);
            Regex::new(re)
        },
    )(input)
}

My test cases look like this (the .unwrap().as_str() is just a small workaround, because regex::Regex does not implement PartialEq):

#[cfg(test)]
mod tests {
    use super::regex;
    use super::Regex;
    #[test]
    fn test_parse_regex_simple() {
        assert_eq!(
            Regex::new(r#"hello world"#).unwrap().as_str(),
            regex("/hello world/").unwrap().1.as_str()
        );
    }
    #[test]
    fn test_parse_regex_with_escaped_forwardslash() {
        assert_eq!(
            Regex::new(r#"hello /world"#).unwrap().as_str(),
            regex(r"/hello \/world/").unwrap().1.as_str(),
        );
    }
}

Answer 1

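The parser passed as the first argument to escaped() should parse characters that are not the escape character, and it should stop at the right character. many1(anychar) satisfies neither of these conditions.

Instead, you should call it like this: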

escaped(none_of(r"\/"), '\\', one_of(r"/"))

Or, as the whole expression:

map_res(
    tuple((
        tag("/"),
        escaped(none_of(r"\/"), '\\', one_of(r"/")),
        tag("/"),
    )),
    |(_, re, _)| Regex::new(re),
)(input)

But that doesn't help, because Regex's escape sequences don't include /. So you need to strip the escape characters. Luckily, escaped_transform() is there to help you:

map_res(
    tuple((
        tag("/"),
        escaped_transform(none_of(r"\/"), '\\', one_of(r"/")),
        tag("/"),
    )),
    |(_, re, _)| Regex::new(&re), // We need a little `&` here because `escaped_transform()` returns a `String` but `Regex::new()` wants `&str`
)(input)
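
To make the behaviour of that parser easier to picture, here is a minimal sketch of roughly the same logic written by hand with only the standard library. It is not how nom implements escaped_transform(), and parse_delimited is a hypothetical name used only for this illustration. It assumes the only supported escape is \/ and, like one_of("/"), it rejects any other escaped character. One known difference from the nom version is that this sketch accepts an empty body.

```rust
/// Hypothetical helper, not part of nom: parses `/body/` and returns the
/// unescaped body plus whatever follows the closing slash.
fn parse_delimited(input: &str) -> Option<(String, &str)> {
    // The literal must start with the opening delimiter.
    let rest = input.strip_prefix('/')?;
    let mut body = String::new();
    let mut chars = rest.char_indices();
    while let Some((i, c)) = chars.next() {
        match c {
            // An unescaped slash closes the literal.
            '/' => return Some((body, &rest[i + 1..])),
            // A backslash must be followed by a slash, which is kept without
            // the backslash; anything else is rejected, as one_of("/") would.
            '\\' => match chars.next() {
                Some((_, '/')) => body.push('/'),
                _ => return None,
            },
            // Every other character is copied through unchanged.
            other => body.push(other),
        }
    }
    // Ran out of input before finding the closing delimiter.
    None
}
```

Calling parse_delimited(r"/hello \/world/") yields the body hello /world with nothing left over, which is the string you would then pass to Regex::new().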

Answer 2

The accepted answer by Chayim Friedman is correct, but I was able to extend it to also handle \w, \d and other such escapes. It is just Chayim's escaped_transform version with more cases:


pub fn regex(input: &str) -> IResult<&str, Regex> {
    map_res(
        delimited(
            tag("/"),
            escaped_transform(
                none_of(r"\/"),
                '\\',
                alt((
                    value(r"/", tag("/")),
                    value(r"\d", tag("d")),
                    value(r"\W", tag("W")),
                    value(r"\w", tag("w")),
                    value(r"\b", tag("b")),
                    value(r"\B", tag("B")),
                )),
            ),
            tag("/"),
        ),
        |re| Regex::new(&re),
    )(input)
}

Note that this list is not complete either, but https://docs.rs/regex/1.5.6/regex/#escape-sequences gives a complete set of escapes, and https://github.com/Geal/nom/blob/main/examples/string.rs gives a more detailed explanation of how to handle \u{....}-style escape sequences.
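
If enumerating every escape feels brittle, one possible alternative is to rewrite only \/ and pass every other backslash pair through untouched, leaving the regex crate to decide whether it is valid. The sketch below shows that idea in plain standard-library Rust. unescape_delimiter is a hypothetical helper name, it is not part of nom or regex, and it assumes the body has already been cut out from between the delimiters.

```rust
/// Hypothetical helper: turns `\/` into `/` and leaves every other
/// backslash sequence (such as `\d` or `\\`) exactly as written.
fn unescape_delimiter(body: &str) -> String {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Ordinary character: copy it through.
            out.push(c);
            continue;
        }
        match chars.next() {
            // Escaped delimiter: drop the backslash, keep the slash.
            Some('/') => out.push('/'),
            // Any other escape: keep the backslash and the character.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            // Trailing lone backslash: keep it so the regex crate can report it.
            None => out.push('\\'),
        }
    }
    out
}
```

With this, the body \d+\/\w becomes \d+/\w. The same pass-through idea could presumably be expressed as a fallback branch in the alt above, but that is an untested assumption.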

regex

parsing

parser-combinators

rust