Tolerate certain characters in RegEx

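I'm writing a message-formatting parser that can, among other things, parse links. This particular case requires parsing a link of the form <url|linkname> and replacing that text with just linkname. The problem is that url and linkname may or may not contain either of two marker characters, anywhere and in any order (at most one of each). I want to match the pattern but keep these "invalid" characters. For linkname the problem solves itself, because that part of the pattern is simply ([^\n]+), but the url fragment is matched by a much more complex pattern, specifically the URL validation pattern from is.js. Manually modifying the whole pattern to tolerate the two markers everywhere is no easy task, and I need the pattern to preserve those characters because they are used for tracking purposes (so I can't simply strip them with a .replace() call before matching).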

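If this kind of matching is not possible, is there some automated way to reliably modify the regular expression so that it accepts up to two marker characters (a character class of the markers quantified with {0,2}) between every character match, adds the markers to every [chars] class, and so on?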
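To make the idea of a tolerated gap between character matches concrete, here is a minimal sketch for a plain literal string rather than a full regular expression. The marker characters U+0001 and U+0002 and the names `tolerantLiteral` and `escapeChar` are assumptions chosen for illustration; they do not come from the post.

```javascript
// Assumed stand-ins for the two marker characters (not taken from the post).
const START = "\u0001";
const END = "\u0002";

// Escape one character so it is matched literally inside a regex source.
const escapeChar = (ch) => ch.replace(/[.*+?^${}()|[\]\\\/]/g, "\\$&");

// Join the escaped characters with a gap that accepts up to two markers.
function tolerantLiteral(text) {
    const gap = "[" + START + END + "]{0,2}";
    return Array.from(text).map(escapeChar).join(gap);
}

const tolerantHost = new RegExp("^" + tolerantLiteral("google.com") + "$");

console.log(tolerantHost.test("google.com"));        // true
console.log(tolerantHost.test("goo\u0001gle.com"));  // true
console.log(tolerantHost.test("\u0001google.com"));  // false, no gap before the first character
```

A real pattern also contains groups, classes, and quantifiers, which is why doing the same thing for the is.js pattern needs more care than a simple join.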

This is the url pattern taken from is.js:

/(?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?/i

This pattern has been adapted for my purposes and the <url|linkname> format as follows:

let namedUrlRegex = /<((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)\|([^\n]+)>/ig;

Working code is here: JSFiddle

Illustrative examples (... stands for the namedUrlRegex variable above; $2 is the capture group that captures linkname):

Current behavior:
// the first input contains a marker character inside the URL; it is not visible here
"<google.com|Google>".replace(..., "$2") // "<google.com|Google>" WRONG
"<google.com|Google>".replace(..., "$2") // "Google"              CORRECT
"<not_a_url|Google>".replace(..., "$2") // "<not_a_url|Google>"   CORRECT

Expected behavior:
"<google.com|Google>".replace(..., "$2") // "Google" (note that the marker is gone)
"<google.com|Google>".replace(..., "$2") // "Google"
"<not_a_url|Google>".replace(..., "$2") // "<not_a_url|Google>"

Note: the same rules apply to either marker on its own, to both markers next to each other, and to both markers separated by other characters, in either order.

Context: This is used to normalize a string from a WYSIWYG editor to the length and content it will display as, while preserving the location of the current selection (denoted by two marker characters so that it can be restored after parsing). If the "caret" is removed completely (e.g. if the cursor was in the URL of a link), the whole string is selected instead. Everything works as expected, except when the selection starts or ends in the url fragment.

Edit for clarification: I only want to change a segment of a string if it follows the format <url|linkname>, where url matches the URL pattern (tolerating the two markers) and linkname consists of non-\n characters. If this condition is not met within a <...|...> string, it should be left unaltered, as in the not_a_url example above.
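The rule in this clarification can be written down as a small executable sketch. It uses a different technique from a marker-tolerant regex: it matches <...|...> loosely and validates the url part inside a replacer callback with the markers stripped only for that check. The marker characters (U+0001, U+0002), the name `replaceNamedLinks`, and the shortened `simpleUrl` pattern are all assumptions made for brevity; `simpleUrl` is a stand-in, not the is.js pattern.

```javascript
// Assumed stand-ins for the two marker characters; the post does not show them.
const MARKERS = /[\u0001\u0002]/g;

// Deliberately simplified stand-in for the is.js URL pattern (hypothetical).
const simpleUrl = /^(?:(?:https?|ftp):\/\/)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?::\d{2,5})?(?:\/\S*)?$/i;

// Loosely match <something|linkname>, then validate "something" without markers.
function replaceNamedLinks(text) {
    return text.replace(/<([^<|\n]+)\|([^\n>]+)>/g, (match, url, linkname) =>
        // Keep only linkname for a valid URL; otherwise leave the segment unaltered.
        simpleUrl.test(url.replace(MARKERS, "")) ? linkname : match
    );
}

console.log(replaceNamedLinks("<google.com|Google>"));        // "Google"
console.log(replaceNamedLinks("<goo\u0001gle.com|Google>"));  // "Google"
console.log(replaceNamedLinks("<not_a_url|Google>"));         // "<not_a_url|Google>"
```

As in the expected behavior listed above, a marker inside the url part is dropped together with the URL, while a marker inside linkname survives because linkname is returned untouched.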

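I ended up writing a regex that matches every "symbol" in an expression. One quirk is that it expects the :, = and ! characters to be escaped, even outside of (?:...), (?=...) and (?!...) expressions. This is handled by escaping them before processing.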

Fiddle

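// one "symbol": an escape sequence, a [...] class, a word character, any other
// non-special character, or a standalone {...}; then an optional quantifier
let r = /(\\.|\[.+?\]|\w|[^\\/\[\]\^$\(\)\?\*\+\{\}\|\+\:\=\!]|(\{.+?\}))(?:((?:\{.+?\}|\+|\*)\??)|\??)/g;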

let url = /((?:(?:https?|ftp):\/\/)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:\/\S*)?)/

function tolerate(regex, insert) {
    let first = true;
        // convert to string
    return regex.toString().replace(/\/(.+)\//, "$1").
        // escape :=! (a backslash is added in front of each one)
        replace(/((?:^|[^\\])\\(?:\\)*\(\?|[^?])([:=!]+)/g, (m, g1, g2) => g1 + g2.split("").map(c => "\\" + c).join("")).
        // substitute string
        replace(r, function(m, g1, g2, g3, g4) {
            // g2 = {...} multiplier (to prevent matching digits as symbols)
            if (g2) return m;
            // g3 = multiplier after symbol (must wrap in parenthesis to preserve behavior)
            if (g3) return "(?:" + insert + g1 + ")" + g3;
            // prevent matching tolerated characters at beginning, remove to change this behavior
            if (first) {
                first = false;
                return m;
            }
            // insert the insert
            return insert + m;
        }
    );
}

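// The marker characters in the insert string were lost from this listing;
// "\x01?\x02?" (each marker optional) is an assumed stand-in.
alert(tolerate(url, "\x01?\x02?"));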
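The g3 branch in tolerate wraps a quantified symbol in a non-capturing group before re-applying the quantifier. A minimal sketch of why that matters, assuming an insert that tolerates a single optional U+0001 marker (the marker and the variable names are illustrative only):

```javascript
// Assumed insert: an optional U+0001 marker, written as regex source text.
const insert = "\\x01?";

// Naive: the insert is placed once, before the quantified symbol.
const naive = new RegExp("^" + insert + "\\d{1,3}$");
// Wrapped: the insert is repeated together with the symbol.
const wrapped = new RegExp("^(?:" + insert + "\\d){1,3}$");

const input = "1\u00012"; // a marker between the two digits
console.log(naive.test(input));   // false
console.log(wrapped.test(input)); // true
```

Without the wrapping, the marker is only tolerated before the first repetition, so a marker between two digits of the same \d{1,3} run makes the match fail.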