RegEx to remove all markup between <a and </a> tags except for within [ and ]

Question

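Trying to figure out this regex is giving me a brain cramp :)

I'm replacing thousands of individual href links with individual shortcodes in WordPress post content, using a plugin that lets me run RegEx on the content.

Rather than trying to combine an SQL query with RegEx, I'm doing it in two stages: first, SQL to find/replace each individual URL with its individual shortcode; second, remove the rest of the "href" link markup.

These are some examples of what I have now after the first step; as you can see, the URL has been replaced with the [nggallery id=xxx] shortcode.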

<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>

<a href="[nggallery id=xxxxx]">Click here!</a>

<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>

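Now I need to remove all of the href link markup (span, img, etc.) between the leading <a and the closing </a>, leaving only the shortcode [nggallery id=xxx].

I started here: https://www.regex101.com/r/rL8wP1/2

But I can't figure out how to keep the [nggallery id=xxx] shortcode from being caught by the regex.

Update 7/09/2015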

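@nhahtdh's answer appears to work perfectly: it isn't too greedy and doesn't eat adjacent HTML links. Use ( and ) as the delimiters and $1 as the replacement in the WordPress regex plugin. (If using BBEdit, use \1 instead.)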

( <a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a> )

Update 7/02/2015

Thanks to Fab Sa (answer below) for his regex at https://www.regex101.com/r/rL8wP1/4

<a.*(\[nggallery[^\]+]*\]).*?<\/a>

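This works in the regex101 emulator, but when used in the BBEdit text editor or in the WordPress plugin that runs regex, his regex removes the [nggallery id=***] shortcode. So is it too greedy? Is there some other issue?

Update 7/01/2015:

I know, I know, re: "RegEx match open tags except XHTML self-contained tags", you can't parse HTML with regex.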

Answer 1

You can use this regex

<a.*(\[nggallery[^\]+]*\]).*?<\/a>

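with the global flag (g). This regex matches the link and captures the [nggallery ...] part. You can replace every match with $1 to keep only the captured [nggallery ...] part.

I've updated your regex online: https://www.regex101.com/r/rL8wP1/4

PS: In this solution, [nggallery ...] doesn't need to be inside a specific attribute such as href. If you want to enforce that, you can use <a.*href\="(\[nggallery[^\]+]*\])".*?<\/a>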

Answer 2

How about this?

(?<=nggallery\sid=xx]">).*(?=<\/a>)

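Use global and single-line as modifiers (g and s). This matches everything between <a href="[nggallery id=xx]"> and </a>. I'm not sure I understood your question correctly... but this RegEx does what I just described.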
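A minimal JavaScript sketch of this lookaround approach, assuming an engine with lookbehind support (recent Node.js and browsers have it). The variable names are illustrative, and the closing bracket is escaped here for clarity. On a single link, only the text between the opening tag and </a> is matched, so replacing it with an empty string leaves an empty link rather than the bare shortcode.

```javascript
// Hypothetical single-link input; the id "xx" is hard-coded in the pattern.
const link = '<a href="[nggallery id=xx]">Click here!</a>';

// g = global, s = dot also matches newlines.
const inner = /(?<=nggallery\sid=xx\]">).*(?=<\/a>)/gs;

const matched = link.match(inner);
const emptied = link.replace(inner, '');

console.log(matched); // [ 'Click here!' ]
console.log(emptied); // <a href="[nggallery id=xx]"></a>
```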

Answer 3

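When there are multiple <a> tags on one line, Fab Sa's regex <a.*(\[nggallery[^\]+]*\]).*?<\/a> gobbles up everything, because the leading .* is unrestricted and will match across different <a> tags.

By restricting the allowed characters, you can more or less match what you want: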

<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>
  ^^^^^^^

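I force at least one whitespace character after the a to make sure it doesn't match some other tag, plus a few extra restrictions.

In any case, if you find that it doesn't work in some corner case, you're on your own. Manipulating HTML with regex is generally a bad idea.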
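The difference between the two patterns can be sketched in JavaScript. The one-line input and the variable names below are made up for illustration; both patterns are copied from this thread and each match is replaced with $1.

```javascript
// Hypothetical one-line input with two adjacent links.
const line = '<a href="[nggallery id=1]">One</a> and <a href="[nggallery id=2]">Two</a>';

// Fab Sa's pattern: the leading .* runs across both links.
const greedy = /<a.*(\[nggallery[^\]+]*\]).*?<\/a>/g;

// Restricted pattern: [^>]* cannot leave the opening tag.
const restricted = /<a\s[^>]*"(\[nggallery[^\]]*\])".*?<\/a>/g;

const greedyResult = line.replace(greedy, '$1');
const restrictedResult = line.replace(restricted, '$1');

console.log(greedyResult);     // [nggallery id=2]
console.log(restrictedResult); // [nggallery id=1] and [nggallery id=2]
```

The greedy version swallows the first link and the text between the links, which matches the shortcode loss described in the question's update.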

Answer 4

/<a\b[^>]*href\s*=\s*"(\[nggallery id=[^"]+\])".*?<\/a>/i

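This puts the shortcode [nggallery id=XXX] into group 1; then replace the match with the contents of group 1.

Note: this assumes reasonably well-formed HTML; the usual disclaimers apply.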
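As a rough JavaScript sketch, the pattern can be applied to markup modeled on the question's samples. The g and s flags are assumptions added here so that every link is replaced and .*? can span the line breaks in the first sample; the image attributes are shortened and the variable names are illustrative.

```javascript
// Sample markup modeled on the question (attributes shortened).
const html = [
  '<a href="[nggallery id=xx]"><span class="shutterset">',
  '<img class="alignnone" src="http://example.com/image-title.jpg" /></span></a>',
  '',
  '<a href="[nggallery id=xxxxx]">Click here!</a>',
  '',
  '<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>',
].join('\n');

// Assumed flags: g (every link), i (as in the answer), s (dot spans newlines).
const anchor = /<a\b[^>]*href\s*=\s*"(\[nggallery id=[^"]+\])".*?<\/a>/gis;

// Replace each whole link with the shortcode captured in group 1.
const cleaned = html.replace(anchor, '$1');

console.log(cleaned);
// [nggallery id=xx]
//
// [nggallery id=xxxxx]
//
// [nggallery id=xxx]
```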

Answer 5

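A little late, but I thought I'd throw this into the mix.
(Note - Warning!! This could be ugly..)

Modified: for use with BBEdit.
Note - BBEdit uses the PCRE engine. The BBEdit regex constructs can be found
here: https://gist.github.com/ccstone/5385334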

Formatted:

 # (?s)(<a(?=\s)(?>(?:(?<=\s)href\s*=\s*"\s*(\[nggallery\s+id\s*=\s*[^"\]>]*?\])"|".*?"|'.*?'|[^>]*?)+>)(?<!/>)(?(2)|(?!))).*?</a\s*>

 (?s)
 (                             # (1 start), Capture open a tag
      <a                            # Open a tag
      (?= \s )
      (?>                           # Atomic
           (?:
                (?<= \s )
                href \s* = \s*                # href attribute
                "
                \s* 
                (                             # (2 start), Capture shortcode value
                     \[nggallery \s+ 
                     id \s* = \s* [^"\]>]*? 
                     \]
                )                             # (2 end)
                "
             |  " .*? "
             |  ' .*? '
             |  [^>]*? 
           )+
           >
      )
      (?<! /> )                     # Not a self contained closure
      (?(2)                         # Only a tags with href attr, shortcode value
        |  (?!)
      )
 )                             # (1 end)
 .*?                           # Stuff inbetween
 </a \s* >                     # Close a tag

Output:

 **  Grp 0 -  ( pos 0 , len 240 ) 
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>  
 **  Grp 1 -  ( pos 0 , len 28 ) 
<a href="[nggallery id=xx]">  
 **  Grp 2 -  ( pos 9 , len 17 ) 
[nggallery id=xx]  
----------------
 **  Grp 0 -  ( pos 244 , len 46 ) 
<a href="[nggallery id=xxxxx]">Click here!</a>  
 **  Grp 1 -  ( pos 244 , len 31 ) 
<a href="[nggallery id=xxxxx]">  
 **  Grp 2 -  ( pos 253 , len 20 ) 
[nggallery id=xxxxx]  
-----------------
 **  Grp 0 -  ( pos 294 , len 90 ) 
<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>  
 **  Grp 1 -  ( pos 294 , len 65 ) 
<a title="title title" href="[nggallery id=xxx]" target="_blank">  
 **  Grp 2 -  ( pos 323 , len 18 ) 
[nggallery id=xxx]

Answer 6

Here is a regex that matches your examples perfectly.

(<a.*?href=")|([^\]]*?<\/a>)

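Rather than trying to match the entire expression at once, I used the OR operator to specify two separate regexes: one for the start of the a tag, <a.*?href=", and one for the end of the a tag, [^\]]*?<\/a>. This may or may not work in a single replace operation; if it doesn't, split it into two replace operations, running the one for the closing-tag regex first and then the one for the opening tag. If you have any other examples that break this answer, let me know.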
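A small JavaScript sketch of the single-pass variant, assuming an empty replacement string and markup modeled on the question's samples (image attributes shortened, variable names illustrative). In this engine one global replace is enough for these samples; other tools may behave differently, as noted above.

```javascript
// Sample markup modeled on the question (attributes shortened).
const html = [
  '<a href="[nggallery id=xx]"><span class="shutterset">',
  '<img class="alignnone" src="http://example.com/image-title.jpg" /></span></a>',
  '',
  '<a href="[nggallery id=xxxxx]">Click here!</a>',
  '',
  '<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>',
].join('\n');

// First alternative: start of the tag up to href=".
// Second alternative: everything after the shortcode's ] through </a>.
const tagEnds = /(<a.*?href=")|([^\]]*?<\/a>)/g;

// Delete both pieces, leaving only the shortcodes.
const stripped = html.replace(tagEnds, '');

console.log(stripped);
// [nggallery id=xx]
//
// [nggallery id=xxxxx]
//
// [nggallery id=xxx]
```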

Answer 7

I don't know why you'd want to do this with regex when it can be done with JavaScript DOM manipulation.

I'll show you the basic approach, to give you an idea:

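// Parse the markup in a detached element, then read the shortcode from the href.
var div = document.createElement('div');
div.innerHTML = yourString;
var a = div.querySelector('a');
// Read href by name; attributes[0] would be title when title comes first.
document.body.innerHTML = a.getAttribute('href');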

Working Fiddle

Also check out documentFragment

Answer 8

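Since you didn't specify, I'm assuming there are no nested anchor tags and that you just want to extract the bracketed code inside them. I'm also assuming that the identifying format of your code is "[nggallery".

Use this for the find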

<\s*a(?=\s|>)[^>]*?(\[nggallery[^\]]+\])[^>]*>(.|\n)+?(<\s*\/\s*a\s*>)

Use this for the replace

\1

(This should be the first captured group notation for BBEdit.)

Answer 9

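Yes, you can't parse HTML with regex, so how about using a minimalistic lexer to make the behavior bulletproof? It will give you more flexibility and control over your code.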

<?php

$src = <<<EOF
<a href="[nggallery id=xx]"><span class="shutterset">
<img class="alignnone size-large wp-image-23067" title="Image Title" 
src="http://example.com/wp-content/uploads/2015/06/image-title.jpg"
alt="" width="685" height="456" /></span></a>

<a href="[nggallery id=xxxxx]">Click here!</a>

<a title="title title" href="[nggallery id=xxx]" target="_blank">Title Link Title Link</a>
EOF;

// we "eat up" the source string by opening <a> tags, closing <a> tags or text
$tokens = array();
while ($src){
    // check if $src begins with this pattern <a (any optional prop)[nggallery (any string)] (any optional prop)>
    if (preg_match('/^<a [^>]*(\[nggallery [^\]]+\])[^>]*>/s', $src, $match)){
        // here you can handle data with more flexibility
        // you can grab the id or the [placeholder] via 
        //$match[1] = [nggallery id=xyz]

        // we store the chunk of string and label it as an opening tag
        $tokens[] = array('type' => 'OPENING_A', 'value' => $match[0]);
    }else if (preg_match('/^<\/a>/s', $src, $match)){
        // we store the chunk of string and label it as a closing tag
        $tokens[] = array('type' => 'CLOSING_A', 'value' => $match[0]);
    }else if (preg_match('/^./s', $src, $match)){
        // we store the chunk of string, in this case a character and label it as text
        $tokens[] = array('type' => 'TEXT', 'value' => $match[0]);
    }
    // finally we remove the identified pattern from the source string
    // and continue "eating it up"
    $src = substr($src, strlen($match[0]));
}

// once the source string has been consumed, we get this array
// var_dump($tokens);
// array (size=247)
//   0 => 
//     array (size=2)
//       'type' => string 'OPENING_A' (length=9)
//       'value' => string '<a href="[nggallery id=xx]">' (length=28)
//   1 => 
//     array (size=2)
//       'type' => string 'TEXT' (length=4)
//       'value' => string '<' (length=1)
//   2 => 
//     array (size=2)
//       'type' => string 'TEXT' (length=4)
//       'value' => string 's' (length=1)
//   3 => 
//     array (size=2)
//       'type' => string 'TEXT' (length=4)
//       'value' => string 'p' (length=1)
//       ... omitted for brevity


// now with all the parsed data, we can rebuild the html
// as needed
$html = '';
// we keep a flag to know if we are inside a tag
// marked with nggallery
$insideNGGalleryTag = false;

foreach ($tokens as $token){
    if ($token['type'] == 'OPENING_A'){
        $insideNGGalleryTag = true;
        $html .= $token['value'];
    }else if ($token['type'] == 'CLOSING_A'){
        $insideNGGalleryTag = false;
        $html .= $token['value'];
    }else{
        // if we are inside a nggallery tag, we will ignore
        // all text inside it. here you could also remove
        // html properties from the tag, move the [nggallery placeholder]
        // inside the <a> or some other behavior you might need
        if (!$insideNGGalleryTag){
            $html .= $token['value'];
        }
    }
}

// finally echo or write to file the
// modified html, in this case it would return
var_dump($html);
// <a href="[nggallery id=xx]"></a>
// <a href="[nggallery id=xxxxx]"></a>
// <a title="title title" href="[nggallery id=xxx]" target="_blank"></a>


html

regex

html-parsing