PHP preg_split on spaces, but not within tags

Question

I am using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line); and running it on phpliveregex.com produces this array:

array(10
  0=><b>test</b>
  1=>or
  2=><em>oh
  3=>yeah</em>
  4=>and
  5=><i>
  6=>oh
  7=>yeah
  8=></i>
  9=>"ye we 'hold' it"
)

That is not what I want. It should split on spaces only outside HTML tags, like this:

array(6
  0=><b>test</b>
  1=>or
  2=><em>oh yeah</em>
  3=>and
  4=><i>oh yeah</i>
  5=>"ye we 'hold' it"
)

With this regex I have only managed to add an exception for text inside "double quotes", and I need help adding more exceptions, such as the tags <img/><a></a><pre></pre><code></code>.

Any explanation of how the regex works would also be appreciated.

Answer 1

Description

Rather than using a split command, just match the parts you want:

<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*
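
A minimal sketch of how this pattern might be applied in PHP. It assumes the closing tag is matched with a `\1` back-reference to the captured tag name and that the second look-ahead is `[\s>\\]`. The variable names, the nowdoc wrapper, the `~` delimiters, and the second pass that splits the remaining text chunks on whitespace are illustrative additions, not part of the answer. Because the pattern can match an empty string, empty chunks are skipped.

```php
<?php
// Pattern from the answer, wrapped in a nowdoc so the quotes need no escaping.
// Assumption: the closing tag uses the back-reference \1 to the captured tag name.
$pattern = <<<'REGEX'
~<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*~
REGEX;

$input = '<b>test</b> or <em>oh yeah</em> and <i>oh yeah</i> "ye we \'hold\' it"';

preg_match_all($pattern, $input, $matches);

$tokens = [];
foreach ($matches[0] as $chunk) {
    if (trim($chunk) === '') {
        continue; // the pattern can match an empty or whitespace-only string
    }
    if ($chunk[0] === '<') {
        $tokens[] = $chunk; // a whole element: keep it as one item
    } elseif (preg_match_all('~"[^"]*"|[^\s"]+~', $chunk, $words)) {
        // Text between elements: quoted runs stay whole, the rest splits on whitespace.
        $tokens = array_merge($tokens, $words[0]);
    }
}

print_r($tokens);
```

On the question's input this should produce the six items the asker was after: the three elements, the words "or" and "and", and the quoted phrase.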

Example

Live demo

https://regex101.com/r/bK8iL3/1

Sample text

Note the difficult edge case in the second paragraph

<b>test</b> or <strong> this </strong><em> oh yeah </em> and <i>oh yeah</i> Here we are "ye we 'hold' it"

some<img/>gfsf<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>

Sample matches

MATCH 1
0.  [0-11]  `<b>test</b>`

MATCH 2
0.  [11-15] ` or `

MATCH 3
0.  [15-38] `<strong> this </strong>`

MATCH 4
0.  [38-56] `<em> oh yeah </em>`

MATCH 5
0.  [56-61] ` and `

MATCH 6
0.  [61-75] `<i>oh yeah</i>`

MATCH 7
0.  [75-111]    ` Here we are "ye we 'hold' it" some`

MATCH 8
0.  [111-117]   `<img/>`

MATCH 9
0.  [117-121]   `gfsf`

MATCH 10
0.  [121-213]   `<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a>`

MATCH 11
0.  [213-224]   `<pre></pre>`

MATCH 12
0.  [224-237]   `<code></code>`

MATCH 13
0.  [237-254]   `<strong></strong>`

MATCH 14
0.  [254-261]   `<b></b>`

MATCH 15
0.  [261-270]   `<em></em>`

MATCH 16
0.  [270-277]   `<i></i>`

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  <                        '<'
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      img                      'img'
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      [\s>\/]                  any character of: whitespace (\n, \r,
                               \t, \f, and " "), '>', '\/'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [^'"\s>]*                any character except: ''', '"',
                                 whitespace (\n, \r, \t, \f, and "
                                 "), '>' (0 or more times (matching
                                 the most amount possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    \s?                      whitespace (\n, \r, \t, \f, and " ")
                             (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    \/?                      '/' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    (                        group and capture to \1:
----------------------------------------------------------------------
      a                        'a'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      span                     'span'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      pre                      'pre'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      code                     'code'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      strong                   'strong'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      b                        'b'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      em                       'em'
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      i                        'i'
----------------------------------------------------------------------
    )                        end of \1
----------------------------------------------------------------------
    (?=                      look ahead to see if there is:
----------------------------------------------------------------------
      [\s>\\]                  any character of: whitespace (\n, \r,
                               \t, \f, and " "), '>', '\\'
----------------------------------------------------------------------
    )                        end of look-ahead
----------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
----------------------------------------------------------------------
      [^>=]                    any character except: '>', '='
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      =                        '='
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
        [^']*                    any character except: ''' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        '                        '\''
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
        [^"]*                    any character except: '"' (0 or more
                                 times (matching the most amount
                                 possible))
----------------------------------------------------------------------
        "                        '"'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        [^'"\s>]*                any character except: ''', '"',
                                 whitespace (\n, \r, \t, \f, and "
                                 "), '>' (0 or more times (matching
                                 the most amount possible))
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
    )*                       end of grouping
----------------------------------------------------------------------
    \s?                      whitespace (\n, \r, \t, \f, and " ")
                             (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    \/?                      '/' (optional (matching the most amount
                             possible))
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
    .*?                      any character (0 or more times (matching
                             the least amount possible))
----------------------------------------------------------------------
    <                        '<'
----------------------------------------------------------------------
    \/                       '/'
----------------------------------------------------------------------
    \1                       what was matched by capture \1
----------------------------------------------------------------------
    >                        '>'
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more times
                           (matching the most amount possible)):
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [^"]*                    any character except: '"' (0 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
    [^"<]*                   any character except: '"', '<' (0 or
                             more times (matching the most amount
                             possible))
----------------------------------------------------------------------
  )*                       end of grouping
----------------------------------------------------------------------
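
For comparison, the asker's original preg_split approach can probably be extended along the same lines: anything matched before (*SKIP)(*F) is consumed and then rejected, so only spaces outside those spans are left to act as delimiters. The sketch below is an illustration and not this answer's method. It assumes a simplified `[^>]*` for attributes (so a `>` inside an attribute value would break it) and reuses the tag list from the pattern above.

```php
<?php
// Alternatives listed before (*SKIP)(*F) are matched, then deliberately failed,
// and the regex engine resumes scanning after them. Only the last alternative,
// a single space (\x20), can succeed and act as the delimiter.
$pattern = '~(?:"[^"]*"'                                        // double-quoted text
         . '|<(a|span|pre|code|strong|b|em|i)\b[^>]*>.*?</\1>'  // paired element
         . '|<img\b[^>]*>'                                      // standalone <img>
         . ')(*SKIP)(*F)|\x20~';

$input = '<b>test</b> or <em>oh yeah</em> and <i>oh yeah</i> "ye we \'hold\' it"';

// PREG_SPLIT_NO_EMPTY drops empty pieces caused by consecutive spaces.
$parts = preg_split($pattern, $input, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

The trade-off is that every protected construct has to be spelled out inside the skipped group, which is why matching the wanted parts directly, as above, can be easier to maintain.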

Answer 2

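Using DOMDocument is easier, because you do not have to describe what an HTML tag is or what it looks like. You only need to check the nodeType. When the node is a text node, tokenise it with preg_match_all (which is more convenient than designing a pattern for preg_split):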

$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';

$dom = new DOMDocument;
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);

$nodeList = $dom->documentElement->childNodes;

$results = [];

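foreach ($nodeList as $childNode) {
    if ($childNode->nodeType == XML_TEXT_NODE) {
        // Text node: collect unquoted words and double-quoted runs as separate items.
        // A whitespace-only node yields no matches and is skipped.
        if (preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m)) {
            $results = array_merge($results, $m[0]);
        }
    } else {
        // Any other node (for example an element): keep its HTML as a single item.
        $results[] = $dom->saveHTML($childNode);
    }
}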

print_r($results);

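Note: I chose a default behaviour for the case where a double-quoted part is left unclosed (no closing quote): everything from the opening quote to the end of the text node is kept as a single item. Feel free to change it.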
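
To make that choice concrete, here is a small sketch comparing the tokenising pattern used above with one possible stricter variant. The variant, the sample string, and the variable names are assumptions for illustration only. The variant treats a quote with no partner as an ordinary character attached to the next word, so the text after it is still split on whitespace.

```php
<?php
$text = 'say "hello world" and "unclosed tail';

// Pattern from the answer: an unclosed quote swallows the rest of the text.
preg_match_all('~[^\s"]+|"[^"]*"?~', $text, $lenient);

// Hypothetical stricter variant: only closed quotes are kept whole; a lone
// quote stays attached to the word that follows it.
preg_match_all('~"[^"]*"|"?[^\s"]+~', $text, $strict);

print_r($lenient[0]);
print_r($strict[0]);
```

With the first pattern the last item should be `"unclosed tail`; with the variant it should be split into `"unclosed` and `tail`.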

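Note 2: sometimes the LIBXML_ constants are not defined. You can work around this by testing for the constant beforehand and defining it if needed: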

if (!defined('LIBXML_HTML_NOIMPLIED'))
    define('LIBXML_HTML_NOIMPLIED', 8192);
