PHP preg_split on spaces, but not within tags
I am using preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line);
and running it on phpliveregex.com.
It produces this array:
array(10
0=><b>test</b>
1=>or
2=><em>oh
3=>yeah</em>
4=>and
5=><i>
6=>oh
7=>yeah
8=></i>
9=>"ye we 'hold' it"
)
That is not what I want. It should split on spaces only outside HTML tags, like this:
array(6
0=><b>test</b>
1=>or
2=><em>oh yeah</em>
3=>and
4=><i>oh yeah</i>
5=>"ye we 'hold' it"
)
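With this regex I could only add an exception for text inside "double quotes", and I really need help adding more exceptions, such as the tags <img/><a></a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>
Any explanation of how the regex works would also be appreciated.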
Description
Don't use a split command; just match the parts you want:
<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*
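A minimal sketch of how this pattern could be applied in PHP with preg_match_all. The variable names, the `~` delimiter, the nowdoc quoting and the trim/filter step are assumptions made for illustration and are not part of the original answer. The pattern is written with the `\1` back-reference, so `<\/\1>` only matches the closing tag of the element that was opened. Because the last alternative can also match an empty string, empty matches are filtered out afterwards.

```php
<?php
// Pattern from the description. A nowdoc avoids escaping the quotes it contains.
$pattern = <<<'REGEX'
~<(?:(?:img)(?=[\s>\/])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>|(a|span|pre|code|strong|b|em|i)(?=[\s>\\])(?:[^>=]|=(?:'[^']*'|"[^"]*"|[^'"\s>]*))*\s?\/?>.*?<\/\1>)|(?:"[^"]*"|[^"<]*)*~
REGEX;

// Hypothetical single-line input, modelled on the question.
$input = '<b>test</b> or <em>oh yeah</em> and <i>oh yeah</i> "ye we \'hold\' it"';

preg_match_all($pattern, $input, $matches);

// $matches[0] holds the full matches. Trim surrounding whitespace and drop
// the empty matches that the last alternative can produce.
$chunks = array_values(array_filter(
    array_map('trim', $matches[0]),
    function ($chunk) { return $chunk !== ''; }
));

print_r($chunks);
```

For this input, each element (`<b>…</b>`, `<em>…</em>`, `<i>…</i>`) comes back as one chunk, and so does the quoted string.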
Example
Live demo
https://regex101.com/r/bK8iL3/1
Sample text
Note the difficult edge case in the second line
<b>test</b> or <strong> this </strong><em> oh yeah </em> and <i>oh yeah</i> Here we are "ye we 'hold' it"
some<img/>gfsf<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>
Sample matches
MATCH 1
0. [0-11] `<b>test</b>`
MATCH 2
0. [11-15] ` or `
MATCH 3
0. [15-38] `<strong> this </strong>`
MATCH 4
0. [38-56] `<em> oh yeah </em>`
MATCH 5
0. [56-61] ` and `
MATCH 6
0. [61-75] `<i>oh yeah</i>`
MATCH 7
0. [75-111] ` Here we are "ye we 'hold' it" some`
MATCH 8
0. [111-117] `<img/>`
MATCH 9
0. [117-121] `gfsf`
MATCH 10
0. [121-213] `<a html="droids.html" onmouseover=' var x=" Not the droid I am looking for " ; '>droides</a>`
MATCH 11
0. [213-224] `<pre></pre>`
MATCH 12
0. [224-237] `<code></code>`
MATCH 13
0. [237-254] `<strong></strong>`
MATCH 14
0. [254-261] `<b></b>`
MATCH 15
0. [261-270] `<em></em>`
MATCH 16
0. [270-277] `<i></i>`
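The text between tags comes back as a single chunk (MATCH 7 above still contains several words and a quoted string). If individual words are needed, one possible follow-up step is to split such a chunk on whitespace while protecting the quoted parts with the (*SKIP)(*F) idiom from the question. This is only a sketch: it assumes `\s+` as the separator instead of the question's `\x20`, and the variable names are illustrative.

```php
<?php
// A text chunk like the one in MATCH 7 (hypothetical variable name).
$chunk = ' Here we are "ye we \'hold\' it" some';

// A quoted string is matched first and then rejected with (*SKIP)(*F), so the
// engine jumps past it. Only whitespace outside quotes acts as a separator.
$tokens = preg_split('/"[^"]*"(*SKIP)(*F)|\s+/', $chunk, -1, PREG_SPLIT_NO_EMPTY);

print_r($tokens);
```

PREG_SPLIT_NO_EMPTY drops the empty piece that the leading space would otherwise produce.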
Explanation
NODE EXPLANATION
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
img 'img'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[\s>\/] any character of: whitespace (\n, \r,
\t, \f, and " "), '>', '\/'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"\s>]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and "
"), '>' (0 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
a 'a'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
span 'span'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
pre 'pre'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
code 'code'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
strong 'strong'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
b 'b'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
em 'em'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
i 'i'
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
(?= look ahead to see if there is:
----------------------------------------------------------------------
[\s>\\] any character of: whitespace (\n, \r,
\t, \f, and " "), '>', '\\'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
----------------------------------------------------------------------
[^>=] any character except: '>', '='
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
= '='
----------------------------------------------------------------------
(?: group, but do not capture:
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
[^']* any character except: ''' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
' '\''
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^'"\s>]* any character except: ''', '"',
whitespace (\n, \r, \t, \f, and "
"), '>' (0 or more times (matching
the most amount possible))
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
\s? whitespace (\n, \r, \t, \f, and " ")
(optional (matching the most amount
possible))
----------------------------------------------------------------------
\/? '/' (optional (matching the most amount
possible))
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
.*? any character (0 or more times (matching
the least amount possible))
----------------------------------------------------------------------
< '<'
----------------------------------------------------------------------
\/ '/'
----------------------------------------------------------------------
\1 what was matched by capture \1
----------------------------------------------------------------------
> '>'
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
(?: group, but do not capture (0 or more times
(matching the most amount possible)):
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
[^"]* any character except: '"' (0 or more
times (matching the most amount
possible))
----------------------------------------------------------------------
" '"'
----------------------------------------------------------------------
| OR
----------------------------------------------------------------------
[^"<]* any character except: '"', '<' (0 or
more times (matching the most amount
possible))
----------------------------------------------------------------------
)* end of grouping
----------------------------------------------------------------------
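Using DOMDocument is easier, because you don't need to describe what an HTML tag is or what it looks like. You only need to check the nodeType. When the node is a text node, split it with preg_match_all (which is handier than designing a pattern for preg_split):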
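$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';

$dom = new DOMDocument;
// Wrap the fragment in a <div> so it has a single root element.
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);

$nodeList = $dom->documentElement->childNodes;
$results = [];

foreach ($nodeList as $childNode) {
    if ($childNode->nodeType == XML_TEXT_NODE) {
        // Text node: collect words and double-quoted parts (closing quote optional).
        // A whitespace-only text node yields no match and is skipped.
        if (preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m))
            $results = array_merge($results, $m[0]);
    } else {
        // Any other node (an element): keep its HTML as a single item.
        $results[] = $dom->saveHTML($childNode);
    }
}

print_r($results);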
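Note: I chose a default behavior for a double-quoted part that is left unclosed (it is kept without its closing quote); feel free to change it.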
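To illustrate that choice, the sketch below compares the pattern used above, where the closing quote is optional, with a stricter variant that requires it. The stricter pattern and the sample string are assumptions added for illustration.

```php
<?php
// Hypothetical sample text ending in an unclosed double quote.
$text = 'say "hi there" and "oops no end';

// Lenient: the closing quote is optional, so an unclosed quoted part runs to
// the end of the text and is kept as one item.
preg_match_all('~[^\s"]+|"[^"]*"?~', $text, $lenient);

// Strict: the closing quote is required, so the unclosed part falls back to
// single words and the stray opening quote is dropped.
preg_match_all('~[^\s"]+|"[^"]*"~', $text, $strict);

print_r($lenient[0]);
print_r($strict[0]);
```

With the lenient pattern the last item is `"oops no end`; with the strict one the same text becomes the three items `oops`, `no` and `end`.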
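Note 2: Sometimes the LIBXML_ constants are not defined. You can work around this by testing for the constant beforehand and defining it when needed: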
if (!defined('LIBXML_HTML_NOIMPLIED'))
define('LIBXML_HTML_NOIMPLIED', 8192);