Get class value and text from qualifying span tags in an HTML document
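Please help me design the following pattern for preg_match_all.
How can I change my pattern to get the desired output?
Search the string for span tags with a class name like "email_" (email_ OR email_p_12 OR email_22 OR email_xx).
Get the text between the tags: <span class=" xx email_xx xx "> THE EMAIL ADDRESS </span>
Get the class name that begins with 'email_'.
This is my pattern: $pattern = '~<span class=\"((.*?)*)*(email_(.*?))?(.*?)\">(.*?)</span>~';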
What I need is an array like this:
Array
(
[0] => Array
(
[mail] => labore@et.de
[class] => email_p_14
)
[1] => Array
(
[mail] => esse@cillum.de
[class] => email_p_22
)
[2] => Array
(
[mail] => anim@id.de
[class] => email_
)
[3] => Array
(
[mail] => laboris@nisi.de
[class] => email_
)
)
File:
<?php
$string = '
<p>
Lorem ipsum dolor sit amet,
consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut
<span class=" red email_p_14">labore@et.de</span>
dolore magna aliqua. Ut enim ad minim veniam,
quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea consequat.
Duis aute irure in reprehenderit in voluptate velit
<span class="email_p_22">esse@cillum.de</span>
dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit
<span class="blue email_ green">anim@id.de</span>
laborum. Donec elementum ligula.
Quis nostrud exercitation ullamco
<span class="blue email_ green black">laboris@nisi.de</span>
aliquip ex ea consequat.
</p>';
/* Looking for these:
<span class=" red email_p_14">labore@et.de</span>
<span class="email_p_22">esse@cillum.de</span>
<span class="blue email_ green">anim@id.de</span>
<span class="blue email_ green black">laboris@nisi.de</span>
*/
$pattern = '~<span class=\"((.*?)*)*(email_(.*?))?(.*?)\">(.*?)</span>~';
preg_match_all($pattern, $string, $m);
$clean_array = array_filter(array_map('array_filter', $m));
ksort($clean_array);
$output = Array();
foreach($clean_array as $row) {
foreach($row as $key => $val){
$output[$key][]=$val;
}
}
print("<pre>".print_r($output,true)."</pre>");
/* This is what I get:
Array
(
[0] => Array
(
[0] => labore@et.de
[1] => red email_p_14
[2] => labore@et.de
)
[1] => Array
(
[0] => esse@cillum.de
[1] => email_
[2] => p_22
[3] => esse@cillum.de
)
[2] => Array
(
[0] => anim@id.de
[1] => blue email_ green
[2] => anim@id.de
)
[3] => Array
(
[0] => laboris@nisi.de
[1] => blue email_ green black
[2] => laboris@nisi.de
)
)
What I need is an array like this:
Array
(
[0] => Array
(
[mail] => labore@et.de
[class] => email_p_14
)
[1] => Array
(
[mail] => esse@cillum.de
[class] => email_p_22
)
[2] => Array
(
[mail] => anim@id.de
[class] => email_
)
[3] => Array
(
[mail] => laboris@nisi.de
[class] => email_
)
)
*/
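For the class value, you use the pattern ((.*?)*)*(email_(.*?))?(.*?), which combines nested, repeated capture groups in which every part is effectively optional, so nothing forces the email_ name to be captured on its own.
For the email address you use (.*?), which lazily matches any characters and does not check for an email-like shape.
You can use named capture groups to get the keys mail and class: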
<span[^<>]*\bclass="[^"]*(?<class>email_[^\s"]*)[^"]*">\h*(?<mail>[^\s@]+@[^\s@]+)\h*<\/span>
Then remove the numeric keys from the result:
$re = '`<span[^<>]*\bclass="[^"]*(?<class>email_[^\s"]*)[^"]*">\h*(?<mail>[^\s@]+@[^\s@]+)\h*<\/span>`';
$str = '<span class=" xx email_p_14 xx "> labore@et.de </span>';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
print_r(array_filter($matches[0], function ($k) { return !is_numeric($k); }, ARRAY_FILTER_USE_KEY));
Output
Array
(
[class] => email_p_14
[mail] => labore@et.de
)
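The snippet above runs the pattern against a single span. As a minimal sketch, assuming the four sample spans from the question (shortened here to the spans themselves), the same pattern can be looped over with PREG_SET_ORDER to build the requested array directly. The variable name $result is chosen for illustration only.

```php
<?php
// Sample input: the four spans from the question, without the filler text.
$string = '
<span class=" red email_p_14">labore@et.de</span>
<span class="email_p_22">esse@cillum.de</span>
<span class="blue email_ green">anim@id.de</span>
<span class="blue email_ green black">laboris@nisi.de</span>';

// Same pattern as above: named groups "class" and "mail".
$re = '`<span[^<>]*\bclass="[^"]*(?<class>email_[^\s"]*)[^"]*">\h*(?<mail>[^\s@]+@[^\s@]+)\h*<\/span>`';

// PREG_SET_ORDER yields one sub-array per matched span.
preg_match_all($re, $string, $matches, PREG_SET_ORDER);

$result = [];
foreach ($matches as $match) {
    // Pick only the named keys; the numeric keys are ignored.
    $result[] = ['mail' => $match['mail'], 'class' => $match['class']];
}

print_r($result);
```

Picking the two named keys explicitly avoids the array_filter step, since the numeric keys are never copied into $result.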
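You could also look at DOMDocument: find the spans that have a class name starting with email_, and then check the value of each span against an email-like pattern.
You can then build the array from the keys and values.
For example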
$str = '<span class=" xx email_p_14 xx "> labore@et.de </span>';
$dom = new DomDocument();
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$doc = new DOMXPath($dom);
$items = $doc->query("//span[contains(@class, 'email_')]");
foreach ($items as $item) {
$class = array_filter(explode(' ', $item->getAttribute('class')), function($x) {
return substr( $x, 0, 6 ) === "email_";
});
print_r($class);
echo $item->nodeValue;
}
Output
Array
(
[2] => email_p_14
)
labore@et.de
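The example above prints the class and the text but does not yet check the text or build the array. A minimal sketch of those two remaining steps follows. It assumes that PHP's filter_var with FILTER_VALIDATE_EMAIL is an acceptable stand-in for "an email address pattern"; that choice, the second sample span, and the name $result are illustrative additions, not part of the original answer.

```php
<?php
// One valid span plus one span whose text is not an address (added for illustration).
$str = '<p><span class=" xx email_p_14 xx "> labore@et.de </span>'
     . '<span class="email_ red">not an address</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);

$result = [];
foreach ($xpath->query("//span[contains(@class, 'email_')]") as $item) {
    // Split the class attribute on whitespace and keep the tokens starting with "email_".
    $tokens = preg_split('/\s+/', $item->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
    $classes = array_values(array_filter($tokens, function ($token) {
        return strpos($token, 'email_') === 0;
    }));

    $mail = trim($item->nodeValue);

    // Skip spans without a matching class token or without an email-like value.
    if ($classes === [] || filter_var($mail, FILTER_VALIDATE_EMAIL) === false) {
        continue;
    }

    $result[] = ['mail' => $mail, 'class' => $classes[0]];
}

print_r($result);
```

Only the first span ends up in $result; the second is dropped because its text fails the email check.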
Parse the HTML with DOMDocument and XPath. Once you have identified the appropriate nodes, dig in and extract the data, then push a new subarray into the result.
Code: (Demo)
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->query("//span[starts-with(@class, 'email_') or contains(@class, ' email_')]") as $span) {
$result[] = [
'mail' => $span->nodeValue,
'class' => preg_replace(
'~.*\b(email_\S*).*~',
'$1',
$span->getAttribute('class')
)
];
}
var_export($result);
Output:
array (
0 =>
array (
'mail' => 'labore@et.de',
'class' => 'email_p_14',
),
1 =>
array (
'mail' => 'esse@cillum.de',
'class' => 'email_p_22',
),
2 =>
array (
'mail' => 'anim@id.de',
'class' => 'email_',
),
3 =>
array (
'mail' => 'laboris@nisi.de',
'class' => 'email_',
),
)
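The XPath expression above relies on a literal space before email_. As a hedged variant, assuming class names might also be separated by tabs or newlines, normalize-space() and concat() (both standard XPath 1.0 functions) can collapse the whitespace first. The sample markup, including the tab-separated class and the myemail_x span, is invented for illustration.

```php
<?php
// Sample input: a space-separated class, a tab-separated class, and a near miss.
$string = '<p>'
    . '<span class=" red email_p_14">labore@et.de</span>'
    . "<span class=\"blue\temail_ green\">anim@id.de</span>"
    . '<span class="myemail_x">skip@me.de</span>'
    . '</p>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($string);
$xpath = new DOMXPath($dom);

// Prefix a space to the whitespace-normalized class list, so every class
// name is preceded by exactly one space, including the first one.
$query = "//span[contains(concat(' ', normalize-space(@class)), ' email_')]";

$result = [];
foreach ($xpath->query($query) as $span) {
    // Capture the first class name that starts with "email_".
    if (preg_match('~(?<!\S)email_\S*~', $span->getAttribute('class'), $m)) {
        $result[] = ['mail' => trim($span->nodeValue), 'class' => $m[0]];
    }
}

var_export($result);
```

The myemail_x span is skipped because email_ does not start that class name, while the tab-separated class still matches.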