[php, maybe regex] How to remove everything from a string except [/] + [a sequence of 4 or more digits] (/1111)
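I have a large string (the source code of a big page) stored in a variable, and I want to remove everything except the value inside
href="HERE"
like this: href="/45214"
It is important that only values with this format are kept: a single / followed by a sequence of 4 or more digits.
Expected output:
/45214
I think it is something like this: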
'/href=\"(\/)[0-9]/'
$source = '</li>
<li >
<div class="widget-post-holder">
<a href="/45214" title="care with your skin against
pollution" class="post-thumb" >
<span class="post-cont">
health </span>
<div class="librLoaderLine"></div>
<img title="care with your skin against pollution"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/45214/libr_225k_45214.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href = '/45214'"></span>
</a>
</li>
<li >
<div class="widget-post-holder">
<a href="/7487423" title="natural hair straightening" class="post-thumb" >
<span class="post-cont">health</span>
<div class="librLoaderLine"></div>
<img title="natural hair straightening"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/7487423/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/7487423/libr_225k_7487423.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href = '/7487423'"></span>
</a>';
preg_match_all("/href=\"(\/)[0-9]/", $source, $results);
var_export(end($results));
Expected output:
/45214
/7487423
Thanks.
Use the regex href=\"(\/)[0-9]{4,}
The {4,} quantifier ensures that 4 or more consecutive digits are matched.
See the example at https://regex101.com/r/BlKv9L/1/
$re = '/href=\"(\/)[0-9]{4,}/m';
$str = ' <a href="/45214" title="care with your skin against
<a href="/452143232" title="care with your skin against
<a href="/214" title="care with your skin against
<a href="/543543545214" title="care with your skin against
<a href="/45215434" title="care with your skin against
';
preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);
// Print the entire match result
var_dump($matches);
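As a minimal sketch (not part of the original answer), the variant below moves the capture group so that it holds the slash and the digits together, and requires the closing quote so that a value such as /45214abc is not partially matched. The sample string and the variable name $hrefs are assumptions made for illustration.

```php
<?php
// Hypothetical sample input; the second and third hrefs should be rejected.
$str = '<a href="/45214" title="a">
<a href="/214" title="b">
<a href="/45214abc" title="c">
<a href="/7487423" title="d">';

// Group 1 captures "/" plus 4 or more digits; the closing quote must follow.
$re = '/href="(\/[0-9]{4,})"/';

preg_match_all($re, $str, $matches);

// With the default PREG_PATTERN_ORDER, $matches[1] lists group 1 of every match.
$hrefs = $matches[1];
print_r($hrefs);
```

Reading $matches[1] instead of dumping the whole $matches array gives only the /digits values the question asks for.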
<(([^<>"]+"){2})*[^<>"]*href="\K[^"]+
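The pattern above matches every href value inside a tag, whatever its format, because \K discards everything matched before it. A possible way to use it for this question, sketched under the assumption that a second filtering step is acceptable, is to collect all values and then keep only those that look like /digits. The sample markup and variable names are illustrative only.

```php
<?php
// Hypothetical sample markup, including an onclick that mentions href.
$html = '<li><a href="/45214" title="first" class="post-thumb">x</a></li>
<li><a class="post-thumb" href="/7487423" title="second">y</a></li>
<li><a href="/about" title="third">z</a></li>
<span class="hd-post" onclick="window.location.href = \'/45214\'"></span>';

// \K resets the start of the reported match, so $m[0] holds only the href values.
$re = '~<(([^<>"]+"){2})*[^<>"]*href="\K[^"]+~';
preg_match_all($re, $html, $m);

// Keep only values made of a slash followed by 4 or more digits.
$hrefs = array_values(preg_grep('~^/\d{4,}$~', $m[0]));
print_r($hrefs);
```

The onclick text is not picked up here because the pattern only accepts href=" directly after complete pairs of double-quoted attribute values.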
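You can use DOMDocument to extract all href attribute values, then check each value with a simple regex, '~^/\d{4,}$~', which matches:
^ - start of string
/ - a slash
\d{4,} - 4 or more digits
$ - end of string.
PHP code: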
$html = "YOUR_HTML_CODE";
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$results = [];
foreach ($xpath->query('//*/@href') as $val) {
if (preg_match('~^/\d{4,}$~', $val->value)) {
array_push($results, $val->value);
}
}
print_r($results);
Output:
Array
(
[0] => /45214
[1] => /7487423
)
See the PHP demo.
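The same idea can be wrapped in a reusable function. This is only a sketch: the function name extractNumericHrefs is hypothetical, limiting the XPath query to a elements is an assumption about what the question needs, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical helper: returns href values of <a> elements that look like /1234.
function extractNumericHrefs(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    $xpath = new DOMXPath($dom);
    $results = [];
    foreach ($xpath->query('//a/@href') as $attr) {
        // Keep only a slash followed by 4 or more digits and nothing else.
        if (preg_match('~^/\d{4,}$~', $attr->value)) {
            $results[] = $attr->value;
        }
    }
    return $results;
}

// Illustrative input, not taken from the question.
$html = '<ul>
<li><a href="/45214" title="first">one</a></li>
<li><a href="/214" title="too short">two</a></li>
<li><a href="https://example.com/99999">three</a></li>
<li><a href="/7487423" title="second">four</a></li>
</ul>';

$hrefs = extractNumericHrefs($html);
print_r($hrefs);
```

Because the parser works on attribute nodes, text such as window.location.href = '/45214' inside an onclick attribute is never considered.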
Although the OP asked for a PHP solution, since this involves HTML you can also use JavaScript with a regular expression, as follows:
var d = document;
d.g = d.getElementsByTagName;
var aTags = d.g("a");
var matches = [];
var re = /\/\d{4,}/;
for (var i=0, max = aTags.length; i <= max - 1; i++) {
matches[i] = re.exec(aTags[i].href);
}
d.body.innerHTML="";
console.log(matches);
</li>
<li >
<div class="widget-post-holder">
<a href="/45214" title="care with your skin against
pollution" class="post-thumb" >
<span class="post-cont">
health </span>
<div class="librLoaderLine"></div>
<img title="care with your skin against pollution"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/45214/libr_225k_45214.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href ='/45214'"></span>
</a>
</li>
<li >
<div class="widget-post-holder">
<a href="/7487423" title="natural hair straightening" class="post-thumb" >
<span class="post-cont">
health </span>
<div class="librLoaderLine"></div>
<img title="natural hair straightening"
id="0045214"
class="te lazy js-postPreview"
data-src="https://wemedic.com/media/posts/201105/23/7487423/original/14.jpg"
src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
data-libr="https://healthandc.com/media/posts/201105/23/7487423/libr_225k_7487423.webm"
alt="care with your skin against pollution" />
<span class="hd-post" onclick="window.location.href ='/7487423'"></span>
</a>
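Scraper series:
You can use preg_match_all() efficiently with a regex that is safe for parsing markup.
The benefit is that it does not fail on malformed HTML, and it does not look for matches inside invisible content (such as comments).
Runnable PHP code:
http://sandbox.onlinephpfunctions.com/code/a182a6d57e887d44f9040166cf57fbb3486bb183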
<?php
$string = ' HTML ';
preg_match_all
(
'~(?si)(?:<[\w:]+(?=(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)href\s*=\s*(?:([\'"])\s*(/\d{4,})\s*))\s+(?:".*?"|\'.*?\'|[^>]*?)+>\K|<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>".*?"|\'.*?\'|(?:(?!/>)[^>])?)+)?\s*>).*?</\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:".*?"|\'.*?\'|[^>]?)+\s*/?)|\?.*?\?|(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?))))>(*SKIP)(?!))~',
$string,
$matches,
PREG_PATTERN_ORDER
);
print_r( $matches[2] );
Output
Array
(
[0] => /45214
[1] => /7487423
)
Regex explained
(?si) # Modifier, dot-all and ignore case
(?:
# What we want to examine, any tag with href attribute
< [\w:]+
(?= # Assertion (a pseudo atomic group)
(?: [^>"'] | " [^"]* " | ' [^']* ' )*?
(?<= \s )
href \s* = \s* # href attribute
(?:
( ['"] ) # (1), # quote begin
\s*
( # (2 start)
/ \d{4,} # /dddd (slash, 4 or more digits) to be saved
) # (2 end)
\s*
# quote end
)
)
\s+
(?: " .*? " | ' .*? ' | [^>]*? )+
>
\K # Don't store this match, we already have capture group 2 value
|
# OR,
# Match, but skip these (this just advances the current position)
<
(?:
(?:
(?:
# Invisible content; end tag req'd
( # (3 start)
script
| style
| object
| embed
| applet
| noframes
| noscript
| noembed
) # (3 end)
(?:
\s+
(?>
" .*? "
| ' .*? '
| (?:
(?! /> )
[^>]
)?
)+
)?
\s* >
)
.*? </ \s*
(?= > )
)
| (?: /? [\w:]+ \s* /? )
| (?:
[\w:]+
\s+
(?:
" .*? "
| ' .*? '
| [^>]?
)+
\s* /?
)
| \? .*? \?
| (?:
!
(?:
(?: DOCTYPE .*? )
| (?: \[CDATA\[ .*? \]\] )
| (?: -- .*? -- )
| (?: ATTLIST .*? )
| (?: ENTITY .*? )
| (?: ELEMENT .*? )
)
)
)
>
(*SKIP)
(?!)
)
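The core trick in the pattern above is the skip-and-fail construct: unwanted regions are matched first and then rejected with (*SKIP)(?!), so the engine resumes searching after them. The much smaller sketch below illustrates only that technique and is not a replacement for the full pattern. It uses (*FAIL), a synonym for (?!), handles only comments and script/style blocks, and assumes double-quoted href values; the sample markup is invented.

```php
<?php
// Hypothetical input with links hidden in a comment and in a script block.
$html = '<!-- <a href="/11111">commented out</a> -->
<script>var link = \'<a href="/22222">in script</a>\';</script>
<a href="/45214" title="first">one</a>
<a href="/214">too short</a>
<a href="/7487423" title="second">two</a>';

$re = '~(?si)'
    // Consume a whole comment, then force a failure and skip past it.
    . '<!--.*?-->(*SKIP)(*FAIL)'
    // Do the same for script and style elements.
    . '|<(script|style)\b.*?</\1\s*>(*SKIP)(*FAIL)'
    // Otherwise report only the /digits value of a double-quoted href.
    . '|(?<=\s)href\s*=\s*"\K/\d{4,}(?=")~';

preg_match_all($re, $html, $m);
$hrefs = $m[0];
print_r($hrefs);
```

Unlike the full pattern, this sketch does not check that href sits inside a tag, so it trades robustness for readability.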