[php, maybe regex] How to remove everything from a string except [/] + [a sequence of 4 or more digits] (/1111)

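I have a large string stored in a variable (the source of a big page), and I want to remove everything except the values inside href="HERE"

like this: href="/45214"

Importantly, only values in this format should be kept: a single / followed by a sequence of 4 or more digits

Expected output:

/45214

I think it is something like this: '/href=\"(\/)[0-9]/'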

$source = '</li>
<li >
    <div class="widget-post-holder">

        <a href="/45214" title="care with your skin against 
           pollution" class="post-thumb" >

            <span class="post-cont">
                health            </span>
            <div class="librLoaderLine"></div>
            <img title="care with your skin against pollution"
                 id="0045214"
                 class="te lazy   js-postPreview"
                 data-src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
                 src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
                 data-libr="https://healthandc.com/media/posts/201105/23/45214/libr_225k_45214.webm"
                 alt="care with your skin against pollution" />
            <span class="hd-post" onclick="window.location.href = '/45214'"></span>

        </a>
</li>
<li >
    <div class="widget-post-holder">
        <a href="/7487423" title="natural hair straightening" class="post-thumb" >
            <span class="post-cont">health</span>
            <div class="librLoaderLine"></div>
            <img title="natural hair straightening"
                 id="0045214"
                 class="te lazy   js-postPreview"
                 data-src="https://wemedic.com/media/posts/201105/23/7487423/original/14.jpg"
                 src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
                 data-libr="https://healthandc.com/media/posts/201105/23/7487423/libr_225k_7487423.webm"
                 alt="care with your skin against pollution" />
            <span class="hd-post" onclick="window.location.href = '/7487423'"></span>
        </a>';

preg_match_all("/href=\"(\/)[0-9]/", $source, $results);
var_export(end($results));

Expected output:

/45214
/7487423

Thanks

Use the regex href=\"(\/)[0-9]{4,}; the {4,} quantifier ensures that 4 or more consecutive digits are matched.

See the example at https://regex101.com/r/BlKv9L/1/

$re = '/href=\"(\/)[0-9]{4,}/m';
$str = '    <a href="/45214" title="care with your skin against 

    <a href="/452143232" title="care with your skin against 

    <a href="/214" title="care with your skin against 

    <a href="/543543545214" title="care with your skin against 
    <a href="/45215434" title="care with your skin against 
';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

// Print the entire match result
var_dump($matches);
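
Note that in the pattern above the capture group wraps only the slash, so each full match still begins with href=". If only the path itself is wanted, one possible variant (a sketch, not part of the original answer) moves the group around the slash and the digits and requires the closing quote. The sample string and variable names here are made up for illustration.

```php
<?php
// Hypothetical sample input, modelled on the markup in the question.
$str = '<a href="/45214" title="a"><a href="/214" title="b"><a href="/7487423" title="c">';

// The group now wraps the slash and the digits. Requiring the closing quote
// rejects values such as "/45214/edit" or "/4521x".
preg_match_all('~href="(/[0-9]{4,})"~', $str, $matches);

// $matches[1] holds the captured paths only.
print_r($matches[1]);
```

With this input, "/214" is skipped because it has fewer than 4 digits.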

Another pattern that matches the value of each href attribute inside a tag (\K discards everything matched before it):

<(([^<>"]+"){2})*[^<>"]*href="\K[^"]+
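
As a rough sketch of how this pattern might be used from PHP: it returns every href value found inside a tag, so a second check is still needed to keep only the /dddd form. The sample markup and the variable names are assumptions for illustration.

```php
<?php
// Hypothetical sample: the second href is too short, and the onclick
// handler mentions "href" but is not an href attribute.
$html = '<a href="/45214" title="x"><a class="c" href="/12">'
      . '<span onclick="window.location.href = \'/99999\'">';

// \K drops the tag prefix from the reported match, leaving only the value.
preg_match_all('~<(([^<>"]+"){2})*[^<>"]*href="\K[^"]+~', $html, $m);

// Second pass: keep values that are a slash followed by 4 or more digits.
$paths = [];
foreach ($m[0] as $value) {
    if (preg_match('~^/\d{4,}$~', $value)) {
        $paths[] = $value;
    }
}

print_r($paths);
```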

You can extract all href attribute values with DOMDocument and then check each one against the simple regex '~^/\d{4,}$~':

  • ^ - start of string
  • / - a slash
  • \d{4,} - 4 or more digits
  • $ - end of string.
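
To see what the two anchors buy you, here is a small sketch that runs the same pattern over a few made-up candidate values (the list is hypothetical, not taken from the question):

```php
<?php
// Hypothetical candidate values, as they might come out of href attributes.
$candidates = ['/45214', '/214', '/45214/edit', 'https://wemedic.com/45214', '/7487423'];

// Keep only values that are exactly a slash followed by 4 or more digits.
// preg_grep() preserves the original keys, so array_values() reindexes.
$kept = array_values(preg_grep('~^/\d{4,}$~', $candidates));

print_r($kept);
```

Without ^ and $, values such as "/45214/edit" or a full URL ending in "/45214" would also pass.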

PHP code:

$html = "YOUR_HTML_CODE";
$dom   = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$results = [];
foreach ($xpath->query('//*/@href') as $val) {
    if (preg_match('~^/\d{4,}$~', $val->value)) {
        array_push($results, $val->value);
    }
}
print_r($results);

Output:

Array
(
    [0] => /45214
    [1] => /7487423
)

See the PHP demo

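Although the OP asked for a PHP solution, since this involves HTML you could also use JavaScript with a regex, as follows:

var d = document;
d.g = d.getElementsByTagName;

// Collect every <a> element on the page.
var aTags = d.g("a");

var matches = [];

// A slash followed by 4 or more digits.
var re = /\/\d{4,}/;

for (var i = 0, max = aTags.length; i < max; i++) {
   // .href is the resolved absolute URL; exec() returns a match array, or null.
   matches[i] = re.exec(aTags[i].href);
}

// Clear the page so that only the console output is shown.
d.body.innerHTML = "";
console.log(matches);

The snippet runs against this HTML:
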
</li>
    <li >
        <div class="widget-post-holder">

    <a href="/45214" title="care with your skin against 
pollution" class="post-thumb" >

                <span class="post-cont">
                                    health            </span>
                                <div class="librLoaderLine"></div>
                            <img title="care with your skin against pollution"
                     id="0045214"
                     class="te lazy   js-postPreview"
                     data-src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
            src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
                                         data-libr="https://healthandc.com/media/posts/201105/23/45214/libr_225k_45214.webm"
                                      alt="care with your skin against pollution" />
                                <span class="hd-post" onclick="window.location.href ='/45214'"></span>

                                                </a>
                                                </li>
    <li >
        <div class="widget-post-holder">
            <a href="/7487423" title="natural hair straightening" class="post-thumb" >
                <span class="post-cont">
                                    health            </span>
                                <div class="librLoaderLine"></div>
                            <img title="natural hair straightening"
                     id="0045214"
                     class="te lazy   js-postPreview"
                     data-src="https://wemedic.com/media/posts/201105/23/7487423/original/14.jpg"
                   src="https://wemedic.com/media/posts/201105/23/45214/original/14.jpg"
                                         data-libr="https://healthandc.com/media/posts/201105/23/7487423/libr_225k_7487423.webm"
                                      alt="care with your skin against pollution" />
                                <span class="hd-post" onclick="window.location.href ='/7487423'"></span>

                                                </a>

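Scraper series:
You can use preg_match_all() efficiently with a regex that is safe for parsing markup.
The benefit is that malformed HTML does not cause errors,
and it does not look inside invisible content (comments, etc.).

Runnable PHP code:

http://sandbox.onlinephpfunctions.com/code/a182a6d57e887d44f9040166cf57fbb3486bb183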

<?php
 $string = ' HTML ';

 preg_match_all
    (
        '~(?si)(?:<[\w:]+(?=(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)href\s*=\s*(?:([\'"])\s*(/\d{4,})\s*))\s+(?:".*?"|\'.*?\'|[^>]*?)+>\K|<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>".*?"|\'.*?\'|(?:(?!/>)[^>])?)+)?\s*>).*?</\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:".*?"|\'.*?\'|[^>]?)+\s*/?)|\?.*?\?|(?:!(?:(?:DOCTYPE.*?)|(?:\[CDATA\[.*?\]\])|(?:--.*?--)|(?:ATTLIST.*?)|(?:ENTITY.*?)|(?:ELEMENT.*?))))>(*SKIP)(?!))~',
        $string,
        $matches,
        PREG_PATTERN_ORDER
    );

print_r( $matches[2] );

Output

Array
(
    [0] => /45214
    [1] => /7487423
)

正则表达式解释

 (?si)                         # Modifier, dot-all and ignore case
 (?:
      # What we want to examine, any tag with href attribute
      < [\w:]+ 
      (?=                        # Assertion (a pseudo atomic group)
           (?: [^>"'] | " [^"]* " | ' [^']* ' )*?
           (?<= \s )
           href \s* = \s*                # href attribute
           (?:
                ( ['"] )                      # (1), # quote begin
                \s* 
                (                             # (2 start)
                     / \d{4,}                      # /dddd (slash, 4 or more digits) to be saved
                )                             # (2 end)
                \s* 
                                            # quote end
           )
      )
      \s+ 
      (?: " .*? " | ' .*? ' | [^>]*? )+
      >
      \K                            # Don't store this match, we already have capture group 2 value

   |  
      # OR,
      # Match, but skip these (this just advances the current position)
      <
      (?:
           (?:
                (?:
                     # Invisible content; end tag req'd
                     (                             # (3 start)
                          script
                       |  style
                       |  object
                       |  embed
                       |  applet
                       |  noframes
                       |  noscript
                       |  noembed 
                     )                             # (3 end)
                     (?:
                          \s+ 
                          (?>
                               " .*? "
                            |  ' .*? '
                            |  (?:
                                    (?! /> )
                                    [^>] 
                               )?
                          )+
                     )?
                     \s* >
                )

                .*? </  \s* 
                (?= > )
           )

        |  (?: /? [\w:]+ \s* /? )
        |  (?:
                [\w:]+ 
                \s+ 
                (?:
                     " .*? " 
                  |  ' .*? ' 
                  |  [^>]? 
                )+
                \s* /?
           )
        |  \? .*? \?
        |  (?:
                !
                (?:
                     (?: DOCTYPE .*? )
                  |  (?: \[CDATA\[ .*? \]\] )
                  |  (?: -- .*? -- )
                  |  (?: ATTLIST .*? )
                  |  (?: ENTITY .*? )
                  |  (?: ELEMENT .*? )
                )
           )
      )
      >
      (*SKIP)                      
      (?!)
 )
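
The core idea in the pattern above is the "match it, then skip it" idiom: one alternative consumes markup that should be ignored and ends in (*SKIP)(?!), while the other alternative captures the value. A heavily reduced sketch of that idiom, handling only HTML comments and using a made-up sample string, might look like this ((*FAIL) is equivalent to the (?!) used above):

```php
<?php
// Hypothetical sample: one link sits inside a comment and should be ignored.
$html = '<!-- <a href="/99999">old</a> --><a href="/45214">new</a>';

// First alternative: consume a whole comment, then (*SKIP)(*FAIL) discards it
// and resumes scanning right after it.
// Second alternative: capture a slash followed by 4 or more digits.
preg_match_all('~<!--.*?-->(*SKIP)(*FAIL)|href="(/\d{4,})"~s', $html, $matches);

print_r($matches[1]);
```

This reduced version does not protect against script or style blocks, or against quoted attribute values containing ">", which the full pattern handles.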