Extracting specific text from HTML texts

Question

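I'm not very familiar with regular expressions. I'm trying to get the result described at the bottom. This is what I have so far (note that $page contains tabs):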

$page = "<div class=\"title-container\">
                            <h1>Text here<span> /Sub-text/</span> </h1>
                                                     </div>";
// TITLE
preg_match_all ('/<h1>(.*)<\/h1>/U', $page, $out);
$hutitle = preg_replace("#<span>(.*)<\/span>\s#", "", $out[1][0]);

$entitle = preg_replace("'(.*)<span> /'", "", $out[1][0]);

I want to get this:

$hutitle = "Text here"; 
$entitle = "Sub-text"; // without HTML and "/"

Answer 1

Try this:

<h1>(.*?)<span> /(.*?)/</span>

Capture groups $1 and $2 hold the results you expect.
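
The pattern above is given without delimiters, so here is a minimal sketch of how it might be used with preg_match in PHP. It assumes `~` as the delimiter (so the literal `/` characters need no escaping) and reuses the `$page`, `$hutitle` and `$entitle` names from the question; the `$m` array is just an illustrative name for the matches.

```php
<?php
// Sample input from the question, with tabs standing in for the indentation.
$page = "<div class=\"title-container\">\n\t<h1>Text here<span> /Sub-text/</span> </h1>\n\t</div>";

// '~' is the delimiter, so the '/' characters inside the pattern stay unescaped.
// Both groups are lazy, so each one stops at the first possible match.
if (preg_match('~<h1>(.*?)<span> /(.*?)/</span>~', $page, $m)) {
    $hutitle = $m[1]; // text before the <span>
    $entitle = $m[2]; // text between the two slashes inside the <span>
    echo $hutitle, "\n", $entitle, "\n";
}
```

With preg_match the groups are available as $m[1] and $m[2]; the $1 and $2 notation applies when the pattern is used in a replacement string.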

Answer 2

I suggest using DOM together with trim, with no regular expressions needed. Here is working code for your specific case:

$page = "<div class=\"title-container\">\n                            <h1>Text here<span> /Sub-text/</span> </h1>\n                                                     </div>";

$dom = new DOMDocument;
$dom->loadHTML($page);
$hs = $dom->getElementsByTagName('h1');
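foreach ($hs as $h) {
    // Look for a <span> inside the current <h1>.
    $enttitlenodes = $h->getElementsByTagName('span');
    if ($enttitlenodes->length > 0 && $enttitlenodes->item(0)->tagName == 'span')
    {
        // Strip the surrounding spaces and slashes from the span text.
        $entitle = trim($enttitlenodes->item(0)->nodeValue, " /");
        echo $entitle . "\n";
        // Remove the span so that only the main title text is left in the <h1>.
        $h->removeChild($enttitlenodes->item(0));
    }
    // trim() drops the whitespace left behind after the removed span.
    $hutitle = trim($h->nodeValue);
    echo $hutitle;
}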

See the IDEONE demo.

php

regex

string

preg-match-all