PHP version of pseudo after tag

Question

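I'm working on a project that involves tens of thousands of files downloaded from the Internet. The source of the pages (the MO government) did not code the pages very well.

I am extracting certain elements from each page to put into another page, so they are easier to reference on my website. Here is a working example: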

<div id="intsect">
    <strong>Common law in force--effect on statutes.</strong>
</div>


// PHP CODE
// Load Document
    $doc = new DOMDocument();
// Load File Values
    @$doc->loadHTMLFile("stathtml/" . $file);

// Load the <div id="intsect"></div> value
    $genAssem = $doc->getElementById("intsect");
// Appropriate value
    $genAssem = "&nbsp;&nbsp;&nbsp;&nbsp;<b>Statute Name: </b>" . $genAssem->textContent . "<br><br>";

# VALUE (example)
    Statute Name: Common law in force--effect on statutes.

Here is the part that is frustrating me:

<div id="intsect">
    <strong>Common law in force--effect on statutes.</strong>
</div>

<!-- THIS PART -->
<p> 1.035.  Whenever the word "voter" is used in the laws of this state it shall mean registered voter, or legal voter.

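The programmer didn't give it an ID or a class. I need to extract the paragraph tag that comes after #intsect. Is there a PHP selector, or some other way to select the <p></p> tag that follows the #intsect tag?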

Answer 1

You can use XPath to target the <p> tag that has a preceding sibling div whose ID is intsect:

$doc = new DOMDocument;
@$doc->loadHTMLFile("stathtml/" . $file);
$xpath = new DOMXPath($doc);
$p = $xpath->query('//p[preceding-sibling::div[@id="intsect"]]');
if($p->length > 0) {
    echo $p->item(0)->nodeValue;
}
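
The query above matches every <p> that comes after the div as a sibling, and item(0) then picks the first one. As a rough alternative sketch, the query can start at the div and step forward to only its first following <p> sibling. The inline markup, the added second paragraph, and the $statuteText name are assumptions made for illustration; loadHTML() on a string stands in for loadHTMLFile() on the real files.

```php
<?php
// Hypothetical inline markup standing in for one of the downloaded files.
$html = <<<'HTML'
<html><body>
<div id="intsect">
    <strong>Common law in force--effect on statutes.</strong>
</div>
<p> 1.035.  Whenever the word "voter" is used in the laws of this state it shall mean registered voter, or legal voter.</p>
<p>A later paragraph that should not be selected.</p>
</body></html>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Start at the div and step forward to its first following <p> sibling.
$nodes = $xpath->query('//div[@id="intsect"]/following-sibling::p[1]');

// $statuteText is a hypothetical name; it stays null when no paragraph matches.
$statuteText = null;
if ($nodes !== false && $nodes->length > 0) {
    $statuteText = trim($nodes->item(0)->textContent);
}

echo $statuteText ?? 'No paragraph found', PHP_EOL;
```

Both queries only see paragraphs that share a parent with the div. If the parser nests the <p> somewhere else in a malformed page, a looser axis such as following::p[1] may be worth trying.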
