XPath expression for the first sentence in a paragraph

I am looking for an XPath expression that returns the first sentence of a paragraph.

<p>
A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions. The office is unrelated to Robert Mueller and his investigation.
</p>

The result should be:

A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions.

I have tried a few things, to no avail.

$expression = '/html/body/div/div/div/div/p//text()';

Do I need to use something like //p[ends-with or substring-before?

You won't be able to parse natural language with XPath, but you can get the substring up to and including the first period as follows:

substring(/p,1,string-length(substring-before(/p,"."))+1)

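Note that this may not be the "first sentence" if an abbreviation or other token containing a period appears before the end of the first sentence, if the first sentence ends with some other punctuation mark, and so on.

Or, more concisely: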

concat(substring-before(/p, "."), ".")

Credit: a clever idea from ThW in the comments.
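As a quick check, both expressions can be run through PHP's DOMXPath::evaluate(). This is only a minimal sketch: the sample markup is made up for illustration, and the paragraph is the document element so that /p matches it.

```php
<?php
// Hypothetical sample markup; <p> is the document element, so /p selects it.
$xml = '<p>A federal agency is recommending removal. The office is unrelated.</p>';

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Variant 1: substring() up to and including the first period.
$viaSubstring = $xpath->evaluate(
    'substring(/p, 1, string-length(substring-before(/p, ".")) + 1)'
);

// Variant 2: everything before the first period, with the period appended again.
$viaConcat = $xpath->evaluate('concat(substring-before(/p, "."), ".")');

var_dump($viaSubstring);
var_dump($viaSubstring === $viaConcat);
```

The two variants only agree when the text contains a period. Without one, substring-before() returns an empty string, so the first variant yields the first character of the text and the second yields a lone ".".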

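There is no really good way to do this at the XPath level. PHP only has XPath 1.0, which supports just basic string operations. Nothing in it takes the locale/language into account. However, PHP itself has something for this in ext/intl.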

So use DOM + XPath to fetch the text content of the paragraph element node as a string, and extract the first sentence from that.

IntlBreakIterator can split a string according to locale/language-specific rules.

$html = <<<'HTML'
<p>
A federal agency is recommending that White House adviser Kellyanne Conway be 
removed from federal service saying she violated the Hatch Act on numerous 
occasions. The office is unrelated to Robert Mueller and his investigation.
</p>
HTML;

$document = new DOMDocument();
$document->loadXML($html);
$xpath = new DOMXpath($document);

// fetch the first paragraph in the document as string
$summary = $xpath->evaluate('string((//p)[1])');
// create a break iterator for en_US sentences.
$breaker = IntlBreakIterator::createSentenceInstance('en_US');
// remove line breaks before feeding it to the breaker (the wrapped lines already end with a space)
$breaker->setText(str_replace(["\r\n", "\n"], '', $summary));

$firstSentence = '';
// iterate the sentences
foreach ($breaker->getPartsIterator() as $sentence) {
  $firstSentence = $sentence;
  // break after the first sentence
  break;
}

var_dump($firstSentence);

Output:

string(164) "A federal agency is recommending that White House adviser Kellyanne Conway be removed from federal service saying she violated the Hatch Act on numerous occasions. "

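In addition, DOMXPath lets you register PHP functions and call them from XPath expressions. That is an option if you need the logic at the XPath level (for example, to use it inside a condition).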
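A minimal sketch of that approach follows. The helper first_sentence() and the sample markup are hypothetical, and the helper simply cuts at the first period to keep the example short; the IntlBreakIterator logic could be placed inside it instead.

```php
<?php
// Hypothetical helper: returns the text up to and including the first period.
function first_sentence(string $text): string
{
    // Collapse runs of whitespace (including line breaks) into single spaces.
    $text = trim(preg_replace('/\s+/', ' ', $text));
    $position = strpos($text, '.');

    return $position === false ? $text : substr($text, 0, $position + 1);
}

// Made-up sample document with two paragraphs.
$document = new DOMDocument();
$document->loadXML(
    '<div><p>Short one. More text.</p><p>Another paragraph here. And more.</p></div>'
);
$xpath = new DOMXPath($document);

// PHP callbacks are exposed under this namespace and must be registered explicitly.
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPhpFunctions('first_sentence');

// php:functionString() passes the string value of its arguments to the callback.
$first = $xpath->evaluate('php:functionString("first_sentence", (//p)[1])');

// The same call inside a condition: count paragraphs whose first sentence matches.
$matches = $xpath->evaluate(
    'count(//p[php:functionString("first_sentence", .) = "Short one."])'
);

var_dump($first, $matches);
```

Passing the function name to registerPhpFunctions() restricts the callbacks to that one function, which is generally safer than exposing every PHP function to the expression.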