How to format plaintext in PHP Simple HTML DOM Parser?

Question

I'm trying to extract the content of a web page as plain text, without HTML tags. Here's some sample code:

$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;

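The problem is that what I get in $result['body'] is very messy. The HTML is indeed stripped, but sentences often run into one another, because there is no space or period separating where the text from one HTML tag ends and the text from the following tag begins.

An example: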

<body>
    <div class="H2">Header</div>
    <div class="P">this is a paragraph</div>
    <div class="P">this is another paragraph</div>
</body>

Result:

"Headerthis is a paragraphthis is another paragraph"

Desired result:

"Header. this is a paragraph. this is another paragraph"

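Is there any way to format the result of plaintext, or to apply additional manipulation to the inner text before using plaintext, so that sentences get clear delimiters?

Edit:

I'm thinking of doing something like this: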

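// Initialise the key first so that .= does not raise an "undefined index" notice.
$result['body'] = '';
foreach ($dom->find('div') as $element) {
    $text = $element->plaintext;
    $result['body'] .= $text . '. ';
}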

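But there is a problem when divs are nested: it would add the parent's content, which includes the text of all its children, and then add the children's content, effectively duplicating the text. This could probably be solved simply by checking whether the element's content contains a </div> (in its innertext, since $text taken from plaintext no longer contains any tags).

Maybe I should try callbacks.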

Answer 1

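Try this code:

// $dom is the parsed document from the question.
$result = array();
foreach ($dom->find('div') as $e) {
    // Collect the plain text of each <div> as a separate array entry.
    $result[] = $e->plaintext;
}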
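
The array collected above still has to be turned into the single delimited string the question asks for. A minimal sketch of that step, assuming the fragments may carry stray whitespace or be empty; joinFragments() is a hypothetical helper name, and the hard-coded array only stands in for the plaintext values:

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): joins text fragments
// into one string, skipping empty ones and normalising whitespace.
function joinFragments(array $fragments, string $separator = '. '): string
{
    $clean = [];
    foreach ($fragments as $fragment) {
        // Collapse runs of whitespace into one space and trim both ends.
        $text = trim(preg_replace('/\s+/', ' ', (string) $fragment));
        if ($text !== '') {
            $clean[] = $text;
        }
    }
    return implode($separator, $clean);
}

// Stand-in for the values collected from $e->plaintext.
$result = ['Header', ' this is a paragraph ', '', 'this is another paragraph'];
echo joinFragments($result), PHP_EOL;
// prints "Header. this is a paragraph. this is another paragraph"
```

Skipping empty fragments avoids output such as "Header. . this is a paragraph" when a div contains no text.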

Answer 2

Maybe something like this? Tested.

<?php
require_once 'vendor/autoload.php';

$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");

$result['body'] = implode('. ', array_map(function($element) {
    return $element->plaintext;
}, $dom->find('div')));

echo $result['body'];

Contents of index.html:

<body>
    <div class="H2">Header</div>
    <div class="P">this is a paragraph</div>
    <div class="P">this is another paragraph</div>
</body>
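
Neither answer addresses the nested-div duplication raised in the question's edit. One possible way around it is to read only the divs that contain no further div. The sketch below swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Simple HTML DOM, so it is an alternative technique, not the library's own API. extractLeafDivText() is a hypothetical name, and the wrapper div in the sample markup is added here purely to show nesting.

```php
<?php
// Sketch using PHP's bundled DOM extension, NOT Simple HTML DOM.
// extractLeafDivText() is a hypothetical helper name.
function extractLeafDivText(string $html, string $separator = '. '): string
{
    $doc = new DOMDocument();
    // Keep libxml warnings about imperfect real-world markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $parts = [];
    // Select only <div> elements with no nested <div>, so a parent's text
    // is never added on top of its children's text.
    foreach ($xpath->query('//div[not(.//div)]') as $div) {
        $text = trim(preg_replace('/\s+/', ' ', $div->textContent));
        if ($text !== '') {
            $parts[] = $text;
        }
    }
    return implode($separator, $parts);
}

// Sample markup; the "wrapper" div is an assumption added to show nesting.
$html = '<body>
    <div class="H2">Header</div>
    <div class="wrapper">
        <div class="P">this is a paragraph</div>
        <div class="P">this is another paragraph</div>
    </div>
</body>';

echo extractLeafDivText($html), PHP_EOL;
// prints "Header. this is a paragraph. this is another paragraph"
```

One trade-off of this approach: text that sits directly inside a parent div, next to its child divs, is skipped. Pages with non-ASCII content may also need an explicit charset declaration for loadHTML to decode them correctly.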


Tags: html, php, simple-html-dom, web-scraping