Count Occurrences of Visible Text in HTML Document

Question

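I'm trying to count the occurrences of a string in an HTML document returned by a curl request. I would normally use substr_count for this, but I want to match only text that is visible to the user (the text seen on the page as loaded in a browser), not every match in the source. For example, given the following: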

<p class="example">example</p>

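Searching for the string "example", I want to count one occurrence here, because the class name should be left out of the count. I'm already using DOMXPath to parse other parts of the HTML document, so I also considered using it for this, with: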

$xpath->query("//text()[contains(., 'example')]");

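I've seen others use this to find text in a document, but it seems to count results inside tags as well. Is there a way to count only the text visible to the user? To be clear, "visible to the user" just means the text is not part of metadata, attributes, and so on. If an element is styled to be invisible but would otherwise produce visible text, that text should still be counted. For example: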

<p class="example" style="visibility:hidden">example</p>

should still count as one occurrence, as before.

Edit

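strip_tags handles the instances I showed. Is there a way to handle instances found in scripts and the like? The following should not contribute to the count: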

<script type="text/javascript">var example = 1 ....other stuff....</script>
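
To illustrate the problem, here is a minimal sketch. The combined sample input is made up from the snippets above, and the variable names are only for illustration. strip_tags removes the tags themselves but keeps the text between them, so the script body is still counted:

```php
<?php
// Hypothetical sample input built from the snippets in this question.
$html = '<p class="example">example</p>'
      . '<script type="text/javascript">var example = 1;</script>';

// strip_tags() drops the tags (and their attributes) but keeps the text
// between them, so the body of the <script> element survives.
$text = strip_tags($html);

// Counts the paragraph text and the variable name inside the script.
$count = substr_count($text, 'example');
echo $count, PHP_EOL; // prints 2, although only one occurrence is visible
```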

Answer 1

A simple approach is to remove the script blocks line by line and then strip the tags.

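$str = '<p class="example">example</p>
<p class="example" style="visibility:hidden">example</p>
<script type="text/javascript">var example = 1 
....other stuff....
</script>';

// Split on any line ending; this approach assumes tags start on their own lines.
$arr = preg_split('/\R/', $str);
// Cache the length: unset() shrinks count($arr) and would end the loop early.
$lineCount = count($arr);

for ($i = 0; $i < $lineCount; $i++) {
    if (strpos($arr[$i], "hidden") !== false) {
        // Drop lines that mention "hidden". The question wants hidden text
        // counted, so delete this branch to keep such lines.
        unset($arr[$i]);
    } elseif (strpos($arr[$i], "<script") !== false) {
        // Drop every line of the script block up to the closing tag.
        while ($i < $lineCount && strpos($arr[$i], "</script") === false) {
            unset($arr[$i]);
            $i++;
        }
        unset($arr[$i]); // and remove the last line with "</script"
    }
}
$str = implode("\n", $arr);

echo substr_count(strip_tags($str), "example");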

https://3v4l.org/d4JN5
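
Since the question already uses DOMXPath, another option is to walk the text nodes and skip those inside script or style elements. This is only a sketch under some assumptions: countVisibleText is a hypothetical helper name, the input parses with DOMDocument::loadHTML, and "visible" is taken in the question's sense (not an attribute, not script or style content), so CSS-hidden elements are still counted. Text in other non-rendered elements such as title would also be counted unless you add them to the predicate.

```php
<?php
// Hypothetical helper (not part of the code above): counts occurrences of
// $needle in text nodes that are not inside <script> or <style>.
function countVisibleText(string $html, string $needle): int
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them,
    // since real-world markup is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // text() never matches attribute values; the predicate additionally
    // excludes the contents of script and style elements.
    $nodes = $xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]');

    $count = 0;
    foreach ($nodes as $node) {
        // Count every occurrence within the node, not just matching nodes.
        $count += substr_count($node->nodeValue, $needle);
    }
    return $count;
}

$html = '<p class="example">example</p>
<p class="example" style="visibility:hidden">example</p>
<script type="text/javascript">var example = 1;</script>';

echo countVisibleText($html, 'example'), PHP_EOL; // prints 2
```

Unlike the line-based version, this does not depend on how the markup is split across lines, and it sums occurrences per text node rather than counting matching nodes.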

Tags: html, php, domxpath