PHP str_replace scraped content with wild card?

Question

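I am looking for a way to remove some HTML from a scraped HTML page. The page contains repetitive data that I want to delete, so I tried to use preg_replace() to remove the part that varies.

The data I want to strip: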

Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
.... 
...

Afterwards it has to look like this:

Producent:Example
Groep:Example1
Type:Example2

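So a large chunk is identical on every line, except for the words in the data-title attribute. How can I remove this data?

I tried a few things like this: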

$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);

But that did not work. Is there a solution?

Answer 1

Assuming the string looks like this:

$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';

You can get the beginning and the end of the string like this:

preg_match('/^(\w+:).*\>(\w+)/', $string, $matches);

echo implode([$matches[1], $matches[2]]);

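In this case it prints Producent:Example. You can then add this output to whatever other variable or array you intend to use. Or, since you mentioned replacing, put the two captured groups back with $1$2 (an empty replacement would delete the whole match):

$string = preg_replace('/^(\w+:).*\>(\w+)/', '$1$2', $string);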

But then again, considering that it may come in a variable number of lines:

$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2';

$stringRows = explode(PHP_EOL, $string);

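$pattern = '/^(\w+:).*\>(\w+)/';
// Keep only the two captured groups: the "Label:" prefix and the value.
$replacement = '$1$2';
foreach ($stringRows as &$stringRow) {
    $stringRow = preg_replace($pattern, $replacement, $stringRow);
}
unset($stringRow); // break the reference left over from the loop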

$string = implode(PHP_EOL, $stringRows);

$string then holds the output you expect.

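Explaining my regex: the first group captures the first word up to the colon (:), and the other group captures the last word. I had originally anchored both ends, but that does not work as expected once the string is split into lines, so I kept only the start anchor.

^(\w+:) => the word at the beginning of the string, up to and including the colon
.*\>    => everything else up to the last greater-than sign (escaped with a backslash)
(\w+)   => the word after the greater-than sign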
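
The same pattern can also be applied to the whole string in one call instead of splitting it into rows first. This is only a minimal sketch, assuming the rows are separated by "\n" and that the replacement is '$1$2'; the variable name $clean is made up for the illustration.

```php
<?php
// Sample input copied from the question; "\n" separates the rows (an assumption).
$string = "Producent:<td class=\"datatable__body__item\" data-title=\"Producent\">Example\n"
    . "Groep:<td class=\"datatable__body__item\" data-title=\"Produkt groep\">Example1\n"
    . "Type:<td class=\"datatable__body__item\" data-title=\"Produkt type\">Example2";

// The m modifier lets ^ match at the start of every line. Since . does not
// match a newline by default, each row is rewritten on its own.
$clean = preg_replace('/^(\w+:).*>(\w+)/m', '$1$2', $string);

echo $clean, PHP_EOL;
```

Whether this is preferable to the explode()/implode() loop is mostly a matter of taste; the loop is easier to extend if individual rows need special handling.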

Answer 2

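Well, maybe my question was not written very well. I have a table that I need to scrape from a website. I need the information in the table, but some parts have to be cleaned up as described above. The solution I ended up with is this, and it works. It still needs some manual replacement work, but that is because of the silly " they use for inches. ;-)

Solution: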

// find the table in the source code
foreach ($techdata->find('table') as $table) {

    // filter out the rows
    foreach ($table->find('tr') as $row) {

        // take the innertext using simplehtmldom
        $tech_specs = $row->innertext;

        // strip some 'garbage'
        $tech_specs = str_replace("  \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">","", $tech_specs);

        // find the first word of the string so I can use it
        $spec1 = explode('</td>', $tech_specs)[0];

        // use the found string to strip down the rest of the table
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">",":", $tech_specs);

        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">",":", $tech_specs);

        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">",":", $tech_specs);

        // strip some 'garbage'
        $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t","\n", $tech_specs);
        $tech_specs = str_replace("</td>","", $tech_specs);
        $tech_specs = str_replace("  ","", $tech_specs);

        // put the clean row in an array ready for usage
        $specs[] = $tech_specs;
    }
}
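
The manual corrections could probably be avoided by matching the data-title cell with a pattern instead of a fixed string. The following is only a sketch under assumptions: cleanRow() is a hypothetical helper, not part of simplehtmldom, and the sample row markup (including the value "Ja") is made up, modelled on the strings replaced above.

```php
<?php
// Hypothetical helper: turns the inner HTML of one table row into "Label:Value".
function cleanRow(string $rowHtml): string
{
    // Replace the value cell's opening tag, whatever its data-title holds,
    // with the ":" separator. [^>]* runs up to the tag's closing ">".
    $text = preg_replace('/<td class="datatable__body__item" data-title="[^>]*>/', ':', $rowHtml);

    // Drop the remaining tags (the label cell's <td ...> and every </td>).
    $text = strip_tags($text);

    // Split on the first ":" and trim the indentation around both parts.
    // array_pad() keeps this safe for rows without a data-title cell.
    [$label, $value] = array_pad(array_map('trim', explode(':', $text, 2)), 2, '');

    return $label . ':' . $value;
}

// Assumed row markup; the label contains the inch sign that caused trouble.
$row = "  \t\t<td class=\"datatable__body__item\">tbv Montage benodigde 19\"</td>\n"
     . "\t\t<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">Ja</td>";

echo cleanRow($row), PHP_EOL;
```

Inside the loop this would be called as $specs[] = cleanRow($row->innertext);. A label that itself contains a colon would be split in the wrong place, so the real data should be checked first.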

Answer 3

Just use a wildcard:

$newstr = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);

.*? means match anything, but not greedily: it stops at the first "> that follows.
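
To illustrate the difference, here is a small sketch that puts two cells from the question on one line (that layout is an assumption for the demo) and compares str_replace(), the lazy pattern and a greedy variant. The variable names are made up for the illustration.

```php
<?php
// Two cells on one line, built from the question's sample data.
$str = 'Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1 '
     . 'Type:<td class="datatable__body__item" data-title="Produkt type">Example2';

$lazy   = '/<td class="datatable__body__item" data-title=".*?">/';
$greedy = '/<td class="datatable__body__item" data-title=".*">/';

// str_replace() looks for the literal text, slashes included, so nothing changes.
$unchanged = str_replace($lazy, '', $str);

// Lazy: .*? stops at the first following ">, so each tag is removed on its own.
$lazyResult = preg_replace($lazy, '', $str);

// Greedy: .* runs to the last "> on the line and swallows "Example1 Type:" as well.
$greedyResult = preg_replace($greedy, '', $str);

var_dump($unchanged === $str);   // bool(true)
echo $lazyResult, PHP_EOL;       // Groep:Example1 Type:Example2
echo $greedyResult, PHP_EOL;     // Groep:Example2
```

This is also why the attempt in the question had no effect: str_replace() does not interpret its search argument as a regular expression.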


php

preg-replace

simple-html-dom