PHP str_replace on scraped content with a wildcard?
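I'm looking for a way to remove some HTML from a scraped HTML page. The page contains repetitive data that I want to strip, so I tried using preg_replace() to remove the variable part.
The data I want to strip: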
Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
....
...
Afterwards it should look like this:
Producent:Example
Groep:Example1
Type:Example2
So the chunk is the same every time, except for the words inside the data-title attribute. How can I remove it?
I tried something like this:
$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);
But that didn't work. Is there a way to solve this?
Assuming the string looks like this:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example';
You can get the beginning and the end of the string like this:
preg_match('/^(\w+:).*\>(\w+)/', $string, $matches);
echo implode([$matches[1], $matches[2]]);
In this case, this prints Producent:Example. You can then add that output to whatever other variable or array you intend to use.
Or, since you mentioned replacing:
$string = preg_replace('/^(\w+:).*\>(\w+)/', '$1$2', $string); // keep only the two captured groups
Then again, the data may span a variable number of lines:
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2';
$stringRows = explode(PHP_EOL, $string);
$pattern = '/^(\w+:).*\>(\w+)/';
$replacement = '$1$2'; // keep only the two captured groups
foreach ($stringRows as &$stringRow) {
$stringRow = preg_replace($pattern, $replacement, $stringRow);
}
$string = implode(PHP_EOL, $stringRows);
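$string then holds the text you expect.
Explaining my regex:
The first group captures the first word up to the colon (:), and the other group captures the last word. I had originally anchored both ends, but that doesn't work as expected once the string is split into lines, so I kept only the start anchor.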
^(\w+:) => the word at the beginning of the string, up to and including the colon
.*\> => everything else up to the last greater-than sign (the backslash escape is optional)
(\w+) => the word after the greater-than sign
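As a variation on the loop above, the same pattern could be applied to every line in one call by adding the m (multiline) modifier, which makes ^ match at the start of each line. This is only a sketch: it assumes the lines are separated by "\n" and that each line has a word directly after its last >.

```php
<?php
// Sample input from the question, joined with explicit "\n" line breaks.
$string = 'Producent:<td class="datatable__body__item" data-title="Producent">Example' . "\n"
        . 'Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1' . "\n"
        . 'Type:<td class="datatable__body__item" data-title="Produkt type">Example2';

// The "m" modifier lets ^ anchor at the start of every line, not only the string.
// "." does not match a newline by default, so each match stays within one line.
$result = preg_replace('/^(\w+:).*>(\w+)/m', '$1$2', $string);

echo $result, "\n";
```

This avoids the explode()/implode() round trip, at the cost of relying on "." not crossing line breaks.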
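OK, maybe my question wasn't written very well. I have a table that I need to scrape from a website. I need the information in the table, but some parts have to be cleaned up as described above. The solution I ended up with is this, and it works. It still needs some manual replacements, but that's because of the silly " they use for inches. ;-)
Solution: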
// find the table in the source code
foreach ($techdata->find('table') as $table) {
    // filter out the rows
    foreach ($table->find('tr') as $row) {
        // take the innertext using simplehtmldom
        $tech_specs = $row->innertext;
        // strip some 'garbage'
        $tech_specs = str_replace(" \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">","", $tech_specs);
        // find the first word of the string so I can use it
        $spec1 = explode('</td>', $tech_specs)[0];
        // use the found string to strip down the rest of the table
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">",":", $tech_specs);
        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">",":", $tech_specs);
        // manual correction because of the " used
        $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">",":", $tech_specs);
        // strip some 'garbage'
        $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t","\n", $tech_specs);
        $tech_specs = str_replace("</td>","", $tech_specs);
        $tech_specs = str_replace(" ","", $tech_specs);
        // put the clean row in an array ready for usage
        $specs[] = $tech_specs;
    }
}
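For comparison, here is a rough sketch of reading the cells with PHP's bundled DOM extension instead of string replacement. The markup in $html is hypothetical: it is inferred from the code above (a label cell followed by a value cell in each row) and may not match the real page.

```php
<?php
// Hypothetical markup, inferred from the code above: one label cell and one value cell per row.
$html = '<table>'
      . '<tr><td class="datatable__body__item">Producent</td>'
      . '<td class="datatable__body__item" data-title="Producent">Example</td></tr>'
      . '<tr><td class="datatable__body__item">Groep</td>'
      . '<td class="datatable__body__item" data-title="Produkt groep">Example1</td></tr>'
      . '</table>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$specs = [];
foreach ($xpath->query('//tr') as $row) {
    // Select the td children of the current row.
    $cells = $xpath->query('td', $row);
    if ($cells->length < 2) {
        continue; // skip rows without both a label and a value
    }
    $label = trim($cells->item(0)->textContent);
    $value = trim($cells->item(1)->textContent);
    $specs[] = $label . ':' . $value;
}

print_r($specs);
```

Because this reads the cell text rather than the data-title attribute, the manual corrections for the inch mark might become unnecessary, though that depends on how the parser recovers from the malformed attribute on the real page.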
Just use a wildcard:
$newstr = preg_replace('/<td class="datatable__body__item" data-title=".*?">/', '', $str);
.*? means: match anything, but not greedily.
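The original attempt failed because str_replace() treats its first argument as literal text, so the pattern, delimiters included, is searched for verbatim and never found. A minimal sketch using the sample data from the question (the variable names are only illustrative):

```php
<?php
// Sample input taken from the question.
$str = 'Producent:<td class="datatable__body__item" data-title="Producent">Example' . "\n"
     . 'Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1';

$pattern = '/<td class="datatable__body__item" data-title=".*?">/';

// str_replace() looks for the literal text of $pattern, which does not occur in $str.
$unchanged = str_replace($pattern, '', $str);

// preg_replace() interprets $pattern as a regular expression.
$cleaned = preg_replace($pattern, '', $str);

echo $cleaned, "\n";
```

The non-greedy .*? matters mainly when several cells share one line: a greedy .* would run to the last "> on that line and swallow the cells in between.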