How to scrape HTML tags from a div using PHP Simple HTML DOM or cURL
Here is an example of what I want to do:
<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>
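From the markup above, I want to scrape both the tags and their text into arrays.
As a result I want one array containing the tag names:
arr = [h1, p, h2];
and another array containing the text:
arr2 = [This is a h1, This is a Paragraph, This is h2]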
Try this:
$str = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";
$arr = explode(PHP_EOL, $str);
$res = array();
foreach ($arr as $row) {
    // Skip the opening and closing <div> lines.
    if (strpos($row, "div") === false) {
        // Key: the tag name between "<" and ">"; value: the text without tags.
        $res[substr($row, 1, strpos($row, ">") - 1)] = strip_tags($row);
    }
}
var_dump($res);
It loops over the string one line at a time and builds an array keyed by tag name.
Edit: if there are multiple rooms, you can make the result multidimensional like this:
https://3v4l.org/DdXVd
$str = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>
<div class='room2'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";
$arr = explode(PHP_EOL, $str);
$res = array();
foreach ($arr as $row) {
    if (strpos($row, "div") !== false) {
        // Opening <div>: take the class name between the single quotes as the room key.
        $pos1 = strpos($row, "'") + 1;
        $room = substr($row, $pos1, strpos($row, "'", $pos1) - $pos1);
    } else {
        // Child line: the tag name is the text between "<" and the first ">".
        $pos1 = strpos($row, "<") + 1;
        $res[$room][substr($row, $pos1, strpos($row, ">") - $pos1)] = trim(strip_tags($row));
    }
}
var_dump($res);
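The question asks for two separate arrays rather than one keyed array. The following is a minimal sketch of the same line-by-line idea that fills `$arr` and `$arr2` directly. It assumes one element per line and that no text line contains the string "div"; splitting on `\R` instead of `PHP_EOL` is a swapped-in choice so that mixed line endings do not matter.

```php
<?php
// Sample input with mixed line endings (an assumption for illustration).
$str = "<div class='room'>\r\n<h1>This is a h1</h1>\n<p>This is a Paragraph</p>\n<h2>This is h2</h2>\n</div>";

$arr = [];   // tag names
$arr2 = [];  // text content

// \R matches any line-break sequence (\n, \r\n or \r).
foreach (preg_split('/\R/', $str) as $row) {
    $row = trim($row);
    // Skip blank lines and the wrapping <div> / </div> lines.
    if ($row === '' || strpos($row, 'div') !== false) {
        continue;
    }
    // The tag name sits between the leading "<" and the first ">".
    $arr[] = substr($row, 1, strpos($row, '>') - 1);
    $arr2[] = strip_tags($row);
}

print_r($arr);
print_r($arr2);
```

Unlike the keyed array, this keeps repeated tags (for example two `<p>` lines) instead of overwriting the earlier entry.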
Assuming the elements are known in advance, you can use DOMDocument's getElementsByTagName like this:
$html = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";
$doc = new DOMDocument();
$doc->loadHTML($html);
$elements = array();
$content = array();
function iterate_elements($array, $doc){
global $elements, $content;
foreach($array as $element){
$the_element = $doc->getElementsByTagName($element);
foreach($the_element as $target){
$content[] = $target->textContent;
//$target->tagName;
}
if(!empty($the_element->length)) {
$elements[] = $element;
}
}
}
iterate_elements(array('h1','p', 'h2'), $doc);
print_r($elements);
print_r($content);
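If the tag names are not known in advance, a possible variation is to query the children of the div with DOMXPath and read each node's name. This is only a sketch: it assumes the div's class attribute is exactly `room`, and the variable names `$tags` and `$texts` are illustrative, not from the answer above.

```php
<?php
$html = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";

$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$tags = [];
$texts = [];

// Select every element that is a direct child of <div class="room">.
foreach ($xpath->query("//div[@class='room']/*") as $node) {
    $tags[] = $node->nodeName;
    $texts[] = trim($node->textContent);
}

print_r($tags);
print_r($texts);
```

Because the nodes come back in document order, the two arrays stay aligned index by index.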
Try the code below.
$html = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>";
$dom = new SimpleXMLElement( $html );
$values = array_filter( array_values( (array) $dom ), function ( $i ) { return ! is_array( $i ); } );
$keys = array_filter( array_keys( (array) $dom ), function ( $i ) { return $i != '@attributes'; } );
print_r( $values ); // This is a h1, This is a Paragraph, This is h2
print_r( $keys ); // h1, p, h2
I use array_filter to drop the div's own '@attributes' entry from the results.
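One caveat with the array cast: when a tag appears more than once, the cast groups the repeated tags into a nested array, which the `is_array` filter then removes. A sketch that avoids this by iterating `children()` instead is shown below. The second `<p>` is added here purely for illustration, and the input must be well-formed XML for SimpleXMLElement to parse it.

```php
<?php
$html = "<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<p>Another paragraph</p>
<h2>This is h2</h2>
</div>";

$dom = new SimpleXMLElement($html);
$keys = [];
$values = [];

// children() yields each child element in document order, repeats included.
foreach ($dom->children() as $child) {
    $keys[] = $child->getName();
    $values[] = trim((string) $child);
}

print_r($keys);
print_r($values);
```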
$str = <<<EOF
<div class='room'>
<h1>This is a h1</h1>
<p>This is a Paragraph</p>
<h2>This is h2</h2>
</div>
EOF;
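// str_get_html() is provided by the PHP Simple HTML DOM Parser (simple_html_dom.php).
$html = str_get_html($str);
$arr = array();
$arr2 = array();
// '.room *' matches every element inside the element with class "room".
foreach($html->find('.room *') as $el){
$arr[] = $el->tag;
$arr2[] = $el->text();
}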