How to use simplehtmldom to extract data from this page
I am trying to use simplehtmldom to extract information from https://benthamopen.com/browse-by-title/B/1/.
Specifically, I want to access the following part of the page:
<div style="padding:10px;">
<strong>ISSN: </strong>1874-1207<br><div class="sharethis-inline-share-buttons" style="padding-top:10px;" data-url="https://benthamopen.com/TOBEJ/home/" data-title="The Open Biomedical Engineering Journal"></div>
</div>
I have this code:
$html = file_get_html('https://benthamopen.com/browse-by-title/B/1/');
foreach($html->find('div[style=padding:10px;]') as $ele) {
echo("<pre>".print_r($ele,true)."</pre>");
}
... which returns (I am showing only one item from the page):
simplehtmldom\HtmlNode Object
(
[nodetype] => HDOM_TYPE_ELEMENT (1)
[tag] => div
[attributes] => Array
(
[style] => padding:10px;
)
[nodes] => Array
(
[0] => simplehtmldom\HtmlNode Object
(
[nodetype] => HDOM_TYPE_ELEMENT (1)
[tag] => strong
[attributes] => none
[nodes] => none
)
[1] => simplehtmldom\HtmlNode Object
(
[nodetype] => HDOM_TYPE_TEXT (3)
[tag] => text
[attributes] => none
[nodes] => none
)
[2] => simplehtmldom\HtmlNode Object
(
[nodetype] => HDOM_TYPE_ELEMENT (1)
[tag] => br
[attributes] => none
[nodes] => none
)
[3] => simplehtmldom\HtmlNode Object
(
[nodetype] => HDOM_TYPE_ELEMENT (1)
[tag] => div
[attributes] => Array
(
[class] => sharethis-inline-share-buttons
[style] => padding-top:10px;
[data-url] => https://benthamopen.com/TOBEJ/home/
[data-title] => The Open Biomedical Engineering Journal
)
[nodes] => none
)
)
)
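I am not sure how to proceed from here. I want to extract:
- The ISSN text (not shown in the echo statement, and I am not sure why) [1874-1207 in the example above]. It is element zero of [nodes]
- 'data-url' [https://benthamopen.com/TOBEJ/home/ in the example above]
- 'data-title' [The Open Biomedical Engineering Journal in the example above]
It may be that my understanding of PHP objects and arrays is lacking; I do not know why the ISSN does not show up in the echo statement.
I have tried various (many) things, but I am struggling to extract the data from the element.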
I'm not familiar with simplehtmldom and would just as soon avoid it, so here is a solution that uses PHP's built-in DOM classes:
<?php
libxml_use_internal_errors(true);
// get the HTML
$html = file_get_contents("https://benthamopen.com/browse-by-title/B/1/");
// create a DOM object and load it up
$dom = new DomDocument();
$dom->loadHtml($html);
// create an XPath object and query it
$xpath = new DomXPath($dom);
$elements = $xpath->query("//div[@style='padding:10px;']");
// loop through the matches
foreach ($elements as $el) {
// skip elements without ISSN
$text = trim($el->textContent);
if (strpos($text, "ISSN") !== 0) {
continue;
}
// get the first div inside this thing
$div = $el->getElementsByTagName("div")[0];
// dump it out
printf("%s %s %s<br/>\n", str_replace("ISSN: ", "", $text), $div->getAttribute("data-title"), $div->getAttribute("data-url"));
}
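The XPath stuff can be a bit overwhelming, but for a simple search like this it isn't much different from a CSS selector. Hopefully the comments explain everything; if not, let me know!
Output: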
1874-1207 The Open Biomedical Engineering Journal https://benthamopen.com/TOBEJ/home/<br/>
1874-1967 The Open Biology Journal https://benthamopen.com/TOBIOJ/home/<br/>
1874-091X The Open Biochemistry Journal https://benthamopen.com/TOBIOCJ/home/<br/>
1875-0362 The Open Bioinformatics Journal https://benthamopen.com/TOBIOIJ/home/<br/>
1875-3183 The Open Biomarkers Journal https://benthamopen.com/TOBIOMJ/home/<br/>
2665-9956 The Open Biomaterials Science Journal https://benthamopen.com/TOBMSJ/home/<br/>
1874-0707 The Open Biotechnology Journal https://benthamopen.com/TOBIOTJ/home/<br/>
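If you want to try the same idea without hitting the site, here is a minimal sketch that runs against the HTML fragment quoted in the question. It assumes the live page still has that structure. The helper name extractJournals() and the array keys are made up for illustration and are not part of any library. Instead of stripping "ISSN: " from the whole text content, it asks XPath for the first text node after the <strong> label.

```php
<?php
// Sample markup copied from the question, so the sketch runs without network access.
$html = <<<'HTML'
<div style="padding:10px;">
<strong>ISSN: </strong>1874-1207<br><div class="sharethis-inline-share-buttons" style="padding-top:10px;" data-url="https://benthamopen.com/TOBEJ/home/" data-title="The Open Biomedical Engineering Journal"></div>
</div>
HTML;

/**
 * extractJournals() is a hypothetical helper name, not a library function.
 * Returns a list of ['issn' => ..., 'title' => ..., 'url' => ...] arrays.
 */
function extractJournals(string $html): array
{
    libxml_use_internal_errors(true); // keep parser warnings about real-world HTML quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $journals = [];
    // Only match the padded divs that have a <strong> child (the ISSN label).
    foreach ($xpath->query("//div[@style='padding:10px;'][strong]") as $el) {
        // The ISSN is the first text node that follows the <strong> label.
        $issn = $xpath->evaluate("string(strong/following-sibling::text()[1])", $el);
        // The share-button div carries the title and URL as data-* attributes.
        $share = $xpath->query("div[@data-url]", $el)->item(0);
        if ($share === null) {
            continue; // no share-button div, so nothing to report for this block
        }
        $journals[] = [
            'issn'  => trim($issn),
            'title' => $share->getAttribute('data-title'),
            'url'   => $share->getAttribute('data-url'),
        ];
    }
    return $journals;
}

$journals = extractJournals($html);
foreach ($journals as $j) {
    printf("%s %s %s\n", $j['issn'], $j['title'], $j['url']);
}
```

To run it against the real page, pass the result of file_get_contents() for the URL in the question instead of the sample string. Returning arrays rather than printing inside the loop makes it easier to store or reformat the results later.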