Extracting the proper value from a webpage with Goutte
I have installed Goutte in my Laravel 5.7 application, and I am trying to scrape the COAL, GAS, HYDRO, and WIND values (TNG column) from this page:
http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet
Route::get('hdtuto', function () {
    $crawler = Goutte::request('GET', 'http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet');
    $aeso_data = $crawler->filter('TABLE > TR > TD');
    dd($aeso_data);
});
I was hoping I could traverse the nodes with something like this:
$crawler->filter('body > p')->eq(0);
as described in this guide:
https://symfony.com/doc/current/components/dom_crawler.html?any#node-traversing
so that I could eventually do something like this:
$coal = $crawler->filter('TABLE > TR > TD')->eq(15);
$gas = $crawler->filter('TABLE > TR > TD')->eq(20);
$hydro = $crawler->filter('TABLE > TR > TD')->eq(25);
$wind = $crawler->filter('TABLE > TR > TD')->eq(30);
Here is a sample of what I currently get:
Crawler {#471 ▼
#uri: "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet"
-defaultNamespacePrefix: "default"
-namespaces: []
-baseHref: "http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet"
-document: DOMDocument {#407 ▶}
-nodes: array:598 [▼
0 => DOMElement {#469 ▼
+nodeName: "td"
+nodeValue: DOMImplementation {#430 ▶}
+nodeType: DOMDocumentType {#443 …}
+parentNode: DOMElement {#422}
+childNodes: DOMNodeList {#441 …1}
+firstChild: DOMElement {#442}
+lastChild: DOMElement {#442}
+previousSibling: DOMNodeList {#432 ▶}
+nextSibling: DOMText {#451}
+attributes: DOMNamedNodeMap {#452 …1}
+ownerDocument: DOMDocument {#407 ▶}
+namespaceURI: null
+prefix: ""
+localName: "td"
+baseURI: null
+textContent: ""
+tagName: "td"
+schemaTypeInfo: null
}
1 => DOMElement {#468 ▼
+nodeName: "td"
+nodeValue: ""
+nodeType: XML_ELEMENT_NODE
+parentNode: DOMElement {#1070}
+childNodes: DOMNodeList {#1071 …1}
+firstChild: DOMText {#1073}
+lastChild: DOMText {#1073}
+previousSibling: null
+nextSibling: null
+attributes: DOMNamedNodeMap {#1077 …1}
+ownerDocument: DOMDocument {#407 ▶}
+namespaceURI: null
+prefix: ""
+localName: "td"
+baseURI: null
+textContent: ""
+tagName: "td"
+schemaTypeInfo: null
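The dump lists `DOMElement` nodes rather than their text, so picking a cell by position is only half the job: the text still has to be read out of the node. Here is a minimal sketch of that idea using PHP's built-in DOM extension instead of Goutte, run against a made-up table fragment. The markup, the numbers, and the `cellTexts` helper are assumptions for illustration, not the real AESO page.

```php
<?php
// Hypothetical fragment standing in for the report page; the real markup may differ.
$html = '<table>'
      . '<tr><td>COAL</td><td>5723</td><td>3512</td><td>70</td></tr>'
      . '<tr><td>GAS</td><td>7348</td><td>5410</td><td>112</td></tr>'
      . '</table>';

// Collect the trimmed text of every <td> in document order.
function cellTexts(string $html): array
{
    $doc = new DOMDocument();
    // Suppress the warnings that non-valid, real-world HTML tends to trigger.
    @$doc->loadHTML($html);

    $texts = [];
    foreach ($doc->getElementsByTagName('td') as $td) {
        $texts[] = trim($td->textContent);
    }
    return $texts;
}

$cells = cellTexts($html);
echo $cells[2], PHP_EOL; // third cell: the TNG value of the first row in this fragment
```

With Goutte, the equivalent step is to call `->text()` on the crawler that `eq()` returns, instead of dumping the crawler itself.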
I ended up using this:
Route::get('scrapertest', function() {
    $crawler = Goutte::request('GET', 'http://ets.aeso.ca/ets_web/ip/Market/Reports/CSDReportServlet');
    $crawler2 = Goutte::request('GET', 'http://ets.aeso.ca/ets_web/ip/Market/Reports/DailyAveragePoolPriceReportServlet');
    $values = $crawler->filter('tr > td')->each(function ($node) {
        return $node->text();
    });
    // dd($values);
    $values2 = $crawler2->filter('tr > td')->each(function ($node) {
        return $node->text();
    });
    // dd($values2);
    $total = $values[11];
    $internal_load = $values[15];
    $net_to_grid = $values[17];
    $coal = $values[36];
    $gas = $values[40];
    $hydro = $values[44];
    $wind = $values[52];
    echo 'Scraper test <br>';
    echo 'Alberta Total Net Generation '.$total.'<br>';
    echo 'Alberta Internal Load '.$internal_load.'<br>';
    echo 'Net-To-Grid Generation '.$net_to_grid.'<br>';
    echo 'Coal '.$coal.'<br>';
    echo 'Gas '.$gas.'<br>';
    echo 'Hydro '.$hydro.'<br>';
    echo 'Wind '.$wind.'<br>';
});
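Fixed indices such as `$values[36]` stop pointing at the right cell as soon as the page adds or removes one. One possible alternative, sketched here, is to search the flat list of cell texts for the row label and read the cell a fixed distance after it. The `valueAfterLabel` helper, the default offset of 2 (label, MC, TNG), and the sample array are assumptions for illustration; check them against the real `dd($values)` output before relying on them.

```php
<?php
// Hypothetical helper: find $label in the flat list of cell texts and return
// the cell $offset positions after it, or null if the label or cell is missing.
function valueAfterLabel(array $values, string $label, int $offset = 2): ?string
{
    $values = array_values($values); // make sure keys are 0..n-1

    foreach ($values as $i => $text) {
        // Case-insensitive match on the trimmed cell text.
        if (strcasecmp(trim($text), $label) === 0) {
            return isset($values[$i + $offset]) ? trim($values[$i + $offset]) : null;
        }
    }
    return null;
}

// Made-up sample data shaped like the output of ->each(...->text()).
$values = [
    'GROUP', 'MC', 'TNG', 'DCR',
    'COAL', '5723', '3512', '70',
    'GAS', '7348', '5410', '112',
    'WIND', '1445', '402', '0',
];

echo 'Coal ' . valueAfterLabel($values, 'COAL') . PHP_EOL;
echo 'Wind ' . valueAfterLabel($values, 'WIND') . PHP_EOL;
```

Returning `null` for a missing label makes a layout change visible instead of silently printing the wrong cell.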