Selecting a URL in the same table row as text in a different <td>
I have a table that I am trying to scrape. It looks like this:
https://i.imgur.com/Hlemt1y.jpg
Here is the HTML for one row of the table:
<TR >
<script>
if (document.getElementById("Function").value != 'Customer')
document.write('<td align="center">CO</td>');</script>
<td align="left"><a href="OrdDetList.pgm?Order=8M216&Purpose=Customer&ShowPrice=&OpenOnly=Y">8M216 </a></td>
<script> if (document.getElementById("Function").value != 'Customer')
document.write('<td align="center">R</td> <td align="center">O</td>');</script>
<td align="center">Backordered</td>
<td align="left"><a href="OrdersList.pgm?Customer=33333&CompDiv=all&Function=Customer&OpenOnly=Y&ShowPrice= &YearsBack= ">70036</a>
<a class=info href="#"><img src="../images/help.gif" border=none>
<span>
<div id="SoldToNameAddress">
123 our address<br>
</div>
</span>
</a>
</td>
<td align="left">our company</td>
<td align="left"><a href="OrdersList.pgm?Customer=33333&CompDiv=all&Function=Customer&OpenOnly=Y&ShowPrice= &YearsBack= ">70037</a>
<a class=info href="#"><img src="../images/help.gif" border=none>
<span>
<div id="ShipToNameAddress">
our address
</div>
</span>
</a>
</td>
<td align="left">our company name</td>
<td align="left">70037</td>
<td align="left">052317</td>
<script>
if (document.getElementById("Function").value != 'Customer')
document.write('<td align="center">3</td>');</script>
<td align="left"><a class=info href="#">17/05/23<span>May 23, 2017 </span></a></td>
<td align="left"><a class=info href="#">17/05/23<span>May 23, 2017 </span></a></td>
<td align="center"></td>
</TR>
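My goal is to select this URL:
<a href="OrdDetList.pgm?Order=8M216&Purpose=Customer&ShowPrice=&OpenOnly=Y">8M216 </a>
when searching for our purchase order number, which appears in this cell: <td align="left">052317</td>
I am quite new to XPath, so the most I have managed so far is to get the URL directly by searching for 8M216. I am not sure how to write an XPath that gives me that URL based on the PO 052317 in another cell of the same table row.
Here is my code so far, though for the reason above it is not very useful: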
<?php
$arrContextOptions = array(
    "ssl" => array(
        "verify_peer" => false,
        "verify_peer_name" => false,
    ),
);
$html = file_get_contents('https://thewebsiteiamscraping', false, stream_context_create($arrContextOptions)); //get the html returned from the following url
$order_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors
if (!empty($html)) { //if any html is actually returned
    $order_doc->loadHTML($html);
    libxml_clear_errors(); //remove errors for yucky html
    $order_xpath = new DOMXPath($order_doc);
    //get order URLS based on our PO#
    $order_row = $order_xpath->query('//a[text()="8M216 "]/@href');
    if ($order_row->length > 0) {
        foreach ($order_row as $row) {
            echo $row->nodeValue . "<br/>";
        }
    }
}
?>
I think this XPath tip will help you find the correct expression.
You can open the console in Chrome and check an XPath by typing $x("your_xpath_here").
This returns an array of the matching nodes. If the array is empty, nothing on the page matches.
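For the question itself, one way to get from the PO cell to the order link is to select the row that contains the PO and then take the link inside that row. The following is a minimal sketch, not something tested against the real site. The helper name findOrderUrls, the trimmed sample markup, and its second row are made up for illustration, and it assumes the order-detail links can be recognised by OrdDetList.pgm in their href.

```php
<?php
// Hypothetical sample: a trimmed copy of the row from the question,
// plus one invented row so the filter has something to exclude.
$html = <<<'HTML'
<table>
<tr>
<td align="left"><a href="OrdDetList.pgm?Order=8M216&amp;Purpose=Customer&amp;ShowPrice=&amp;OpenOnly=Y">8M216 </a></td>
<td align="center">Backordered</td>
<td align="left">70037</td>
<td align="left">052317</td>
</tr>
<tr>
<td align="left"><a href="OrdDetList.pgm?Order=9X999&amp;Purpose=Customer">9X999 </a></td>
<td align="center">Shipped</td>
<td align="left">70037</td>
<td align="left">060117</td>
</tr>
</table>
HTML;

// Hypothetical helper: return the order-detail hrefs found in every row
// that has a cell whose trimmed text equals the given PO.
function findOrderUrls(string $html, string $po): array
{
    // The PO is placed inside an XPath string literal, so only allow
    // plain letters and digits (an assumption about what a PO looks like).
    if (!preg_match('/^[A-Za-z0-9]+$/', $po)) {
        throw new InvalidArgumentException('Unexpected PO format');
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings for messy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // 1. //tr[td[normalize-space(.) = "PO"]]  picks rows with a matching cell
    // 2. /td/a[contains(@href, ...)]/@href    takes the order link in that row
    $query = sprintf(
        '//tr[td[normalize-space(.) = "%s"]]/td/a[contains(@href, "OrdDetList.pgm")]/@href',
        $po
    );

    $urls = [];
    foreach ($xpath->query($query) as $attr) {
        $urls[] = $attr->nodeValue;
    }
    return $urls;
}

foreach (findOrderUrls($html, '052317') as $url) {
    echo $url, PHP_EOL;
}
```

The predicate matches any cell in the row whose text equals the PO, so if another column can hold the same value it may be worth narrowing it to the PO column by position. The same expression can be pasted into $x("...") in the Chrome console to check it against the live page.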