How to get href and text content from HTML content
I want to get the link URL together with all the other td data.
My code:
$context = stream_context_create(
    array(
        "http" => array(
            "header" => "User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
        )
    )
);
$htmlContent = file_get_contents("https://www.iana.org/domains/root/db", false, $context);
$DOM = new DOMDocument();
$DOM->loadHTML($htmlContent);
$FirstdTable = $DOM->getElementsByTagName('table')->item(0);
$Header = $FirstdTable->getElementsByTagName('th');
$Detail = $FirstdTable->getElementsByTagName('td');
// Get header name of the table
foreach ($Header as $NodeHeader)
{
    $aDataTableHeaderHTML[] = trim($NodeHeader->textContent);
}
// Get row data/detail table without header name as key
$i = 0;
$j = 0;
foreach ($Detail as $sNodeDetail)
{
    $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
    $i = $i + 1;
    $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
}
Current output:
Array
(
[0] => Array
(
[0] => .aaa
[1] => generic
[2] => American Automobile Association, Inc.
)
[1] => Array
(
[0] => .aarp
[1] => generic
[2] => AARP
)
[2] => Array
(
[0] => .abarth
[1] => generic
[2] => Fiat Chrysler Automobiles N.V.
)
)
Desired output:
Array
(
[0] => Array
(
[0] => .aaa
[1] => generic
[2] => American Automobile Association, Inc.
[3] => https://www.iana.org/domains/root/db/aaa.html
)
[1] => Array
(
[0] => .aarp
[1] => generic
[2] => AARP
[3] => https://www.iana.org/domains/root/db/aarp.html
)
[2] => Array
(
[0] => .abarth
[1] => generic
[2] => Fiat Chrysler Automobiles N.V.
[3] => https://www.iana.org/domains/root/db/abarth.html
)
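)
Right now you are only reading the text content of each <td>, which does not include the link inside the anchor tag. To get it, you need to look deeper into the <td>.
Here is one way to do it with XPath: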
$xpath = new DOMXPath($DOM);
$base = 'https://www.iana.org/';
foreach ($Detail as $sNodeDetail)
{
    $aDataTableDetailHTML[$j][] = trim($sNodeDetail->textContent);
    // Query relative to the current <td>; returns '' when the cell has no link
    if ($link = $xpath->evaluate("string(./span[contains(@class, 'domain')]/a/@href)", $sNodeDetail)) {
        // Join with exactly one slash, whether or not the href starts with '/'
        $aDataTableDetailHTML[$j][] = rtrim($base, '/') . '/' . ltrim($link, '/');
    }
    $i = $i + 1;
    $j = $i % count($aDataTableHeaderHTML) == 0 ? $j + 1 : $j;
}
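Basically, if the current <td> in the iteration contains <span class="domain tld"><a href="xxxx">xxx</a></span>, the query extracts just the href value. For any other <td> it returns an empty string, so nothing is appended.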
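To see how the context-relative query behaves on its own, here is a minimal sketch that runs against a small inline snippet instead of the live page. The markup in `$html` is a made-up stand-in for one row of the table, and the root-relative href in it is an assumption about how the page writes its links.

```php
<?php
// Hypothetical inline markup standing in for one row of the real table.
$html = '<table><tr>'
      . '<td><span class="domain tld"><a href="/domains/root/db/aaa.html">.aaa</a></span></td>'
      . '<td>generic</td>'
      . '</tr></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$query = "string(./span[contains(@class, 'domain')]/a/@href)";
$hrefs = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    // The second argument makes the query relative to this <td>.
    // string() of an empty node set is '', so cells without a link yield ''.
    $hrefs[] = $xpath->evaluate($query, $td);
}
var_dump($hrefs);
```

The first cell yields the href and the second yields an empty string, which is why the `if` in the loop above only appends a URL for the cell that actually holds the link.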
Another way is to iterate over each <tr> instead of each <td>:
$aDataTableDetailHTML = [];
$DOM = new DOMDocument();
$DOM->loadHTML($htmlContent);
$xpath = new DOMXPath($DOM);
$base = 'https://www.iana.org/';
foreach ($xpath->query('//table[@id="tld-table"]/tbody/tr') as $row) {
    // Each query is evaluated relative to the current <tr>
    $domain = trim($xpath->evaluate("string(./td[1])", $row));
    $type = trim($xpath->evaluate("string(./td[2])", $row));
    $tld_manager = trim($xpath->evaluate("string(./td[3])", $row));
    $url = $xpath->evaluate("string(./td[1]/span/a/@href)", $row);
    // Join with exactly one slash, whether or not the href starts with '/'
    $aDataTableDetailHTML[] = [$domain, $type, $tld_manager, rtrim($base, '/') . '/' . ltrim($url, '/')];
}
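If you want to try the row-by-row idea without hitting the network, the following is a rough sketch that wraps it in a function and feeds it sample markup. The function name `extractTldRows` and the `$sample` HTML are hypothetical; the sample only imitates the structure described above (a table with id `tld-table` and root-relative hrefs), so adjust the queries if the real page differs.

```php
<?php
/**
 * Hypothetical helper: returns one array per table row,
 * with the absolute link appended as the last element.
 */
function extractTldRows(string $html, string $base): array
{
    $doc = new DOMDocument();
    // Collect parser warnings (e.g. for HTML5 tags) instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    // tr[td] skips header rows and works with or without a <tbody>.
    foreach ($xpath->query('//table[@id="tld-table"]//tr[td]') as $tr) {
        $row = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $row[] = trim($td->textContent);
        }
        $href = $xpath->evaluate('string(./td[1]//a/@href)', $tr);
        // Join base and path with exactly one slash between them.
        $row[] = $href === '' ? '' : rtrim($base, '/') . '/' . ltrim($href, '/');
        $rows[] = $row;
    }
    return $rows;
}

// Sample markup shaped like the rows shown above (not fetched from the site).
$sample = <<<HTML
<table id="tld-table">
  <thead><tr><th>Domain</th><th>Type</th><th>TLD Manager</th></tr></thead>
  <tbody>
    <tr><td><span class="domain tld"><a href="/domains/root/db/aaa.html">.aaa</a></span></td><td>generic</td><td>American Automobile Association, Inc.</td></tr>
    <tr><td><span class="domain tld"><a href="/domains/root/db/aarp.html">.aarp</a></span></td><td>generic</td><td>AARP</td></tr>
  </tbody>
</table>
HTML;

$rows = extractTldRows($sample, 'https://www.iana.org/');
print_r($rows);
```

Collecting the cells in an inner loop means the code does not depend on the number of columns, and trimming slashes on both sides of the join avoids a double slash when the base URL ends in `/` and the href starts with one.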