How do you access an HTML node using DOMDocument while retaining the inner HTML formatting?
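I am trying to read the contents of a spreadsheet cell from Google Docs using DOMDocument in PHP.
I can reach the node, but its content comes back as plain text with the HTML formatting stripped.
Here is the sample link I am working with; it contains bold, italic, and underlined text.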
https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml
Below is the PHP code I am using:
$url = "https://docs.google.com/spreadsheets/d/1Klpic32Gb_TDblDZDJQOkDedFGuNHAokxUXqrCPDFWE/pubhtml";

// Download the published sheet as an HTML string.
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$htmlData = curl_exec($curl);
curl_close($curl);

// Parse the HTML; configure the document before loading it.
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($htmlData);

$tables = $dom->getElementsByTagName('table');
$rows = $tables->item(0)->getElementsByTagName('tr');

// The second <tr> holds the column headers.
$cols = $rows->item(1)->getElementsByTagName('td');
$rowHeaders = array();
foreach ($cols as $node) {
    $rowHeaders[] = $node->textContent;
}

// Build one associative array per row, keyed by column header.
$table = array();
foreach ($rows as $i => $tr) {
    if ($i == 0) continue;
    $cols = $tr->getElementsByTagName('td');
    $row = array();
    foreach ($cols as $j => $node) {
        $row[$rowHeaders[$j]] = $node->textContent; // plain text only
    }
    $table[] = $row;
}
print_r($table);
exit;
My output is missing the inner HTML formatting:
[1] => Array
(
[Variable] => html_body
[Data] => Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
)
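Don't use textContent: it returns only the concatenated text of the node and its descendants, so the tags are dropped. Serialize each child node instead. Try this: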
foreach ($cols as $j => $node) {
    // $row[$rowHeaders[$j]] = $node->textContent;
    // Serialize every child node so that inline tags such as <b> are kept.
    $innerHTML = '';
    foreach ($node->childNodes as $child) {
        $innerHTML .= $child->ownerDocument->saveXML($child);
    }
    $row[$rowHeaders[$j]] = $innerHTML;
}
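If you want to try the idea without hitting the network, here is a minimal sketch. The helper name innerHTML() and the sample markup are my own assumptions, not part of your code or of the DOM extension. It also uses saveHTML($child) rather than saveXML($child); both accept a node, but saveHTML writes void elements the HTML way (<br> instead of <br/>), which may or may not matter for your use case.

```php
<?php
// Hypothetical helper (name chosen for illustration): returns the
// serialized markup of all children of $node, i.e. its "inner HTML".
function innerHTML(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes just that node.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Assumed sample markup standing in for one row of the published sheet.
$sample = '<table><tr><td>html_body</td>'
        . '<td>Lorem <b>ipsum</b> <i>dolor</i><br>sit amet</td></tr></table>';

$dom = new DOMDocument();
$dom->loadHTML($sample);

$cells  = $dom->getElementsByTagName('td');
$plain  = $cells->item(1)->textContent; // text only, tags are dropped
$markup = innerHTML($cells->item(1));   // tags are kept

echo $plain, PHP_EOL;
echo $markup, PHP_EOL;
```

In your loop you would then write $row[$rowHeaders[$j]] = innerHTML($node); in place of the textContent assignment.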