Break out an html-element from within a table-element

Question

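I'm having trouble finding the right way to break the H4 tag out of the following code. I need to keep it in the document, and I also need to remove the table it currently sits in.

So how do I remove the whole table while keeping the h4 tag?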

<table align="center" border="0" cellpadding="0" cellspacing="0">
<tr><td height="30" align="center" colspan="5"><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
  <tr> 
    <td><a href="index.html" target="_top" onclick="MM_nbGroup('down','group1','contents','',1)" onmouseover="MM_nbGroup('over','contents','../figs/contents1.gif','',1)" onmouseout="MM_nbGroup('out')"><img name="contents" src="../figs/contents.gif" border="0" alt="" onload=""></a></td>
    <td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','authorindex','',1)" onmouseover="MM_nbGroup('over','authorindex','../figs/iauthori1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/iauthori.gif" alt="" name="authorindex" width="120" height="20" border="0" onload=""></a></td>
    <td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','subjindex','',1)" onmouseover="MM_nbGroup('over','subjindex','../figs/isubji1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/isubji.gif" alt="" name="subjindex" width="120" height="20" border="0" onload=""></a></td>
    <td><a href="../search.html" target="_top" onclick="MM_nbGroup('down','group1','search','',1)" onmouseover="MM_nbGroup('over','search','../figs/isearch1.gif','',1)" onmouseout="MM_nbGroup('out')"><img src="../figs/isearch.gif" alt="" name="search" width="120" height="20" border="0" onload=""></a></td>
    <td><a href="../page.html" target="_top" onclick="MM_nbGroup('down','group1','home','',1)" onmouseover="MM_nbGroup('over','home','../figs/ihome1.gif','',1)" onmouseout="MM_nbGroup('out')"><img name="home" src="../figs/ihome.gif" border="0" alt="" onload=""></a></td>
  </tr>
</table>

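In addition, I have about 2500 HTML documents that follow a similar structure but were written in different versions of HTML, so they use divs, tables, or other elements depending on the version. I therefore need an approach that can be adapted to each variant.

I already have the document loading in place: it loads all the files into a list, so I will provide a method that opens and parses this list of file names. What I don't know is how to use XPath for this.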

Answer 1

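One way to solve this is to find all <h4> nodes, walk up each node's parent chain until you reach a stop tag/node, and then replace that stop tag/node with your <h4>:

Given some sample HTML residing in an HTML file: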

var html =
@"<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<table align='center' border='0' cellpadding='0' cellspacing='0'>
<tr><td height='30' align='center' colspan='5'><h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4></td></tr>
  <tr> 
    <td><a href='index.html'><img name='contents' src='../figs/contents.gif' border='0' alt='' onload=''></a></td>
    <td><a href='../page.html'><img src='../figs/iauthori.gif' alt='' name='authorindex' width='120' height='20' border='0' onload=''></a></td>
    <td><a href='../page.html'><img src='../figs/isubji.gif' alt='' name='subjindex' width='120' height='20' border='0' onload=''></a></td>
    <td><a href='../search.html'><img src='../figs/isearch.gif' alt='' name='search' width='120' height='20' border='0' onload=''></a></td>
    <td><a href='../page.html'><img name='home' src='../figs/ihome.gif' border='0' alt='' onload=''></a></td>
  </tr>
</table>

<div>
<h4>H4 nested in DIV</h4>
<p>Paragraph <strong>bold</strong> <a href=''>Hyperlink</a></p>
</div>

<p><h4>H4 nested in P</h4></p>

</body>
</html>";

Parse it with this method:

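// needed at the top of the file
using System.Linq;
using HtmlAgilityPack;

public string ParseHtmlToString(string inputFilePath)
{
    var document = new HtmlDocument();
    document.Load(inputFilePath);
    // SelectNodes returns null (not an empty collection) when nothing matches
    var wantedNodes = document.DocumentNode.SelectNodes("//h4");
    if (wantedNodes == null) return document.DocumentNode.WriteTo();
    // stop at these tags while walking backwards up the chain
    var stopTags = new string[] { "table", "div", "p" };
    HtmlNode parentNode;

    foreach (var node in wantedNodes)
    {
        HtmlNode testNode = node;
        while ((parentNode = testNode.ParentNode) != null)
        {
            // skip wrappers that an earlier replacement already detached
            if (stopTags.Contains(parentNode.Name) && parentNode.ParentNode != null)
            {
                // swap the whole wrapper (table/div/p) for the <h4>, then stop climbing
                parentNode.ParentNode.ReplaceChild(node, parentNode);
                break;
            }
            testNode = parentNode;
        }
    }

    return document.DocumentNode.WriteTo();
}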

You can then assign the parsed HTML to a variable like this:

var parsedHtml = ParseHtmlToString(INPUT_FILE);

which returns the following:

<!doctype html system 'html.dtd'>
<html><head></head>
<body>
<h4>IMPORTANT HEADLINE ABOUT THIS PARTICULAR PAGE</h4>

<h4>H4 nested in DIV</h4>

<h4>H4 nested in P</h4>

</body>
</html>
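Since the question asks about XPath specifically, the same idea can also be sketched with the XPath ancestor axis instead of a manual parent walk. This is only a minimal sketch under some assumptions: H4Lifter and LiftHeadings are hypothetical names, the markup is passed in as a string via LoadHtml rather than loaded from a file, and the stop tags are the same table/div/p set used above.

```csharp
using System;
using HtmlAgilityPack;

public static class H4Lifter
{
    // Hypothetical helper: lifts every <h4> out of its nearest
    // table/div/p ancestor and returns the rewritten markup.
    public static string LiftHeadings(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var headings = document.DocumentNode.SelectNodes("//h4");
        if (headings == null) return html;

        foreach (var heading in headings)
        {
            // ancestor:: is a reverse axis, so [1] picks the closest match.
            var wrapper = heading.SelectSingleNode(
                "ancestor::*[self::table or self::div or self::p][1]");

            // Skip headings without a wrapper, or whose wrapper was
            // already detached by an earlier replacement.
            if (wrapper == null || wrapper.ParentNode == null) continue;

            wrapper.ParentNode.ReplaceChild(heading, wrapper);
        }

        return document.DocumentNode.WriteTo();
    }

    public static void Main()
    {
        var html = "<body><table><tr><td><h4>Title</h4></td></tr></table></body>";
        Console.WriteLine(LiftHeadings(html));
    }
}
```

To support another wrapper element in some of the 2500 documents, adding one more `or self::...` test to the predicate should be enough.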

Answer 2

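Here is an alternative solution that works for all the documents where Kuujinbo's solution fails. I run the two side by side in a try/catch/finally construct. It worked well across all 2500 HTML documents.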

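var doc = new HtmlDocument();
doc.Load(file);
var htmlBody = doc.DocumentNode.SelectSingleNode("//body");
// the header table is expected to be the first table directly under <body>
var headerTables = doc.DocumentNode.SelectSingleNode("//body/table[1]");
// find the headline by a text fragment these documents share
var headerNode = doc.DocumentNode.SelectSingleNode("//h4[contains(text(),'Information Research, Vol')]");
// put the <h4> where the table was, then overwrite the file in place
htmlBody.ReplaceChild(headerNode, headerTables);
headerTables.Remove();
doc.Save(file);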

It is basically run like this:

try { ParseHtmlToString(file); }
catch (Exception ex) { Console.WriteLine(file + ":" + ex.Message); }
finally { myAlternateSolution(file); } // runs whether or not the first attempt threw

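It works because, most of the time, the table is the first node after body and is also the first table in the document. Some of the documents contain malformed HTML that could not be repaired with HTMLTidy and similar tools, so some manual editing was still necessary.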
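Because that assumption about the first table does not hold for every file, a guarded variant that reports failure instead of throwing may help to build the list of files that need manual editing. The following is a rough sketch, not the code actually used: HeaderTableFallback and TryReplaceFirstTable are hypothetical names, and the headline is located with LINQ over Descendants rather than with the contains() XPath shown above.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HeaderTableFallback
{
    // Hypothetical helper, not part of the original answer.
    // Returns false instead of throwing when an expected node is missing.
    public static bool TryReplaceFirstTable(HtmlDocument doc, string headlineFragment)
    {
        var table = doc.DocumentNode.SelectSingleNode("//body/table[1]");
        var heading = doc.DocumentNode
            .Descendants("h4")
            .FirstOrDefault(h => h.InnerText.Contains(headlineFragment));

        if (table == null || heading == null || table.ParentNode == null)
            return false;

        // Detach the heading from its current parent first,
        // then put it in the table's place.
        heading.Remove();
        table.ParentNode.ReplaceChild(heading, table);
        return true;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><table><tr><td>"
            + "<h4>Information Research, Vol. 1</h4>"
            + "</td></tr></table><p>Text</p></body></html>");

        Console.WriteLine(TryReplaceFirstTable(doc, "Information Research, Vol"));
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Files for which the helper returns false can be logged and handled by hand rather than being saved unchanged.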

html

c#

xpath

html-agility-pack