HtmlAgilityPack: fix a tag that was not opened

I load an HTML page from a URL. The page contains a table in which one row is missing its opening <tr> tag:

<table class="transparent">
    <tr><td>Sąrašo eil. Nr.:</td><td>B-FA001</td></tr>
    <td>Įrašymo į Sąrašą data:</td><td>2006-11-13</td></tr>
</table>

How can I repair it so that it becomes:

<table class="transparent">
    <tr><td>Sąrašo eil. Nr.:</td><td>B-FA001</td></tr>
    <tr><td>Įrašymo į Sąrašą data:</td><td>2006-11-13</td></tr>
</table>

I tried:

private HtmlDocument GetHtmlDocument(string link)
{
    string url = "http://195.182.67.7/paslaugos/administratoriai/bankroto-administratoriai/" + link;
    var web = new HtmlWeb { AutoDetectEncoding = false, OverrideEncoding = Encoding.UTF8 };
    var doc = web.Load(url);
    doc.OptionFixNestedTags = true;
    doc.OptionAutoCloseOnEnd = true;
    doc.OptionCheckSyntax = true;

    // build a list of nodes ordered by stream position
    NodePositions pos = new NodePositions(doc);

    // browse all tags detected as not opened
    foreach (HtmlParseError error in doc.ParseErrors.Where(e => e.Code == HtmlParseErrorCode.TagNotOpened))
    {
        // find the text node just before this error
        var last = pos.Nodes.OfType<HtmlTextNode>().LastOrDefault(n => n.StreamPosition < error.StreamPosition);
        if (last != null)
        {
            // fix the text; reintroduce the broken tag
            last.Text = error.SourceText.Replace("/", "") + last.Text + error.SourceText;
        }
    }
    doc.Save(Console.Out);
    return doc;
}

But this does not fix the table.

For this specific problem, you can do a simple regex replacement:

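 using System;
 using System.Text.RegularExpressions;

 string wrong = "<table class=\"transparent\"><tr><td>Sąrašo eil. Nr.:</td><td>B-FA001</td></tr><td>Įrašymo į Sąrašą data:</td><td>2006-11-13</td></tr></table>";
 // Match every <td> that is not immediately preceded by <tr> or </td>, i.e. a cell that starts a row with no <tr>.
 Regex reg = new Regex(@"(?<!(?:<tr>)|(?:</td>))<td>");
 string right = reg.Replace(wrong, "<tr><td>");
 Console.WriteLine(right);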
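
The pattern above only accepts a <td> when <tr> or </td> comes immediately before it, so it depends on there being no whitespace between those tags. The page in the question is indented, so a variant that tolerates whitespace may be safer. The following is a minimal sketch, not a general HTML repair: it assumes the only defect is a missing <tr>, that rows contain only <td> cells (no <th>), and it relies on .NET allowing variable-length lookbehind. The names RowFixer and AddMissingRowOpeners are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class RowFixer
{
    // Matches "<td" when the nearest preceding tag (ignoring whitespace)
    // is neither an opening <tr ...> nor a closing </td>.
    private static readonly Regex MissingRowOpener = new Regex(
        @"(?<!(?:<tr\b[^>]*>|</td>)\s*)<td\b",
        RegexOptions.IgnoreCase);

    // Inserts "<tr>" in front of every cell that starts a row without one.
    // "$0" re-emits the matched text, so the original casing is preserved.
    public static string AddMissingRowOpeners(string html)
    {
        return MissingRowOpener.Replace(html, "<tr>$0");
    }
}
```

The repaired string could then be passed to doc.LoadHtml(...) on an HtmlDocument, instead of loading the page directly with HtmlWeb.Load, so that the parser sees well-formed rows.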