Why does this code execute 89% faster than the other when finding HTML elements? What is the difference?

Question

Consider the following code:

var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.OptionAutoCloseOnEnd = false;
htmlDoc.OptionCheckSyntax = false;
htmlDoc.OptionFixNestedTags = false;
htmlDoc.OptionOutputOptimizeAttributeValues = false;
htmlDoc.LoadHtml(html); /*Where html is a string of 5MB size.*/

/*First approach to select all "anchor" elements*/
HtmlNodeCollection coll = htmlDoc.DocumentNode.SelectNodes("//*/a");
if (coll != null && coll.Count > 0)
    ReplaceSourceLinks(coll, "href");

The code above loads a ~5 MB HTML string and replaces the href of all 9567 anchors found in the HTML with a value suitable for the application. It takes 1998 ms to execute.

So I decided to replace the last 3 lines shown above. Instead of using XPath to get the anchors, I used the following code:

IEnumerable<HtmlNode> coll = htmlDoc.DocumentNode.Descendants("a");
if (coll != null)
    ReplaceSourceLinks(coll, "href");

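The new approach takes only 220 ms! That is nearly 89% faster than the first approach. I just want to know whether these two snippets are equivalent. Do they target the same set of anchors? (By the way, the second one also selects the same 9567 elements.) Why does the second approach execute 89% faster?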

Thanks.

Answer 1

If you look at the source code, you will see that the SelectNodes method has to do more work, such as evaluating the XPath expression and locating the nodes:

public HtmlNodeCollection SelectNodes(string xpath)
{
    HtmlNodeCollection list = new HtmlNodeCollection(null);

    HtmlNodeNavigator nav = new HtmlNodeNavigator(_ownerdocument, this);
    XPathNodeIterator it = nav.Select(xpath);
    while (it.MoveNext())
    {
        HtmlNodeNavigator n = (HtmlNodeNavigator) it.Current;
        list.Add(n.CurrentNode);
    }
    if (list.Count == 0)
    {
        return null;
    }
    return list;
}

The Descendants method, by contrast, simply walks the cached ChildNodes and checks each element's name:

/// <summary>
/// Get all descendant nodes with matching name
/// </summary>
/// <param name="name"></param>
/// <returns></returns>
public IEnumerable<HtmlNode> Descendants(string name)
{
    foreach (HtmlNode node in Descendants())
        if (node.Name == name)
            yield return node;
}

The other helper methods used in the call above:

/// <summary>
/// Gets all Descendant nodes for this node and each of child nodes
/// </summary>
/// <returns></returns>
public IEnumerable<HtmlNode> DescendantNodes()
{
    foreach (HtmlNode node in ChildNodes)
    {
        yield return node;
        foreach (HtmlNode descendant in node.DescendantNodes())
            yield return descendant;
    }
}


/// <summary>
/// Gets all Descendant nodes in enumerated list
/// </summary>
/// <returns></returns>
public IEnumerable<HtmlNode> Descendants()
{
    foreach (HtmlNode node in DescendantNodes())
    {
        yield return node;
    }
}
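
To make the traversal concrete, here is a minimal, self-contained sketch of the same recursive iterator pattern on a toy tree. ToyNode and its members are hypothetical stand-ins, not Html Agility Pack types; only the shape of the two methods mirrors the source quoted above.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for HtmlNode: just a name and a list of children.
public sealed class ToyNode
{
    public string Name { get; }
    public List<ToyNode> ChildNodes { get; } = new List<ToyNode>();

    public ToyNode(string name, params ToyNode[] children)
    {
        Name = name;
        ChildNodes.AddRange(children);
    }

    // Depth-first, document-order walk over every node below this one.
    public IEnumerable<ToyNode> DescendantNodes()
    {
        foreach (ToyNode node in ChildNodes)
        {
            yield return node;
            foreach (ToyNode descendant in node.DescendantNodes())
                yield return descendant;
        }
    }

    // Name filter layered on top of the walk; nothing is buffered in a list.
    public IEnumerable<ToyNode> Descendants(string name)
    {
        foreach (ToyNode node in DescendantNodes())
            if (node.Name == name)
                yield return node;
    }
}
```

For a tree built as new ToyNode("html", new ToyNode("body", new ToyNode("a"), new ToyNode("div", new ToyNode("a")))), enumerating Descendants("a") visits each node once and yields the two anchors in document order.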

Answer 2

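One difference is the cost of parsing the XPath expression, but I would not expect that to account for such a gap. Judging from the source code given by @t3chb0t, the main difference seems to be that the XPath solution builds a list in memory, whereas the direct approach returns an iterator. You didn't say how many elements are selected, but building the list comes at a price: this looks like the inevitable consequence of a poorly designed API.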
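
The list-versus-iterator distinction can be illustrated with a small sketch that uses only the standard library. SelectionStyles, SelectEager, SelectLazy and the Visited counter are hypothetical names made up for this illustration; the code does not measure Html Agility Pack and says nothing about how much of the observed gap each factor explains.

```csharp
using System.Collections.Generic;

public static class SelectionStyles
{
    // Counts how many source items have been inspected, to make the work observable.
    public static int Visited;

    // Eager: walks the whole source and materializes a list before returning.
    public static List<int> SelectEager(IEnumerable<int> source, int wanted)
    {
        var list = new List<int>();
        foreach (int item in source)
        {
            Visited++;
            if (item == wanted)
                list.Add(item);
        }
        return list;
    }

    // Lazy: runs no code until the caller enumerates, then yields one match at a time.
    public static IEnumerable<int> SelectLazy(IEnumerable<int> source, int wanted)
    {
        foreach (int item in source)
        {
            Visited++;
            if (item == wanted)
                yield return item;
        }
    }
}
```

Calling SelectLazy by itself inspects nothing; the source is only walked as the caller's foreach pulls items. SelectEager has already inspected every item and allocated its list by the time it returns.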

c#

performance

xpath

html-agility-pack