Extract links from href tag via HtmlAgilityPack (nodes collection)
I'm having trouble extracting nodes via XPath. I'm trying to extract links from the href attribute of an <a> element; the HTML itself looks like this:
<span class="purchase-attachment"><a class="purchase-attachment__downloadLink fileLink" href="https://example.com" target="_blank" data-host="example.com" title="title"><span class="purchase-attachment__icon purchase-attachment__docIcon"><svg viewBox="0 0 16 16" class="_389HCiTc17xVEAZm1afRWB" fill="currentColor" focusable="false"><path fill-rule="evenodd" d="M10.0029297,10.9990234 L4.99902344,10.9990234 L4.99902344,10.0009766 L10.0029297,10.0009766 L10.0029297,10.9990234 Z M10.0029297,9.00292969 L4.99902344,9.00292969 L4.99902344,7.99804688 L10.0029297,7.99804688 L10.0029297,9.00292969 Z M8.99804688,7 L4.99902344,7 L4.99902344,6.00195312 L8.99804688,6.00195312 L8.99804688,7 Z M4.00097656,3.99902344 L4.00097656,13.0019531 L11.0009766,13.0019531 L11.0009766,7 L8,3.99902344 L4.00097656,3.99902344 Z M4.00097656,14 C3.70019381,14 3.45865977,13.9088551 3.27636719,13.7265625 C3.09407461,13.5442699 3.00292969,13.3027359 3.00292969,13.0019531 L3.00292969,3.99902344 C3.00292969,3.69824068 3.09407461,3.45670664 3.27636719,3.27441406 C3.45865977,3.09212148 3.70019381,3.00097656 4.00097656,3.00097656 L8.41699219,3.00097656 L11.9990234,6.58300781 L11.9990234,13.0019531 C11.9990234,13.2845066 11.898764,13.5214834 11.6982422,13.7128906 C11.4977204,13.9042978 11.2653008,14 11.0009766,14 L4.00097656,14 Z"></path></svg></span><span class="purchase-attachment__icon purchase-attachment__externalIcon"><svg viewBox="0 0 16 16" class="_389HCiTc17xVEAZm1afRWB" fill="currentColor" focusable="false"><path d="M14 4H11L12.1464 5.14644L7.64645 9.64642L8.35356 10.3535L12.8535 5.85355L14 7V4Z"></path><path d="M4 6H9.87866L7.87866 8H4V12H11V9.1213L12 8.1213V12C12 12.5523 11.5523 13 11 13H4C3.44772 13 3 12.5523 3 12V7C3 6.44772 3.44772 6 4 6Z"></path></svg></span><span class="purchase-attachment__fullName"><span class="purchase-attachment__fileName">name</span><span class="purchase-attachment__extension">.doc</span></span></a></span>
My code looks like this:
doc.DocumentNode.SelectNodes("//span[@class='purchase-attachment']//a[@class='purchase-attachment__downloadLink fileLink']")
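The output I get is: nothing.
I'm a beginner and still struggle with XPath.
Ultimately, I want to get the link that follows href (https://example.com), i.e. its InnerText.
These links are in the <a> element that comes right after <span class='purchase-attachment'>.
How do I correctly write the expression that extracts the InnerText from href?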
The XPath here is wrong. Let's fix it.
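HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@class='purchase-attachment__downloadLink fileLink']");
// SelectNodes returns null (not an empty collection) when nothing matches.
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        // The URL is the value of the href attribute; InnerText is the visible text inside the <a>.
        Console.WriteLine(node.GetAttributeValue("href", string.Empty));
        Console.WriteLine(node.InnerText);
    }
}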
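For reference, here is a minimal self-contained sketch of the approach above. It assumes the HtmlAgilityPack NuGet package is installed. The `LinkExtractor` class, the `ExtractLinks` helper, and the trimmed-down HTML string (SVG icons omitted) are illustrative names and data, not part of the original question. Because `@class='...'` compares the whole attribute string, the sketch matches the `fileLink` class token instead, which should keep working if the class order or spacing changes.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkExtractor
{
    // Hypothetical helper: returns (href, text) pairs for every <a> carrying the "fileLink" class.
    public static List<(string Href, string Text)> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Match the class token rather than the exact attribute string.
        const string xpath =
            "//a[contains(concat(' ', normalize-space(@class), ' '), ' fileLink ')]";

        var results = new List<(string Href, string Text)>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            // No match: SelectNodes returns null, so hand back an empty list.
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            string href = node.GetAttributeValue("href", string.Empty);
            string text = node.InnerText.Trim();
            results.Add((href, text));
        }
        return results;
    }

    public static void Main()
    {
        // Trimmed version of the HTML from the question (SVG icons removed).
        const string html =
            "<span class=\"purchase-attachment\">" +
            "<a class=\"purchase-attachment__downloadLink fileLink\" href=\"https://example.com\">" +
            "<span class=\"purchase-attachment__fileName\">name</span>" +
            "<span class=\"purchase-attachment__extension\">.doc</span>" +
            "</a></span>";

        foreach (var (href, text) in ExtractLinks(html))
        {
            Console.WriteLine($"{href} -> {text}"); // prints "https://example.com -> name.doc"
        }
    }
}
```

If `ExtractLinks` returns nothing for the real page, it may be worth checking whether the HTML you downloaded actually contains the `<a>` element; pages that build their content with JavaScript may not include it in the initial response.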
If you prefer jQuery or the JS querySelector, you can install an extension for HtmlAgilityPack: Fizzler.Systems.HtmlAgilityPack
The query will then be friendlier for web developers:
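// QuerySelectorAll (using Fizzler.Systems.HtmlAgilityPack) returns IEnumerable<HtmlNode>, not HtmlNodeCollection.
IEnumerable<HtmlNode> nodes = doc.DocumentNode.QuerySelectorAll("a.purchase-attachment__downloadLink.fileLink");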
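A fuller sketch of the CSS-selector variant, assuming both the HtmlAgilityPack and Fizzler.Systems.HtmlAgilityPack NuGet packages are installed. The `CssLinkExtractor` class and `ExtractHrefs` method are hypothetical names chosen for illustration. CSS class selectors match individual class tokens, so the selector does not depend on the exact text of the class attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Fizzler.Systems.HtmlAgilityPack; // provides the QuerySelectorAll extension method
using HtmlAgilityPack;

public static class CssLinkExtractor
{
    // Hypothetical helper: collects the href of every a.fileLink inside span.purchase-attachment.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // QuerySelectorAll yields an IEnumerable<HtmlNode>, so LINQ applies directly.
        return doc.DocumentNode
            .QuerySelectorAll("span.purchase-attachment a.fileLink")
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();
    }

    public static void Main()
    {
        // Trimmed version of the HTML from the question (SVG icons removed).
        const string html =
            "<span class=\"purchase-attachment\">" +
            "<a class=\"purchase-attachment__downloadLink fileLink\" href=\"https://example.com\">name.doc</a>" +
            "</span>";

        foreach (string href in ExtractHrefs(html))
        {
            Console.WriteLine(href);
        }
    }
}
```

Unlike `SelectNodes`, the sequence here is simply empty when nothing matches, so no null check is needed before iterating.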