How to extract text inside a div tag using htmlagilitypack

Question

I want to extract the text "Some Text Goes here..." from inside the div with this class. I am using Html Agility Pack and C#.

<div class="productDescriptionWrapper">
Some Text Goes here...
<div class="emptyClear"> </div>
</div>

This is what I have:

Description = doc.DocumentNode.SelectNodes("//div[@class=\"productDescriptionWrapper\"]").Descendants("div").Select(x => x.InnerText).ToList();

I get this error:

An unhandled exception of type 'System.NullReferenceException'

I know how to extract the text when it sits between <h1> or <p> tags: instead of "div" in Descendants I would have to pass "h1" or "p".

Somebody please help.

Answer 1

Use single quotes, such as:

//div[@class='productDescriptionWrapper']

To get all descendants of any type, use:

//div[@class='productDescriptionWrapper']//*

To get all descendants of a specific type, such as p, use //div[@class='productDescriptionWrapper']//p.

To get all descendants that are either div or p:

//div[@class='productDescriptionWrapper']//*[self::div or self::p]

If you want all non-blank descendant text nodes, use:

//div[@class='productDescriptionWrapper']//text()[normalize-space()]
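
As a rough illustration of how these expressions might be run from C#, here is a minimal sketch. The `Query` helper and the `XPathDemo` class are hypothetical names, and the sample HTML is the snippet from the question with a `<p>` element added for demonstration. It assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class XPathDemo
{
    // Hypothetical helper: runs an XPath query against the parsed HTML and
    // returns the matched nodes, or an empty sequence when nothing matches.
    public static IEnumerable<HtmlNode> Query(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        IEnumerable<HtmlNode> nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes ?? Enumerable.Empty<HtmlNode>();
    }

    static void Main()
    {
        // Assumed sample input: the question's snippet plus an extra <p>.
        var html = @"<div class=""productDescriptionWrapper"">
Some Text Goes here...
<p>A paragraph</p>
<div class=""emptyClear""> </div>
</div>";
        const string wrapper = "//div[@class='productDescriptionWrapper']";

        // Element descendants that are either div or p.
        foreach (var node in Query(html, wrapper + "//*[self::div or self::p]"))
            Console.WriteLine(node.Name);

        // Non-blank descendant text nodes, trimmed of surrounding whitespace.
        foreach (var node in Query(html, wrapper + "//text()[normalize-space()]"))
            Console.WriteLine(node.InnerText.Trim());
    }
}
```

Using single quotes inside the XPath string keeps the C# literal free of escaped double quotes, which is the point of the first suggestion above.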

Answer 2

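Given that doc was created from the HTML snippet you posted, you should not get a null reference exception. In any case, if you meant to get the text within the outer <div>, but not the text from the inner one, use the XPath /text(), which means: get the direct child text nodes.

For example, given this HTML snippet: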

var html = @"<div class=""productDescriptionWrapper"">
Some Text Goes here...
<div class=""emptyClear"">Don't get this one</div>
</div>";
var doc = new HtmlDocument();
doc.LoadHtml(html);

..this expression returns only the text from the outer <div>:

var Description = doc.DocumentNode
                     .SelectNodes("//div[@class='productDescriptionWrapper']/text()")
                     .Select(x => x.InnerText.Trim())
                     .First();
//Description : 
//"Some Text Goes here..."

..whereas, in contrast, the following returns all of the text:

var Description = doc.DocumentNode
                     .SelectNodes("//div[@class='productDescriptionWrapper']")
                     .Select(x => x.InnerText.Trim())
                     .First();
//Description :
//"Some Text Goes here...
//Don't get this one"
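
One plausible source of the NullReferenceException on a real page is that `SelectNodes` returns null, rather than an empty collection, when the XPath matches nothing, so any chained call on the result fails. The sketch below guards against that case. `DescriptionExtractor` and `GetOuterText` are hypothetical names, and the choice to return null for "not found" is an assumption made for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class DescriptionExtractor
{
    // Hypothetical helper: returns the outer div's own text (direct child
    // text nodes only), or null when no matching div or text is found.
    public static string GetOuterText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var textNodes = doc.DocumentNode.SelectNodes(
            "//div[@class='productDescriptionWrapper']/text()");

        // SelectNodes yields null when nothing matches, so check first.
        if (textNodes == null)
            return null;

        // Skip whitespace-only text nodes and take the first real one.
        return textNodes
            .Select(n => n.InnerText.Trim())
            .FirstOrDefault(t => t.Length > 0);
    }

    static void Main()
    {
        var html = @"<div class=""productDescriptionWrapper"">
Some Text Goes here...
<div class=""emptyClear"">Don't get this one</div>
</div>";

        Console.WriteLine(GetOuterText(html) ?? "(no description found)");
        Console.WriteLine(GetOuterText("<p>no wrapper here</p>") ?? "(no description found)");
    }
}
```

If the helper reports no match on the live page, it may be worth checking whether the class attribute there carries extra class names or whether the markup is generated by script after load.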
