HtmlAgilityPack: parse attributes
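I'm trying to parse HTML, but I don't know how to filter on a condition (for example, the class name must be X). I know there are many topics about the Agility Pack, but I couldn't find one that helped.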
<div class="main-class">
<a href="LINK">
<img src="IMAGELINK" alt="SOMETEXT" class="image-class">
</a>
</div>
<p> bla bla </p>
<div class="main-class">
<a href="LINK">
<img src="IMAGELINK" alt="SOMETEXT" class="image-class">
</a>
</div>
<div class="main-class">
<a href="LINK">
<img src="IMAGELINK" alt="SOMETEXT" class="image-class">
</a>
<p> asd sadh awww </p>
</div>
For every div whose class is "main-class", I want to get the href, src, and alt.
Here is my code, but it only prints the "p" elements, because that is the only thing I know how to do.
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(dataString);
foreach (HtmlNode nodeItem in doc.DocumentNode.Descendants("p").ToArray())
{
    Debug.WriteLine(nodeItem.InnerText);
}
I'm developing a Windows Phone (WP) app, where "SelectNodes" is not supported.
You can do this the traditional, non-XPath way.
Note: null checks are omitted.
string dataString = "<div class=\"main-class\"><a href=\"LINK\"><img src=\"IMAGELINK\" alt=\"SOMETEXT\" class=\"image-class\"></a></div><p> bla bla </p><div class=\"main-class\"><a href=\"LINK\"><img src=\"IMAGELINK\" alt=\"SOMETEXT\" class=\"image-class\"></a></div><div class=\"main-class\"><a href=\"LINK\"><img src=\"IMAGELINK\" alt=\"SOMETEXT\" class=\"image-class\"></a><p> asd sadh awww </p></div>";
var doc = new HtmlDocument();
doc.LoadHtml(dataString);
var elements = doc.DocumentNode.Descendants("div").Where(o => o.GetAttributeValue("class", "") == "main-class");
foreach (var nodeItem in elements)
{
    var aTag = nodeItem.Descendants("a").First();
    var aTagHrefValue = aTag.Attributes["href"];
    var imgTag = nodeItem.Descendants("img").First();
    var imgTagSrcValue = imgTag.Attributes["src"];
    var imgTagAltValue = imgTag.Attributes["alt"];
    Console.WriteLine("a href value: {0}", aTagHrefValue.Value);
    Console.WriteLine("img src value: {0}", imgTagSrcValue.Value);
    Console.WriteLine("img alt value: {0}", imgTagAltValue.Value);
    Console.WriteLine();
}
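Since the null checks were left out above, here is a minimal sketch of how the same loop might look with them put back in. It assumes the HtmlAgilityPack package is referenced. The sample markup is made up for illustration: its second div deliberately has no link or image, so the fallback path is exercised. `First()` throws when nothing matches, so the sketch swaps in `FirstOrDefault()` and reads attributes through `GetAttributeValue`, which returns the supplied default when the attribute is missing.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Illustrative input: the second div has no <a> or <img> on purpose.
        string dataString =
            "<div class=\"main-class\"><a href=\"LINK\"><img src=\"IMAGELINK\" alt=\"SOMETEXT\" class=\"image-class\"></a></div>" +
            "<div class=\"main-class\"><p>no link here</p></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(dataString);

        var elements = doc.DocumentNode
            .Descendants("div")
            .Where(o => o.GetAttributeValue("class", "") == "main-class");

        foreach (var nodeItem in elements)
        {
            // FirstOrDefault yields null instead of throwing when no match exists.
            var aTag = nodeItem.Descendants("a").FirstOrDefault();
            var imgTag = nodeItem.Descendants("img").FirstOrDefault();

            // Fall back to an empty string when the tag or the attribute is absent.
            string href = aTag != null ? aTag.GetAttributeValue("href", "") : "";
            string src = imgTag != null ? imgTag.GetAttributeValue("src", "") : "";
            string alt = imgTag != null ? imgTag.GetAttributeValue("alt", "") : "";

            Console.WriteLine("a href value: {0}", href);
            Console.WriteLine("img src value: {0}", src);
            Console.WriteLine("img alt value: {0}", alt);
            Console.WriteLine();
        }
    }
}
```

Whether an empty string is the right fallback depends on the app; skipping the div entirely when `aTag` is null would be an equally reasonable choice.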
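@Orel Eraki - Thanks. I had actually worked it out myself three minutes earlier, but I'll switch to your solution because it needs only one foreach loop. Anyway, here is my solution: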
foreach (HtmlNode nodeItem in doc.DocumentNode.Descendants("div").Where(p => p.GetAttributeValue("class", "def").Equals("main-class")))
{
    foreach (HtmlNode nodeAItem in nodeItem.Descendants("a"))
    {
        Debug.WriteLine(nodeAItem.GetAttributeValue("href", "def"));
        foreach (HtmlNode nodeIMAGEitem in nodeAItem.Descendants("img"))
        {
            Debug.WriteLine(nodeIMAGEitem.GetAttributeValue("src", "def"));
            Debug.WriteLine(nodeIMAGEitem.GetAttributeValue("alt", "def"));
        }
    }
}
You can use LINQ for this:
var attrs = doc.DocumentNode
    .Descendants("div")
    .Where(d => d.Attributes != null &&
                d.Attributes.Contains("class") &&
                d.Attributes["class"].Value.Contains("main-class"))
    .Select(d => new
    {
        anchor = d.SelectSingleNode("a"),
        img = d.SelectSingleNode("a") != null
            ? d.SelectSingleNode("a").SelectSingleNode("img")
            : null
    })
    .Select(d => new
    {
        href = d.anchor != null
            ? d.anchor.GetAttributeValue("href", string.Empty)
            : string.Empty,
        imgsrc = d.img != null
            ? d.img.GetAttributeValue("src", string.Empty)
            : string.Empty,
        imgalt = d.img != null
            ? d.img.GetAttributeValue("alt", string.Empty)
            : string.Empty
    })
    .ToList();
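Two caveats about the query above: `SelectSingleNode` is an XPath call, so it may be unavailable on the same platform where `SelectNodes` is missing, and `Value.Contains("main-class")` is a substring test that would also match a class such as "main-class-2". The following is a rough sketch of a variant that avoids both. It assumes that `HtmlNode.Element(name)`, which returns the first direct child with that name, is present in your build of HtmlAgilityPack. `HasClassToken` is a hypothetical helper written for this example, not part of the library, and the sample markup is invented.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    // Hypothetical helper: true when the class attribute contains the whole token.
    static bool HasClassToken(HtmlNode node, string token)
    {
        return node.GetAttributeValue("class", "")
                   .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
                   .Contains(token);
    }

    static void Main()
    {
        // Illustrative input: the second div must not match "main-class".
        string dataString =
            "<div class=\"main-class extra\"><a href=\"LINK\"><img src=\"IMAGELINK\" alt=\"SOMETEXT\"></a></div>" +
            "<div class=\"main-class-2\"><a href=\"OTHER\"><img src=\"X\" alt=\"Y\"></a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(dataString);

        var attrs = doc.DocumentNode
            .Descendants("div")
            .Where(d => HasClassToken(d, "main-class"))
            .Select(d =>
            {
                // Element looks only at direct children and involves no XPath.
                var anchor = d.Element("a");
                var img = anchor != null ? anchor.Element("img") : null;
                return new
                {
                    href = anchor != null ? anchor.GetAttributeValue("href", string.Empty) : string.Empty,
                    imgsrc = img != null ? img.GetAttributeValue("src", string.Empty) : string.Empty,
                    imgalt = img != null ? img.GetAttributeValue("alt", string.Empty) : string.Empty
                };
            })
            .ToList();

        foreach (var item in attrs)
        {
            Console.WriteLine("{0} | {1} | {2}", item.href, item.imgsrc, item.imgalt);
        }
    }
}
```

Like the relative XPath "a", `Element` only inspects direct children. If the link can be nested more deeply inside the div, `Descendants("a").FirstOrDefault()` is the closer substitute.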