Cannot get the li elements that belong to a specific ul
I have this structure:
<ul>
<li class="list-group-item px-0">
<h2>Foo</h2>
<ul>
<li class="list-group-item">
<h3>Test</h3>
</li>
</ul>
</li>
<li class="list-group-item px-0">
<h2>Contoso</h2>
<ul>
<li class="list-group-item">
<h3>Test 2</h3>
</li>
</ul>
</li>
</ul>
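I am trying to get all the li elements that belong to the node in the current iteration, which is the first ul, so the result should be Foo and Contoso. Instead, I get every available li. This is my code: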
var liCollection = node.SelectNodes(".//ul/li[@class='list-group-item']");
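I can work around this by adding px-0 to the class test, but is it possible to get only the li elements associated with the first ul in the iteration?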
Full code:
I made a sample to fit your needs. I think this is what you are trying to achieve!
var list = doc.DocumentNode.SelectNodes(
"//div[@class='shadow-sm autoscroll my-1']");
var collection = list.Select(x => x.SelectNodes(".//ul/li[@class='list-group-item']"));
//This is for "A", "B" etc
var category = list.Select(x => x.SelectNodes(".//span[contains(@class, 'badge-light')]"));
//This is for "A01A" etc
var listTitles = list.Select(x => x.SelectNodes(".//ul/li[@class='list-group-item']//span"));
//This is for "Preparazioni stomatologiche" etc
var descriptions = list.Select(x => x.SelectNodes(".//ul/li[@class='list-group-item']//a"));
Using this as a guide, you can scrape the data you actually want.
Update
Putting it all together:
using System.Collections.Generic;
using System.IO;
using System.Text.Json;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load(Directory.GetCurrentDirectory() + "/html.txt");
// One div per top-level category
var data = doc.DocumentNode.SelectNodes("//div[@class='shadow-sm autoscroll my-1']");
List<dynamic> objects = new();
foreach (var item in data)
{
    // Every li under the category's list-group ul
    foreach (var sub in item.SelectNodes(".//ul[contains(@class, 'list-group')]//li"))
    {
        var obj = new
        {
            Category = item.SelectSingleNode(".//div[@class='mb-1']//span").InnerText.Trim(),
            Description = item.SelectSingleNode(".//div[@class='mb-1']//h2").InnerText.Trim(),
            Sub = new
            {
                SubCategories = sub.SelectSingleNode(".//span").InnerText.Trim(),
                SubDescriptions = sub.SelectSingleNode(".//a").InnerText.Trim(),
            }
        };
        objects.Add(obj);
    }
}
var json = JsonSerializer.Serialize(objects, new JsonSerializerOptions { WriteIndented = true });
I took a completely different approach:
var html1 = File.ReadAllText("input.html");
var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html1);
// Children of the first div next to each classed span (the group headers)
var uls = htmlDoc.DocumentNode.SelectNodes("//span[@class]/../../div[1]/*");
foreach (HtmlNode ul in uls)
{
    var group = ul.InnerText.Replace('\r', ' ').Replace('\n', ' ').Trim();
    // Children of the sibling second div (the sub-entries of this group)
    foreach (HtmlNode subul in ul.SelectNodes("./../../div[2]/*"))
    {
        var sub = subul.InnerText.Trim();
        if (!string.IsNullOrEmpty(sub)) Console.WriteLine($"{group}: {sub}");
    }
}
Output:
A: Apparato gastrointestinale e metabolismo
A01: Preparati stomatologici
A01A: Preparazioni stomatologiche
A02: Farmaci per malattie correlate all'acidosi
A02A: Antiacidi
A02B: Farmaci per l'ulcera peptica e la malattia da reflusso gastroesofageo (gerd)
A03: Farmaci per malattie gastrointestinali funzionali
A03A: Farmaci malattie gastrointestinali funzionali
A03B: Belladonna e derivati
A03F: Procinetici
A04: Antiemetici e antinausea
A04A: Antiemetici e antinausea
A05: Bile e terapia del fegato
A05A: Terapia per la bile
...
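For the simplified structure in the question, the likely cause is the leading .// in the query: it selects along the descendant axis, so it reaches ul elements at any depth below the context node, including the nested ones. Below is a minimal sketch of restricting the query to direct children instead. It assumes HtmlAgilityPack is referenced, and the variable name outerUl is a hypothetical stand-in for the node in the iteration; the markup is the sample from the question.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is referenced

class Program
{
    static void Main()
    {
        // Simplified markup copied from the question.
        const string html = @"
<ul>
  <li class=""list-group-item px-0"">
    <h2>Foo</h2>
    <ul><li class=""list-group-item""><h3>Test</h3></li></ul>
  </li>
  <li class=""list-group-item px-0"">
    <h2>Contoso</h2>
    <ul><li class=""list-group-item""><h3>Test 2</h3></li></ul>
  </li>
</ul>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Hypothetical stand-in for the iterated node: the first ul in document order.
        var outerUl = doc.DocumentNode.SelectSingleNode("//ul");

        // "./li" follows the child axis only, so li elements inside nested ul are skipped.
        // SelectNodes returns null when nothing matches, hence the guard.
        var items = outerUl?.SelectNodes("./li");
        if (items == null) return;

        foreach (var li in items)
        {
            // Read the heading that is a direct child of this li.
            var heading = li.SelectSingleNode("./h2");
            Console.WriteLine(heading?.InnerText.Trim());
        }
    }
}
```

If the context node is the element that contains the outer ul rather than the ul itself, the same idea would be written as ./ul/li, which is still limited to direct children at each step.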