How can I parse repeated HTML elements by their class attribute?
I am trying to parse an HTML file in which the tags are essentially identical from row to row.
I would like to get this output:
BTC - Bitcoin, BEP20(BSC), Bitcoin(Segwit)
ETH - ERC20, BEP20(BSC), POLYGON, ARBITRUM, AURORA, MATISEVM
USDT - OMNI,TRC20,ERC20,BEP20(BSC),HECO,POLYGON,FTM, AVAX-C ,ARBITRUM,METISEVM
QASH - ERC20
Here is a sample of the HTML:
<div data-v-326d86f4="" class="table-box">
<table data-v-326d86f4="">
<tr data-v-326d86f4="">
<td data-v-326d86f4="">BTC</td>
<td data-v-326d86f4="" class="block-chain">
<div data-v-326d86f4="" class="chain_box"><span data-v-326d86f4="" class="chain_name">Bitcoin</span> <span data-v-326d86f4=""><i data-v-326d86f4="" class="fa fa-caret-down"></i></span></div>
<div data-v-326d86f4="" class="select-list"><span data-v-326d86f4="">Bitcoin</span><span data-v-326d86f4="">BEP20(BSC)</span><span data-v-326d86f4="">Bitcoin(SegWit)</span></div>
</td>
<td data-v-326d86f4="">0.001</td>
<td data-v-326d86f4="">0.002</td>
</tr>
<tr data-v-326d86f4="">
<td data-v-326d86f4="">ETH</td>
<td data-v-326d86f4="" class="block-chain">
<div data-v-326d86f4="" class="chain_box"><span data-v-326d86f4="" class="chain_name">ERC20</span> <span data-v-326d86f4=""><i data-v-326d86f4="" class="fa fa-caret-down"></i></span></div>
<div data-v-326d86f4="" class="select-list"><span data-v-326d86f4="">ERC20</span><span data-v-326d86f4="">BEP20(BSC)</span><span data-v-326d86f4="">POLYGON</span><span data-v-326d86f4="">ARBITRUM</span><span data-v-326d86f4="">AURORA</span><span data-v-326d86f4="">METISEVM</span></div>
</td>
<td data-v-326d86f4="">0.012</td>
<td data-v-326d86f4="">0.024</td>
</tr>
<tr data-v-326d86f4="">
<td data-v-326d86f4="">USDT</td>
<td data-v-326d86f4="" class="block-chain">
<div data-v-326d86f4="" class="chain_box"><span data-v-326d86f4="" class="chain_name">OMNI</span> <span data-v-326d86f4=""><i data-v-326d86f4="" class="fa fa-caret-down"></i></span></div>
<div data-v-326d86f4="" class="select-list"><span data-v-326d86f4="">OMNI</span><span data-v-326d86f4="">TRC20</span><span data-v-326d86f4="">ERC20</span><span data-v-326d86f4="">BEP20(BSC)</span><span data-v-326d86f4="">HECO</span><span data-v-326d86f4="">POLYGON</span><span data-v-326d86f4="">FTM</span><span data-v-326d86f4="">AVAX-C</span><span data-v-326d86f4="">ARBITRUM</span><span data-v-326d86f4="">METISEVM</span></div>
</td>
<td data-v-326d86f4="">30</td>
<td data-v-326d86f4="">50</td>
</tr>
<tr data-v-326d86f4="">
<td data-v-326d86f4="">QASH</td>
<td data-v-326d86f4="" class="block-chain">
<div data-v-326d86f4="" class="chain_box">
<span data-v-326d86f4="" class="chain_name">ERC20</span> <!---->
</div>
<!---->
</td>
<td data-v-326d86f4="">513</td>
<td data-v-326d86f4="">1026</td>
</tr>
<!-- ... -->
I am using the HtmlAgilityPack library, but without success:
Dim arqHtml As String = "C:\Users\Mattia\Desktop\ready.html"
Dim myHtml As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
myHtml.Load(arqHtml)
Dim myTable As HtmlAgilityPack.HtmlNode = myHtml.DocumentNode.SelectSingleNode("//table")
Dim myRows As HtmlAgilityPack.HtmlNodeCollection = myTable.SelectNodes("tr")
For Each tmpRow As HtmlAgilityPack.HtmlNode In myRows
Dim myCells As HtmlAgilityPack.HtmlNodeCollection = tmpRow.SelectNodes("td")
If myCells IsNot Nothing Then
Dim myToken As String = myCells(0).InnerText
Dim mySpans As HtmlAgilityPack.HtmlNodeCollection = myCells(1).SelectNodes("div[contains(@class,'select-list')]/span")
If mySpans IsNot Nothing Then
Dim myListBChain As New List(Of String)
For Each mySpan As HtmlAgilityPack.HtmlNode In mySpans
RichTextBox1.Text += mySpan.InnerText
Next
Dim allItensAsString = String.Join(", ", richtextbox1.text)
End If
End If
Next
This returns the following output:
BitcoinBEP20(BSC)Bitcoin(SegWit)ERC20BEP20(BSC)POLYGONARBITRUMAURORAMETISEVMOMNITRC20ERC20BEP20(BSC)HECOPOLYGONFTMAVAX-CARBITRUMMETISEVMEOSBEP20(BSC)ERC20BEP20(BSC)TRC20BEP20(BSC)ZILBEP20(BSC)NEOLEGACYNEON3ERC20POLYGONERC20DAGBEP2BEP20(BSC)FTMAVAX-CERC20BEP20(BSC)ERC20BEP20(BSC)ERC20HECOBEP20(BSC)ERC20HECOERC20POLYGONERC20HECOERC20POLYGONERC20BEP20(BSC)BCHBEP20(BSC)ERC20LOOPPOLYGONBEP20(BSC)FTMAVAX-CMETISEVMERC20TOLERC20METAERC20BEP20(BSC)
How can I make it return the output I want?
Regarding the original issue: in the last <tr> of your sample...
<tr data-v-326d86f4="">
<td data-v-326d86f4="">QASH</td>
<td data-v-326d86f4="" class="block-chain">
<div data-v-326d86f4="" class="chain_box">
<span data-v-326d86f4="" class="chain_name">ERC20</span> <!---->
</div>
<!---->
</td>
<td data-v-326d86f4="">513</td>
<td data-v-326d86f4="">1026</td>
</tr>
...the second <td> does not contain a <div class="select-list" ... >, so...
myCells(1).SelectNodes("div[contains(@class,'select-list')]/span")
...returns Nothing, hence the NullReferenceException.
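To see this behaviour in isolation, here is a minimal sketch. It assumes HtmlAgilityPack's default settings, under which SelectNodes returns Nothing rather than an empty collection when the XPath matches nothing. The helper name ReturnsNothing and the cut-down HTML fragment are made up for illustration.

```vb
Imports System
Imports HtmlAgilityPack

Module SelectNodesDemo
    ' Hypothetical helper: True when the XPath matches nothing in the fragment.
    Function ReturnsNothing(html As String, xpath As String) As Boolean
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)
        ' With default options, no match yields Nothing, not an empty collection.
        Return doc.DocumentNode.SelectNodes(xpath) Is Nothing
    End Function

    Sub Main()
        ' A cut-down cell like the QASH row: a chain_box but no select-list.
        Dim qashCell As String =
            "<td class=""block-chain""><div class=""chain_box"">" &
            "<span class=""chain_name"">ERC20</span></div></td>"

        ' No select-list in the cell, so the first query finds nothing.
        Console.WriteLine(ReturnsNothing(qashCell, "//div[contains(@class,'select-list')]/span"))
        ' The chain_name span does exist, so the second query finds a node.
        Console.WriteLine(ReturnsNothing(qashCell, "//span[contains(@class,'chain_name')]"))
    End Sub
End Module
```

Running a For Each over a Nothing collection, or calling a member on it, is what raises the NullReferenceException, so the Is Nothing check has to come first.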
As for building the output you want, first you need to test whether such a <div class="select-list" ... > element exists...
If mySpans Is Nothing Then
If it does not, save the content of the <div class="chain_box" ... ><span class="chain_name" ... > element...
Dim chainTextNode As HtmlAgilityPack.HtmlNode = myCells(1).SelectSingleNode(
"div[contains(@class, 'chain_box')]/span[contains(@class, 'chain_name')]"
)
chainText = If(chainTextNode Is Nothing OrElse String.IsNullOrWhiteSpace(chainTextNode.InnerText), "(unknown)", chainTextNode.InnerText)
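I added some extra handling in case the element does not exist or has no value.
If there is a <div class="select-list" ... > element, save the values of its child <span ... > elements, separated by commas...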
chainText = String.Join(", ", mySpans.Select(Function(span) span.InnerText))
' Alternative: chainText = String.Join(", ", From span In mySpans Select span.InnerText)
Finally, build a new line and append it to your text box...
RichTextBox1.Text &= $"{myToken} - {chainText}{Environment.NewLine}"
The complete code looks like this...
Dim arqHtml As String = "C:\Users\Mattia\Desktop\ready.html"
Dim myHtml As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument()
myHtml.Load(arqHtml)
Dim myTable As HtmlAgilityPack.HtmlNode = myHtml.DocumentNode.SelectSingleNode("//table")
Dim myRows As HtmlAgilityPack.HtmlNodeCollection = myTable.SelectNodes("tr")
For Each tmpRow As HtmlAgilityPack.HtmlNode In myRows
Dim myCells As HtmlAgilityPack.HtmlNodeCollection = tmpRow.SelectNodes("td")
If myCells IsNot Nothing Then
Dim myToken As String = myCells(0).InnerText
Dim mySpans As HtmlAgilityPack.HtmlNodeCollection = myCells(1).SelectNodes("div[contains(@class,'select-list')]/span")
Dim chainText As String
If mySpans Is Nothing Then
Dim chainTextNode As HtmlAgilityPack.HtmlNode = myCells(1).SelectSingleNode(
"div[contains(@class, 'chain_box')]/span[contains(@class, 'chain_name')]"
)
chainText = If(chainTextNode Is Nothing OrElse String.IsNullOrWhiteSpace(chainTextNode.InnerText), "(unknown)", chainTextNode.InnerText)
Else
chainText = String.Join(", ", mySpans.Select(Function(span) span.InnerText))
' Alternative: chainText = String.Join(", ", From span In mySpans Select span.InnerText)
End If
RichTextBox1.Text &= $"{myToken} - {chainText}{Environment.NewLine}"
End If
Next
If you have a very large input HTML file, you might consider...
- Appending each iteration's line to a StringBuilder...
outputBuilder.Append($"{myToken} - {chainText}{Environment.NewLine}")
...and then setting RichTextBox1.Text once after the loop...
RichTextBox1.Text = outputBuilder.ToString()
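- (Assuming WinForms) Calling RichTextBox1.SuspendLayout() before the loop and RichTextBox1.ResumeLayout() after the loop
...to improve performance. However, using either or both of these approaches means that RichTextBox1 will not display any output until the HTML has been fully processed.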
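The StringBuilder suggestion can be sketched as a function that builds the whole report before anything touches the UI. This is an illustration under assumptions, not part of the original answer: the name BuildChainReport is hypothetical, the function takes the HTML as a string rather than a file path, it also skips rows with fewer than two cells, and it trims whitespace from each value.

```vb
Imports System
Imports System.Linq
Imports System.Text
Imports HtmlAgilityPack

Module ChainFormatter
    ' Hypothetical helper: returns one "TOKEN - chain, chain" line per table row.
    Function BuildChainReport(html As String) As String
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim table As HtmlNode = doc.DocumentNode.SelectSingleNode("//table")
        If table Is Nothing Then Return String.Empty

        ' ".//tr" also finds rows wrapped in a <tbody>, if the file has one.
        Dim rows As HtmlNodeCollection = table.SelectNodes(".//tr")
        If rows Is Nothing Then Return String.Empty

        Dim outputBuilder As New StringBuilder()
        For Each row As HtmlNode In rows
            Dim cells As HtmlNodeCollection = row.SelectNodes("td")
            ' Skip header rows and rows too short to hold a chain cell.
            If cells Is Nothing OrElse cells.Count < 2 Then Continue For

            Dim spans As HtmlNodeCollection =
                cells(1).SelectNodes("div[contains(@class,'select-list')]/span")
            Dim chainText As String
            If spans Is Nothing Then
                ' No select-list: fall back to the single chain_name span.
                Dim nameNode As HtmlNode = cells(1).SelectSingleNode(
                    "div[contains(@class,'chain_box')]/span[contains(@class,'chain_name')]")
                chainText = If(nameNode Is Nothing OrElse String.IsNullOrWhiteSpace(nameNode.InnerText),
                               "(unknown)", nameNode.InnerText.Trim())
            Else
                chainText = String.Join(", ", spans.Select(Function(s) s.InnerText.Trim()))
            End If

            outputBuilder.AppendLine($"{cells(0).InnerText.Trim()} - {chainText}")
        Next

        Return outputBuilder.ToString()
    End Function
End Module
```

From the form, the text box would then be assigned once, for example RichTextBox1.Text = BuildChainReport(System.IO.File.ReadAllText(arqHtml)). Keeping the parsing out of the UI code also makes it easier to test against small HTML fragments.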