How to parse <br> tags in a table using HtmlAgilityPack?
I have an HTML table whose cell values are separated by <br> tags.
<TABLE class=a12 cellSpacing=0 cols=8 cellPadding=0 border=1>
<TBODY>
<TR>
<TD style="WIDTH: 20.32mm"></TD>
<TD style="WIDTH: 34mm"></TD>
<TD style="WIDTH: 34mm"></TD>
<TD style="WIDTH: 34mm"></TD>
<TD style="WIDTH: 34mm"></TD>
<TD style="WIDTH: 34mm"></TD>
<TD style="WIDTH: 34mm"></TD>
<TD style="WIDTH: 34mm"></TD>
</TR>
<TR style="HEIGHT: 5.08mm">
<TD class=a23><DIV class=r11>Hrs</DIV></TD>
<TD class=a24><DIV class=r11>MON</DIV></TD>
<TD class=a25><DIV class=r11>TUE</DIV></TD>
<TD class=a26><DIV class=r11>WED</DIV></TD>
<TD class=a27><DIV class=r11>THU</DIV></TD>
<TD class=a28><DIV class=r11>FRI</DIV></TD>
<TD class=a29><DIV class=r11>SAT</DIV></TD>
<TD class=a30><DIV class=r11>SUN</DIV></TD>
</TR>
<TR style="HEIGHT: 14.7mm">
<TD class=a59><DIV class=r11>00:00</DIV></TD>
<TD class=a60><DIV class=r11>FGH<BR>BM</DIV></TD>
<TD class=a61><DIV class=r11>RFG8<BR>MFT5</DIV></TD>
<TD class=a62><DIV class=r11>V5B6<BR>FG</DIV></TD>
<TD class=a63><DIV class=r11>VB2N<BR>BN</DIV></TD>
<TD class=a64><DIV class=r11>DFG21</DIV></TD>
<TD class=a65><DIV class=r11>FGH<BR>MD20<BR>DHB0</DIV></TD>
<TD class=a66><DIV class=r11>FD6<BR>HT7H4</DIV></TD>
</TR>
<TR style="HEIGHT: 14.7mm">
<TD class=a59><DIV class=r11>02:00</DIV></TD>
<TD class=a60><DIV class=r11>VN</DIV></TD>
<TD class=a61><DIV class=r11>RTY<BR>MHF</DIV></TD>
<TD class=a62><DIV class=r11>V5B6<BR>FG</DIV></TD>
<TD class=a63><DIV class=r11>ZXC<BR>FHF</DIV></TD>
<TD class=a64><DIV class=r11>DFG21<BR>GH<BR>PKJK</DIV></TD>
<TD class=a65><DIV class=r11>FGH<BR>MD20</DIV></TD>
<TD class=a66><DIV class=r11>FFG<BR>HFG4</DIV></TD>
</TR>
<TR style="HEIGHT: 14.7mm">
<TD class=a59><DIV class=r11>04:00</DIV></TD>
<TD class=a60><DIV class=r11>VNFG</DIV></TD>
<TD class=a61><DIV class=r11>RTY<BR>MHF<br>T54</DIV></TD>
<TD class=a62><DIV class=r11>CNFG</DIV></TD>
<TD class=a63><DIV class=r11>QFCF<BR>FHF</DIV></TD>
<TD class=a64><DIV class=r11>DFG21<BR>GH67</DIV></TD>
<TD class=a65><DIV class=r11>SDF<BR>DFH</DIV></TD>
<TD class=a66><DIV class=r11>CXV<BR>HFG4</DIV></TD>
</TR>
</TBODY>
</TABLE>
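I tried converting the HTML table to a DataTable, but the values inside each cell come out concatenated.
How can I parse the <br> tags so that the cell values are separated by commas instead of being run together?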
Private Function ParseTable(doc As HtmlDocument) As DataTable
    Dim result As New DataTable()
    Dim TableClassA12 As HtmlNode = doc.DocumentNode.SelectSingleNode("//table[@class='a12']")
    Dim rows = TableClassA12.Descendants("tr")
    Dim header = rows.Skip(1).First()
    For Each column In header.Descendants("td")
        result.Columns.Add(New DataColumn(column.InnerText.Trim, GetType(String)))
    Next
    For Each row In rows.Skip(2)
        Dim data = New List(Of String)()
        For Each column In row.Descendants("td")
            Dim cellText As String = column.InnerText.Trim
            data.Add(cellText)
        Next
        If data.Count > 0 Then
            result.Rows.Add(data.ToArray())
        End If
    Next
    Return result
End Function
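To keep the changes to your existing code minimal, you can select the div inside each td and read its InnerHtml, which gives you the inner text together with the <br> tags. At that point you can simply replace the <br> tags with commas: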
For Each column In row.Descendants("td").SelectMany(Function(x) x.Elements("div"))
    Dim cellText As String = column.InnerHtml.Trim.Replace("<br>", ",")
    data.Add(cellText)
Next
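One caveat: String.Replace is case-sensitive, and the sample markup mixes <BR> and <br>. If InnerHtml hands the tags back in their original casing, the plain Replace above would only catch the lowercase ones. A minimal sketch of a more tolerant replacement follows. BrToComma is a hypothetical helper name, not something HtmlAgilityPack provides, and it also accepts the self-closing forms <br/> and <br />:

Imports System
Imports System.Text.RegularExpressions

Module BrToCommaSketch
    ' Hypothetical helper: replaces <br>, <BR>, <br/> and <br /> with a comma.
    Function BrToComma(innerHtml As String) As String
        Return Regex.Replace(innerHtml.Trim(), "<br\s*/?>", ",", RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        ' Cell content taken from the 04:00 / TUE cell of the sample table.
        Console.WriteLine(BrToComma("RTY<BR>MHF<br>T54"))   ' prints RTY,MHF,T54
        ' A cell without any <br> is returned unchanged.
        Console.WriteLine(BrToComma("DFG21"))               ' prints DFG21
    End Sub
End Module

Inside the loop this would be used as data.Add(BrToComma(column.InnerHtml)) in place of the Trim/Replace line.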