How to parse an HTML table with PowerShell Core 7?
I have the following code:
$html = New-Object -ComObject "HTMLFile"
$source = Get-Content -Path $FilePath -Raw
try
{
$html.IHTMLDocument2_write($source) 2> $null
}
catch
{
$encoded = [Text.Encoding]::Unicode.GetBytes($source)
$html.write($encoded)
}
$t = $html.getElementsByTagName("table") | Where-Object {
$cells = $_.tBodies[0].rows[0].cells
$cells[0].innerText -eq "Name" -and
$cells[1].innerText -eq "Description" -and
$cells[2].innerText -eq "Default Value" -and
$cells[3].innerText -eq "Release"
}
The code works fine in Windows PowerShell 5.1, but in PowerShell Core 7 $_.tBodies[0].rows returns null.
So, how do I access the rows of an HTML table in PS 7?
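PowerShell (Core), as of 7.2.1, does not come with a built-in HTML parser.
You must rely on a third-party solution, such as the PowerHTML module, which wraps the HTML Agility Pack.
Its object model works differently from the Internet Explorer-based one available in Windows PowerShell; it is similar to the XML DOM provided by the standard System.Xml.XmlDocument type[1]; see the documentation and the sample code below.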
# Install the module on demand
If (-not (Get-Module -ErrorAction Ignore -ListAvailable PowerHTML)) {
Write-Verbose "Installing PowerHTML module for the current user..."
Install-Module PowerHTML -ErrorAction Stop
}
Import-Module -ErrorAction Stop PowerHTML
# Create a sample HTML file with a table with 2 columns.
Get-Item $HOME | Select-Object Name, Mode | ConvertTo-Html > sample.html
# Parse the HTML file into an HTML DOM.
$htmlDom = ConvertFrom-Html -Path sample.html
# Find a specific table by its column names, using an XPath
# query to iterate over all tables.
$table = $htmlDom.SelectNodes('//table') | Where-Object {
$headerRow = $_.Element('tr') # or: @($_.Elements('tr'))[0]
# Filter by column names
$headerRow.ChildNodes[0].InnerText -eq 'Name' -and
$headerRow.ChildNodes[1].InnerText -eq 'Mode'
}
# Print the table's HTML text.
$table.InnerHtml
# Extract the first data row's first column value.
# Note: @(...) is required around .Elements() for indexing to work.
@($table.Elements('tr'))[1].ChildNodes[0].InnerText
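[1] Notably with respect to supporting XPath queries via the .SelectSingleNode() and .SelectNodes() methods, exposing child nodes via a .ChildNodes collection, and providing .InnerHtml / .OuterHtml / .InnerText properties. Instead of an indexer that supports child element names, the methods .Element(<name>) and .Elements(<name>) are provided.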
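One caveat when indexing into .ChildNodes: the HTML Agility Pack keeps whitespace between tags as text nodes, so with indented HTML the first child of a row is typically a text node rather than a cell. The following is a minimal sketch, assuming the PowerHTML module is already installed; the sample markup and variable names are made up for illustration. It selects cells with an XPath query instead, which skips the text nodes.

```powershell
Import-Module -ErrorAction Stop PowerHTML

# Hypothetical sample markup with indentation between the tags.
$html = @'
<table>
  <tr>
    <th>Name</th>
    <th>Mode</th>
  </tr>
  <tr>
    <td>jdoe</td>
    <td>d----</td>
  </tr>
</table>
'@

$dom = ConvertFrom-Html -Content $html

# All rows of all tables, wherever they are nested.
$rows = $dom.SelectNodes('//table//tr')

# Select only element children (header or data cells) of the second row;
# whitespace text nodes between the cells are not matched by this query.
$cells = $rows[1].SelectNodes('th|td')

$cells[0].InnerText
```

Unlike .Elements(), .SelectNodes() returns a collection that can be indexed directly, so no @(...) wrapper is needed here.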
I used the answer above as the basis for my solution. I installed PowerHTML.
I wanted to extract the data table from https://www.dicomlibrary.com/dicom/dicom-tags/ and convert it.
From this:
<tr><td>(0002,0000)</td><td>UL</td><td>File Meta Information Group Length</td><td></td></tr>
To this:
{"00020000", "ULFile Meta Information Group Length"}
$page = Invoke-WebRequest https://www.dicomlibrary.com/dicom/dicom-tags/
# Pass the response body explicitly as the HTML content to parse.
$htmlDom = ConvertFrom-Html -Content $page.Content
$table = $htmlDom.SelectNodes('//table') | Where-Object {
    $headerRow = $_.Element('tr') # or: @($_.Elements('tr'))[0]
    # Filter by the first column name
    $headerRow.ChildNodes[0].InnerText -eq 'Tag'
}
$data = foreach ($row in $table.SelectNodes('tr'))
{
    # Skip rows that have no <td> cells, such as a header row made of <th> cells.
    if ($null -eq $row.SelectSingleNode('td[1]')) { continue }
    # Tag (first column): '(0002,0000)' becomes '{"00020000",'
    $a = $row.SelectSingleNode('td[1]').InnerText.Trim() -replace "`n|`r|\s+", " " -replace "\(",'{"' -replace ",","" -replace "\)",'",'
    # VR (second column), with all whitespace removed
    $b = $row.SelectSingleNode('td[2]').InnerText.Trim() -replace "`n|`r|\s+", ""
    # Description (third column), prefixed with the VR and closed with '"},'
    $c = $row.SelectSingleNode('td[3]').InnerText.Trim() -replace "`n|`r|\s+", " "
    $c = '"' + $b + $c + '"},'
    # Emit one object per row; the foreach output is collected in $data.
    [pscustomobject]@{ Tag = $a; Value = $c }
}
$data | Out-File c:\scripts\dd.txt
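The per-row text conversion can also be tried on its own, without downloading the page or loading PowerHTML. The sketch below uses a hypothetical helper, ConvertTo-DicomEntry, which is not part of the original post; it assumes the three cell texts have already been extracted and builds the target line with the -f format operator instead of chained -replace calls.

```powershell
# Hypothetical helper for illustration: turns the texts of the first three
# cells of a row into a line such as {"00020000", "UL..."}.
function ConvertTo-DicomEntry {
    param(
        [string] $Tag,          # e.g. '(0002,0000)'
        [string] $VR,           # e.g. 'UL'
        [string] $Description   # e.g. 'File Meta Information Group Length'
    )

    # Remove parentheses, commas and whitespace: '(0002,0000)' -> '00020000'
    $tagDigits = $Tag -replace '[(),\s]', ''

    # Collapse runs of whitespace (including line breaks) into single spaces.
    $text = $Description.Trim() -replace '\s+', ' '

    # '{{' and '}}' are escaped literal braces in a format string.
    '{{"{0}", "{1}{2}"}}' -f $tagDigits, $VR.Trim(), $text
}

ConvertTo-DicomEntry -Tag '(0002,0000)' -VR 'UL' -Description 'File Meta Information Group Length'
# -> {"00020000", "ULFile Meta Information Group Length"}
```

Keeping the string handling in one function makes it easy to check the output format against a known row before running it over the whole table.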