HTML Table to CSV
I'm new to PowerShell, and I'm using this code by JohnLBevan to convert an HTML table to CSV:
function ConvertFrom-HtmlTableRow {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTableRow
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        $headers
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        [switch]$isHeader
    )
    process {
        $cols = $htmlTableRow | select -expandproperty td
        if($isHeader.IsPresent) {
            0..($cols.Count - 1) | %{$x=$cols[$_] | out-string; if(($x) -and ($x.Trim() -gt [string]::Empty)) {$x} else {("Column_{0:0000}" -f $_)}} #clean the headers to ensure each col has a name
        } else {
            $colCount = ($cols | Measure-Object).Count - 1
            $result = new-object -TypeName PSObject
            0..$colCount | %{
                $colName = if($headers[$_]){$headers[$_]}else{("Column_{0:00000}" -f $_)} #in case we have more columns than headers
                $colValue = $cols[$_]
                $result | Add-Member NoteProperty $colName $colValue
            }
            write-output $result
        }
    }
}
function ConvertFrom-HtmlTable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTable
    )
    process {
        #currently only very basic <table><tr><td>...</td></tr></table> structure supported
        #could be improved to better understand tbody, th, nested tables, etc
        #$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow
        #remove any tags that aren't td or tr (simplifies our parsing of the data)
        [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>{0}</root>" -f ($htmlTable | select -ExpandProperty innerHTML | %{(($_ | out-string) -replace '(</?t[rdh])[^>]*(/?>)|(?:<[^>]*>)','$1$2') -replace '(</?)(?:th)([^>]*/?>)','$1td$2'}))
        [string[]]$headers = $cleanedHtml.root.tr | select -first 1 | ConvertFrom-HtmlTableRow -isHeader
        if ($headers.Count -gt 0) {
            $cleanedHtml.root.tr | select -skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | select $headers
        }
    }
}
However, whenever I run it against my parsed HTML variable, after getting the elements with the tag name "table", I receive this error:
Cannot convert value "<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>
</root>" to type "System.Xml.XmlDocument". Error: "The 'Tr' start tag on line 16 position 124 does not match the end tag of 'td'. Line 20, position 3."
At line:108 char:9
+ [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp ' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (:) [], RuntimeException
+ FullyQualifiedErrorId : InvalidCastToXmlDocument
I hope someone can help me. Thanks in advance.
I'm trying this on an external website. Here is the HTML code of the table:
<table class="organization-admin__table table">
<thead>
<tr>
<th colspan="2">Name</th>
<th>Email address</th>
<th>Timezone</th>
<th>Last logged in</th>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td width="48px">
<a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user1 <user1@site.com>" title="user1 <user1@site.com>">
<img src="https://portal.website.com/avatar/0fd7f51cee04789c617b1cc973e0b245.jpg?s=64&r=g&d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F87d37c%2Ffff%26text%3DTM" alt="user1 <user1@site.com>" width="32" height="32">
</a>
</td>
<td><a href="/site/users/samluser">user1</a></td>
<td><a href="mailto:user1@site.com">user1@site.com</a></td>
<td>Canada/Eastern</td>
<td>05 Aug 2021</td>
<td>
<ul class="button-group">
<li>
<a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
<i class="fa fa-pencil-alt"></i>
Edit
</a>
</li>
<li>
<a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
<i class="fa fa-trash-alt"></i>
Delete
</a>
</li>
</ul>
</td>
</tr>
<tr>
<td width="48px">
<a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user2 <user2@site.ca>" title="user2 <user2@site.ca>">
<img src="https://portal.website.com/avatar/481355c93fa79e47ca56110da63d6da5.jpg?s=64&r=g&d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F044f67%2Ffff%26text%3DVS" alt="user2 <user2@site.ca>" width="32" height="32">
</a>
</td>
<td><a href="/site/users/samluser">user2</a></td>
<td><a href="mailto:user2@site.ca">user2@site.ca</a></td>
<td>Canada/Eastern</td>
<td>16 Jul 2021</td>
<td>
<ul class="button-group">
<li>
<a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
<i class="fa fa-pencil-alt"></i>
Edit
</a>
</li>
<li>
<a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
<i class="fa fa-trash-alt"></i>
Delete
</a>
</li>
</ul>
</td>
</tr>
<tr>
<td width="48px">
<a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user3 <user3@site.com>" title="user3 <user3@site.com>">
<img src="https://portal.website.com/avatar/450f564aaba30e75fe70dc5f4bbefaf6.jpg?s=64&r=g&d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2Fffb61e%2Ffff%26text%3DWP" alt="Wilfred <user3@site.com>" width="32" height="32">
</a>
</td>
<td><a href="/site/users/samluser">Wilfred</a></td>
<td><a href="mailto:user3@site.com">Wilfred@site.com</a></td>
<td>UTC</td>
<td>26 Jul 2021</td>
<td>
<ul class="button-group">
<li>
<a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
<i class="fa fa-pencil-alt"></i>
Edit
</a>
</li>
<li>
<a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
<i class="fa fa-trash-alt"></i>
Delete
</a>
</li>
</ul>
</td>
</tr>
</tbody>
</table>
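As commented, converting to XML has strict rules: when the HTML omits a closing tag such as </tr>, loading it as XML will fail. The same goes for <img> tags that have no closing </img> tag.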
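To illustrate the point about strict XML rules, here is a minimal sketch. The two strings and the helper name Test-XmlCast are made up for this example and are not taken from the page in question; the sketch only shows that the [xml] cast rejects an unclosed <img> and accepts a self-closed one.

```powershell
# Hypothetical snippets: the first is valid HTML but not well-formed XML.
$htmlStyle = '<root><tr><td><img src="a.jpg"></td></tr></root>'    # <img> is never closed
$xmlStyle  = '<root><tr><td><img src="a.jpg" /></td></tr></root>'  # <img /> is self-closed

# Hypothetical helper: returns $true when the text can be cast to [xml].
function Test-XmlCast {
    param([string]$Text)
    try {
        $null = [xml]$Text
        $true
    }
    catch {
        # A failed cast raises a terminating error, which lands here.
        $false
    }
}

Test-XmlCast $htmlStyle
Test-XmlCast $xmlStyle
```

This is why walking the parsed DOM, as the functions in this answer do, tends to be more forgiving than forcing real-world HTML through an XML parser.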
I don't have the complete HTML you are loading, but perhaps try the function below:
function ConvertFrom_HtmlTable {
    # adapted from: https://www.leeholmes.com/blog/2015/01/05/extracting-tables-from-powershells-invoke-webrequest/
    [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
    param(
        [Parameter(Mandatory = $true, Position = 0)]
        [Microsoft.PowerShell.Commands.HtmlWebResponseObject]$WebRequest,
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
        [int]$TableIndex = 0,
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
        [string]$TableId,
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
        [string]$TableName,
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
        [string]$TableClassName
    )
    # Extract the table out of the web request
    switch ($PSCmdlet.ParameterSetName) {
        'ById'    { $table = $WebRequest.ParsedHtml.getElementByID($TableId) }
        'ByIndex' { $table = @($WebRequest.ParsedHtml.getElementsByTagName('table'))[$TableIndex]}
        'ByName'  { $table = @($WebRequest.ParsedHtml.getElementsByName($TableName))[0] }
        'ByClass' { $table = @($WebRequest.ParsedHtml.getElementsByClassName($TableClassName))[0] }
    }
    if (!$table) {
        Write-Warning "Could not find the given table."
        return $null
    }
    # load the System.Web assembly to be able to decode HTML entities
    Add-Type -AssemblyName System.Web
    $headers = @()
    # Go through all of the rows in the table
    foreach ($row in $table.Rows) {
        $cells = @($row.Cells)
        # If there is a table header, remember its titles
        if($cells[0].tagName -eq "TH") {
            $i = 0
            $headers = @($cells | ForEach-Object {
                $i++
                # decode HTML entities and double-up quotes that the value may contain
                $th = ([System.Web.HttpUtility]::HtmlDecode($_.InnerText) -replace '"', '""').Trim()
                # if the table header is empty, create it
                if ([string]::IsNullOrEmpty($th)) { "H$i" } else { $th }
            })
            # proceed with the next row
            continue
        }
        # if we haven't found any table headers, make up names "H1", "H2", etc.
        if(-not $headers) {
            $headers = @(1..($cells.Count + 2) | ForEach-Object { "H$_" })
        }
        # Now go through the cells in the row. For each, try to find the
        # title that represents that column and create a hashtable mapping those
        # titles to content
        $hash = [Ordered]@{}
        for ($i = 0; $i -lt $cells.Count; $i++) {
            # decode HTML entities and double-up quotes that the value may contain
            $value = ([System.Web.HttpUtility]::HtmlDecode($cells[$i].InnerText) -replace '"', '""').Trim()
            $th = $headers[$i]
            $hash[$th] = $value.Trim()
        }
        # And finally cast that hashtable to a PSCustomObject
        [PSCustomObject]$hash
    }
}
Call it like this:
$request = Invoke-WebRequest $uri
$table = ConvertFrom_HtmlTable -WebRequest $request -TableClassName 'organization-admin__table table'
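Or, if you know it is the first or the x-th table in the HTML, use the TableIndex parameter instead, since the table apparently has no id or name.
If that succeeds, you can simply write the result to CSV: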
$table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation
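To preview what ends up in the file without writing to disk, ConvertTo-Csv produces the same lines as strings. The sketch below uses a few made-up rows, with values borrowed from the sample table in the question, as a stand-in for the function's output.

```powershell
# Stand-in rows; in real use these would come from ConvertFrom_HtmlTable.
$rows = @(
    [PSCustomObject]@{ Name = 'user1';   'Email address' = 'user1@site.com'; Timezone = 'Canada/Eastern' }
    [PSCustomObject]@{ Name = 'Wilfred'; 'Email address' = 'user3@site.com'; Timezone = 'UTC' }
)

# One header line followed by one line per object.
$csv = $rows | ConvertTo-Csv -NoTypeInformation
$csv
```

The property names of the objects become the CSV column headers, which is why the function takes care to give every column a name.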
From your comments, it appears that for some reason you cannot use Invoke-WebRequest and have to parse the page using the IE COM object.
Try this version of the function:
function ConvertFrom_HtmlTable {
    [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
    param(
        [Parameter(ValueFromPipeline = $true, Mandatory = $true, Position = 0)]
        [string]$Url,
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
        [int]$TableIndex = 0,
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
        [string]$TableId,
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
        [string]$TableName,
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
        [string]$TableClassName,
        [switch]$FirstRowHasHeaders
    )
    $ie = New-Object -ComObject 'InternetExplorer.Application'
    $ie.Visible = $false
    $ie.Silent = $true
    $ie.Navigate($Url)
    # wait for IE to fully load the document
    while($ie.Busy) { Start-Sleep -Milliseconds 100 }
    $doc = $ie.Document
    switch ($PSCmdlet.ParameterSetName) {
        'ById'    { $table = $doc.IHTMLDocument3_getElementByID($TableId) }
        'ByIndex' { $table = @($doc.IHTMLDocument3_getElementsByTagName('table'))[$TableIndex]}
        'ByName'  { $table = @($doc.IHTMLDocument3_getElementsByName($TableName))[0] }
        'ByClass' { $table = @($doc.IHTMLDocument3_getElementsByClassName($TableClassName))[0] }
    }
    if ($table) {
        # Extracting table rows as a collection.
        $tbody = $table.childNodes | Where-Object { $_.tagName -eq "tbody" }
        if ($tbody) {
            $rows = $tbody.childNodes | Where-Object { $_.tagName -eq "tr" }
        }
        else {
            $rows = $table.childNodes | Where-Object { $_.tagName -eq "tr" }
        }
        # read or create table headers
        # assume the first row has headers either in <th> or <td> tags
        $firstRow = 1
        $headers = @($rows[0].childNodes | Where-Object { $_.tagName -eq "th" } | Foreach-Object { $_.innerHTML })
        if (!($headers)) {
            # there were no <th> tags found, so either use the first row as headers or create from scratch
            $values = @($rows[0].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
            if ($FirstRowHasHeaders) {
                # the headers are considered to be the values from the first row
                $headers = $values
            }
            else {
                # the table has no headers, so dynamically create them
                $firstRow = 0
                # wrap in @() so a single-column table still yields an array
                $headers = @(for ($i = 1; $i -le $values.Count; $i++) { "Column_$i" })
            }
        }
        # create a List object to store the values found as PSObjects
        $result = [System.Collections.Generic.List[object]]::new()
        for ($i = $firstRow; $i -lt $rows.Count; $i++) {
            $values = @($rows[$i].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
            $valuesCount = $values.Count
            while ($headers.Count -lt $valuesCount) {
                $colName = "Column_{0}" -f ($headers.Count + 1)
                $headers += $colName
                # we have just added a new header column. Make sure the first item also has this new column
                if ($result.Count) {
                    $result[0] | Add-Member -MemberType NoteProperty -Name $colName -Value $null
                }
            }
            # create an ordered Hashtable to store the values
            $data = [ordered]@{}
            for ($j = 0; $j -lt $valuesCount; $j++) { $data[$headers[$j]] = $values[$j] }
            # add the hash cast to PsCustomObject to the list
            $result.Add(([PsCustomObject]$data))
        }
    }
    else { Write-Warning "Could not find the given table." }
    # quit IE and clean up
    $ie.Quit()
    $null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($ie)
    [System.GC]::Collect()
    [System.GC]::WaitForPendingFinalizers()
    return $result
}
$table = ConvertFrom_HtmlTable -Url 'your URL here' -TableClassName 'organization-admin__table'
$table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation
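The second function, which uses the InternetExplorer.Application COM object, needs to find the table object through the DOM.
For that, the function currently uses the IHTMLDocument3 interface, which works for me on Windows 10 Pro, PowerShell 5.1 and IE version 11.789.19041.0, for example when I test it with: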
ConvertFrom_HtmlTable -Url 'https://www.w3schools.com/html/html_tables.asp' -TableId 'customers'
From your comment, you are getting this error message:
Method invocation failed because [mshtml.HTMLDocumentClass] does not
contain a method named 'IHTMLDocument3_getElementsByClassName'.
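This means you have a different (not updated or broken) version on your machine, and you will have to find out for yourself which method works:
- First, check which IE version you have by typing the following in a PowerShell console: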
(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').SvcVersion
If that returns blank, try
(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').Version
- Next, inside the switch, change the method used from
$doc.IHTMLDocument3_getElementsByClassName($TableClassName)
to
$doc.getElementsByClassName($TableClassName)
or
$doc.documentElement.getElementsByClassName($TableClassName)
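Rather than editing the switch by hand for each attempt, the three variants could be tried in turn. This is only a sketch under assumptions: Find-ByClassName is a hypothetical helper name, and it relies on a missing method raising an error that try/catch can intercept, which is how the "Method invocation failed" message above behaves.

```powershell
# Hypothetical helper: try each lookup variant and return the first match.
function Find-ByClassName {
    param($Document, [string]$ClassName)

    $attempts = @(
        { param($d, $c) $d.IHTMLDocument3_getElementsByClassName($c) },
        { param($d, $c) $d.getElementsByClassName($c) },
        { param($d, $c) $d.documentElement.getElementsByClassName($c) }
    )
    foreach ($attempt in $attempts) {
        try {
            $found = @(& $attempt $Document $ClassName)
            if ($found.Count -gt 0) { return $found[0] }
        }
        catch {
            # This variant is not available on this machine; try the next one.
        }
    }
    return $null
}

# Possible use inside the function's switch:
# 'ByClass' { $table = Find-ByClassName -Document $doc -ClassName $TableClassName }
```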
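If all of these fail, I'm afraid there is something seriously wrong with your computer (perhaps that is also why Invoke-WebRequest doesn't work?). Try running sfc /scannow to repair it.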