HTML Table to CSV

I'm new to PowerShell, and I'm using this code by JohnLBevan to convert an HTML table to CSV:

function ConvertFrom-HtmlTableRow {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTableRow
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        $headers
        ,
        [Parameter(Mandatory = $false, ValueFromPipeline = $false)]
        [switch]$isHeader

    )
    process {
        $cols = $htmlTableRow | select -expandproperty td
        if($isHeader.IsPresent) {
            0..($cols.Count - 1) | %{$x=$cols[$_] | out-string; if(($x) -and ($x.Trim() -gt [string]::Empty)) {$x} else {("Column_{0:0000}" -f $_)}} #clean the headers to ensure each col has a name        
        } else {
            $colCount = ($cols | Measure-Object).Count - 1
            $result = new-object -TypeName PSObject
            0..$colCount | %{
                $colName = if($headers[$_]){$headers[$_]}else{("Column_{0:00000}" -f $_)} #in case we have more columns than headers
                $colValue = $cols[$_]
                $result | Add-Member NoteProperty $colName $colValue
            } 
            write-output $result
        }
    }
}

function ConvertFrom-HtmlTable {
    [CmdletBinding()]
    param (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $htmlTable
    )
    process {
        #currently only very basic <table><tr><td>...</td></tr></table> structure supported
        #could be improved to better understand tbody, th, nested tables, etc

        #$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow

        #remove any tags that aren't td, th or tr (simplifies our parsing of the data), then turn th into td
        [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>{0}</root>" -f ($htmlTable | select -ExpandProperty innerHTML | %{(($_ | out-string) -replace '(</?t[rdh])[^>]*(/?>)|(?:<[^>]*>)','$1$2') -replace '(</?)(?:th)([^>]*/?>)','$1td$2'})) 
        [string[]]$headers = $cleanedHtml.root.tr | select -first 1 | ConvertFrom-HtmlTableRow -isHeader
        if ($headers.Count -gt 0) {
            $cleanedHtml.root.tr | select -skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | select $headers
        }
    }
}

But whenever I run it against the elements with tag name "table" taken from my parsed HTML variable, I get this error:

Cannot convert value "<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>

</root>" to type "System.Xml.XmlDocument". Error: "The 'Tr' start tag on line 16 position 124 does not match the end tag of 'td'. Line 20, position 3."
At line:108 char:9
+         [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp '  ...
+         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : InvalidArgument: (:) [], RuntimeException
    + FullyQualifiedErrorId : InvalidCastToXmlDocument

I hope someone can help me. Thanks in advance.

I'm trying this against an external website. Here is the HTML code of the table:

<table class="organization-admin__table table">
          <thead>
            <tr>
              <th colspan="2">Name</th>
              <th>Email address</th>
              <th>Timezone</th>
              <th>Last logged in</th>
              <th>Actions</th>
            </tr>
          </thead>
          <tbody>
            
            <tr>
              <td width="48px">
  <a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user1 &lt;user1@site.com&gt;" title="user1 &lt;user1@site.com&gt;">
    <img src="https://portal.website.com/avatar/0fd7f51cee04789c617b1cc973e0b245.jpg?s=64&amp;r=g&amp;d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F87d37c%2Ffff%26text%3DTM" alt="user1 &lt;user1@site.com&gt;" width="32" height="32">
  </a>

</td>
              <td><a href="/site/users/samluser">user1</a></td>
              <td><a href="mailto:user1@site.com">user1@site.com</a></td>
              <td>Canada/Eastern</td>
              <td>05 Aug 2021</td>
              <td>
                <ul class="button-group">
                  
                  
                  <li>
                    <a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
                      <i class="fa fa-pencil-alt"></i>
                      Edit
                    </a>
                  </li>
                  
                  
                  <li>
                    <a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
                      <i class="fa fa-trash-alt"></i>
                      Delete
                    </a>
                  </li>
                  
                </ul>
              </td>
            </tr>
            
            <tr>
              <td width="48px">
  <a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user2 &lt;user2@site.ca&gt;" title="user2 &lt;user2@site.ca&gt;">
    <img src="https://portal.website.com/avatar/481355c93fa79e47ca56110da63d6da5.jpg?s=64&amp;r=g&amp;d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F044f67%2Ffff%26text%3DVS" alt="user2 &lt;user2@site.ca&gt;" width="32" height="32">
  </a>

</td>
              <td><a href="/site/users/samluser">user2</a></td>
              <td><a href="mailto:user2@site.ca">user2@site.ca</a></td>
              <td>Canada/Eastern</td>
              <td>16 Jul 2021</td>
              <td>
                <ul class="button-group">
                  
                  
                  <li>
                    <a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
                      <i class="fa fa-pencil-alt"></i>
                      Edit
                    </a>
                  </li>
                  
                  
                  <li>
                    <a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
                      <i class="fa fa-trash-alt"></i>
                      Delete
                    </a>
                  </li>
                  
                </ul>
              </td>
            </tr>
            
            <tr>
              <td width="48px">
  <a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user3 &lt;user3@site.com&gt;" title="user3 &lt;user3@site.com&gt;">
    <img src="https://portal.website.com/avatar/450f564aaba30e75fe70dc5f4bbefaf6.jpg?s=64&amp;r=g&amp;d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2Fffb61e%2Ffff%26text%3DWP" alt="Wilfred &lt;user3@site.com&gt;" width="32" height="32">
  </a>

</td>
              <td><a href="/site/users/samluser">Wilfred</a></td>
              <td><a href="mailto:user3@site.com">Wilfred@site.com</a></td>
              <td>UTC</td>
              <td>26 Jul 2021</td>
              <td>
                <ul class="button-group">
                  
                  
                  <li>
                    <a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
                      <i class="fa fa-pencil-alt"></i>
                      Edit
                    </a>
                  </li>
                  
                  
                  <li>
                    <a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
                      <i class="fa fa-trash-alt"></i>
                      Delete
                    </a>
                  </li>
                  
                </ul>
              </td>
            </tr>
            
          </tbody>
    </table>

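As mentioned in the comments, converting to XML follows strict rules: when the HTML omits a closing tag such as </tr>, loading it as XML fails. The same goes for <img> tags that have no closing </img> tag.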
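
To see why the [xml] cast rejects this kind of markup, here is a minimal sketch. Test-XmlWellFormed is a hypothetical helper name (it is not part of the code in the question), and the two sample strings are made up for illustration:

```powershell
function Test-XmlWellFormed {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Markup
    )
    try {
        # the [xml] cast throws when the markup is not well-formed XML
        $null = [xml]$Markup
        return $true
    }
    catch {
        return $false
    }
}

# HTML-style <img> that is never closed: the XML parser rejects it
Test-XmlWellFormed '<root><td><img src="a.jpg"></td></root>'     # False

# the same element written as a self-closing XML tag parses fine
Test-XmlWellFormed '<root><td><img src="a.jpg" /></td></root>'   # True
```

The first call fails for the same kind of reason as the error in the question: the parser reaches </td> while <img> is still open.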

I don't have the complete HTML you are loading, but perhaps try the function below:

function ConvertFrom_HtmlTable {
    # adapted from: https://www.leeholmes.com/blog/2015/01/05/extracting-tables-from-powershells-invoke-webrequest/
    [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
    param(
        [Parameter(Mandatory = $true, Position = 0)]
        [Microsoft.PowerShell.Commands.HtmlWebResponseObject]$WebRequest,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
        [int]$TableIndex = 0,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
        [string]$TableId,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
        [string]$TableName,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
        [string]$TableClassName
    )

    # Extract the table out of the web request
    switch ($PSCmdlet.ParameterSetName) {
        'ById'    { $table = $WebRequest.ParsedHtml.getElementByID($TableId) }
        'ByIndex' { $table = @($WebRequest.ParsedHtml.getElementsByTagName('table'))[$TableIndex]}
        'ByName'  { $table = @($WebRequest.ParsedHtml.getElementsByName($TableName))[0] }
        'ByClass' { $table = @($WebRequest.ParsedHtml.getElementsByClassName($TableClassName))[0] }
    }
    if (!$table) {
        Write-Warning "Could not find the given table."
        return $null
    }

    # load the System.Web assembly to be able to decode HTML entities
    Add-Type -AssemblyName System.Web

    $headers = @()
    # Go through all of the rows in the table
    foreach ($row in $table.Rows) {
        $cells = @($row.Cells)
        # If there is a table header, remember its titles
        if($cells[0].tagName -eq "TH") {
            $i = 0
            $headers = @($cells | ForEach-Object {
                $i++
                # decode HTML entities (no need to double-up quotes: Export-Csv escapes them itself)
                $th = ([string][System.Web.HttpUtility]::HtmlDecode($_.InnerText)).Trim()
                # if the table header is empty, create it
                if ([string]::IsNullOrEmpty($th)) { "H$i" } else { $th }
            })
            # proceed with the next row
            continue
        }
        # if we haven't found any table headers, make up names "H1", "H2", etc.
        if(-not $headers) {
            $headers = @(1..($cells.Count + 2) | ForEach-Object { "H$_" })
        }

        # Now go through the cells in the row. For each, try to find the
        # title that represents that column and create a hashtable mapping those
        # titles to content
        $hash = [Ordered]@{}
        for ($i = 0; $i -lt $cells.Count; $i++) {
            # decode HTML entities (no need to double-up quotes: Export-Csv escapes them itself)
            $value = ([string][System.Web.HttpUtility]::HtmlDecode($cells[$i].InnerText)).Trim()
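            # fall back to a generated name when a row has more cells than headers (e.g. a header using colspan)
            $th = if ($i -lt $headers.Count) { $headers[$i] } else { "H$($i + 1)" }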
            $hash[$th] = $value.Trim()
        }
        # And finally cast that hashtable to a PSCustomObject
        [PSCustomObject]$hash
    }
}

Call it like this:

$request = Invoke-WebRequest $uri
$table = ConvertFrom_HtmlTable -WebRequest $request -TableClassName 'organization-admin__table table'

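Or, if you know it is the first or the nth table in the HTML, use the TableIndex parameter instead, since the table apparently has no id or name.

If that succeeds, you can simply write the result to CSV: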

$table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation

From your comments, it seems that for some reason you cannot use Invoke-WebRequest and have to parse the page using the IE COM object.

Try this version of the function:

function ConvertFrom_HtmlTable {
    [CmdletBinding(DefaultParameterSetName = 'ByIndex')]
    param(
        [Parameter(ValueFromPipeline = $true, Mandatory = $true, Position = 0)]
        [string]$Url,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
        [int]$TableIndex = 0,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
        [string]$TableId,

        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
        [string]$TableName,
    
        [Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
        [string]$TableClassName,

        [switch]$FirstRowHasHeaders
    )

    $ie = New-Object -ComObject 'InternetExplorer.Application'
    $ie.Visible = $false
    $ie.Silent = $true
    $ie.Navigate($Url)
    # wait for IE to fully load the document
    while($ie.Busy) { Start-Sleep -Milliseconds 100 }

    $doc = $ie.Document

    switch ($PSCmdlet.ParameterSetName) {
        'ById'    { $table = $doc.IHTMLDocument3_getElementByID($TableId) }
        'ByIndex' { $table = @($doc.IHTMLDocument3_getElementsByTagName('table'))[$TableIndex]}
        'ByName'  { $table = @($doc.IHTMLDocument3_getElementsByName($TableName))[0] }
        'ByClass' { $table = @($doc.IHTMLDocument3_getElementsByClassName($TableClassName))[0] }
    }

    if ($table) {
        # Extracting table rows as a collection.
        $tbody = $table.childNodes | Where-Object { $_.tagName -eq "tbody" }
        if ($tbody) {
            $rows = $tbody.childNodes | Where-Object { $_.tagName -eq "tr" }
        }
        else {
            $rows = $table.childNodes | Where-Object { $_.tagName -eq "tr" }
        }

        # read or create table headers
        # assume the first row has headers either in <th> or <td> tags
        $firstRow = 1
        $headers = @($rows[0].childNodes | Where-Object { $_.tagName -eq "th" } | Foreach-Object { $_.innerHTML })

        if (!($headers)) {
            # there were no <th> tags found, so either use the first row as headers or create from scratch
            $values = @($rows[0].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
            if ($FirstRowHasHeaders) {
                # the headers are considered to be the values from the first row
                $headers = $values
            }
            else {
                # the table has no headers, so dynamically create them
                $firstRow = 0
                $headers =  for ($i = 1; $i -le $values.Count; $i++) { "Column_$i" }
            }
        }

        # create a List object to store the values found as PSObjects
        $result = [System.Collections.Generic.List[object]]::new()
        for ($i = $firstRow; $i -lt $rows.Count; $i++) {
            $values = @($rows[$i].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
            $valuesCount = $values.Count
            while ($headers.Count -lt $valuesCount) {
                $colName = "Column_{0}" -f ($headers.Count + 1)
                $headers += $colName
                # we have just added a new header column. Make sure the first item also has this new column
                if ($result.Count) { 
                    $result[0] | Add-Member -MemberType NoteProperty -Name $colName -Value $null
                }
            }
            # create an ordered Hashtable to store the values
            $data = [ordered]@{}
            for ($j = 0; $j -lt $valuesCount; $j++) { $data[$headers[$j]] = $values[$j] }
            # add the hash cast to PsCustomObject to the list
            $result.Add(([PsCustomObject]$data))
        }
    }
    else { Write-Warning "Could not find the given table." }

    # quit IE and clean up
    $ie.Quit()
    $null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($ie)
    [System.GC]::Collect()
    [System.GC]::WaitForPendingFinalizers()

    return $result
}

$table = ConvertFrom_HtmlTable -Url 'your URL here' -TableClassName 'organization-admin__table'
$table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation

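The second function, which uses the InternetExplorer.Application COM object, has to find the table object through the DOM. To do that, it currently uses the IHTMLDocument3 interface, which works for me on Windows 10 Pro with PowerShell 5.1 and IE version 11.789.19041.0, for example when I test it with:
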
ConvertFrom_HtmlTable -Url 'https://www.w3schools.com/html/html_tables.asp' -TableId 'customers'

According to your comment, you are getting this error message:

Method invocation failed because [mshtml.HTMLDocumentClass] does not contain a method named 'IHTMLDocument3_getElementsByClassName'.

This means your machine has a different (not updated or corrupted) version, and you will have to find out for yourself which method works:

  1. First, check which IE version you have by typing this in a PowerShell console:
(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').SvcVersion

If that returns nothing, try

(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').Version

  2. Next, inside the switch, change the method used from
$doc.IHTMLDocument3_getElementsByClassName($TableClassName)

to

$doc.getElementsByClassName($TableClassName)

or

$doc.documentElement.getElementsByClassName($TableClassName)
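
The trial and error described above could also be wrapped in a small helper that tries each call in turn and keeps the first one that returns something. This is only a sketch: Get-ElementsByClassNameSafe is a hypothetical name (not part of MSHTML or PowerShell), and the three calls are simply the ones listed above. Which of them exists, if any, depends on the IE/MSHTML version on the machine.

```powershell
function Get-ElementsByClassNameSafe {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true)]
        $Document,      # assumed to be the $ie.Document object from the function above

        [Parameter(Mandatory = $true)]
        [string]$ClassName
    )

    # the candidate calls, in the order they are listed above
    $attempts = @(
        { param($d, $c) $d.IHTMLDocument3_getElementsByClassName($c) },
        { param($d, $c) $d.getElementsByClassName($c) },
        { param($d, $c) $d.documentElement.getElementsByClassName($c) }
    )

    foreach ($attempt in $attempts) {
        try {
            $found = @(& $attempt $Document $ClassName)
            if ($found.Count -gt 0) { return $found }
        }
        catch {
            # a missing method (or a null documentElement) ends up here; try the next candidate
            Write-Verbose "Lookup failed: $($_.Exception.Message)"
        }
    }

    # nothing worked or nothing matched
    return @()
}

# possible use inside the switch of the function above:
# 'ByClass' { $table = @(Get-ElementsByClassNameSafe -Document $doc -ClassName $TableClassName)[0] }
```

Running it with -Verbose shows which calls failed, which should tell you which variant your installation actually supports.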

If all of these fail, I'm afraid something is seriously wrong with your computer (perhaps that is also the reason Invoke-WebRequest does not work?). Try to repair it with sfc /scannow