DOM traversal in a web page saved as .mht
Is it possible to do DOM traversal (HTML only) in a web page that was saved as .mht or as .htm?
Preferably in PowerShell or .NET.
The goal is to be able to do something like getElementsByTagName('div').
If so, how?
Found a solution using HtmlAgilityPack.
Documentation can be found on NuDoq.
Sample code:
# Choose a source: a local .mht file or a URL (uncomment the one you need)
$Source = 'C:\temp\myFile.mht'
# $Source = 'http://www.google.com'
# Get online or mht content
$IE = New-Object -ComObject InternetExplorer.Application
# Don't show the browser
$IE.Visible = $false
# Browse to your webpage/file
$IE.Navigate($Source)
# Wait for the page to load (ReadyState 4 = complete)
while ($IE.Busy -or $IE.ReadyState -ne 4) { Start-Sleep -Milliseconds 50 }
# Get the html from that page
$Html = $IE.Document.body.parentElement.outerHTML
# Decode HTML-encoded characters such as &amp;
# HttpUtility lives in System.Web, which is not loaded by default
Add-Type -AssemblyName System.Web
$Html = [System.Web.HttpUtility]::HtmlDecode($Html)
# Close the browser
$IE.Quit()
# Use HtmlAgilityPack (must be installed first)
Add-Type -Path (Join-Path $Env:userprofile '.nuget\packages\htmlagilitypack.4.9.5\lib\Net40\HtmlAgilityPack.dll')
$Hap = New-Object HtmlAgilityPack.HtmlDocument
# Load the Html in HtmlAgilityPack to get a DOM
$Hap.LoadHtml($Html)
# Retrieve the data from the DOM (read a node)
[string]$partData = $Hap.DocumentNode.SelectSingleNode("//div[@class='formatted_content']/ul").InnerText
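The sample above reads a single node through XPath. For the original goal, something like getElementsByTagName('div'), a minimal sketch could look like the following. It assumes HtmlAgilityPack.dll sits at the same path as in the sample above; `Get-ElementsByTagName` is a hypothetical helper name and `$sample` is made-up inline HTML standing in for the `$Html` string obtained from the browser.

```powershell
# Assumption: same HtmlAgilityPack.dll location as in the sample above; adjust to your install
$HapDll = Join-Path $Env:userprofile '.nuget\packages\htmlagilitypack.4.9.5\lib\Net40\HtmlAgilityPack.dll'
Add-Type -Path $HapDll

# Hypothetical helper: rough equivalent of document.getElementsByTagName($TagName)
function Get-ElementsByTagName {
    param(
        [Parameter(Mandatory)] [string]$Html,
        [Parameter(Mandatory)] [string]$TagName
    )
    $doc = New-Object HtmlAgilityPack.HtmlDocument
    $doc.LoadHtml($Html)
    # Descendants() yields matching elements in document order and is simply
    # empty when nothing matches (SelectNodes("//$TagName") returns $null instead)
    return @($doc.DocumentNode.Descendants($TagName))
}

# Made-up sample markup; in practice pass the $Html string read from the .mht page
$sample = '<html><body><div id="a">first</div><p>skip</p><div id="b">second</div></body></html>'
$divs = Get-ElementsByTagName -Html $sample -TagName 'div'

# Print the id attribute and text of every div that was found
foreach ($div in $divs) {
    '{0}: {1}' -f $div.GetAttributeValue('id', ''), $div.InnerText
}
```

Descendants() is used here instead of SelectNodes('//div') only so that the no-match case needs no $null check; either call should work for plain tag-name lookups.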