PowerShell RSS Feed Encoding (UTF-8 | ISO-8859-1) problematic
I am trying to download and parse an RSS feed with PowerShell.
Invoke-WebRequest -Uri 'https://example.com/rss.php' -OutFile $file -UseBasicParsing -Headers @{"Content-Type"="text/html"; "charset"="utf-8"}
Below you can see the response/download with the wrong encoding.
<description>Foo übermitteln</description>
should be Foo übermitteln
<description>Größe<br></description>
should be Größe
<?xml version="1.0" encoding="iso-8859-1" ?><rss version="2.0">
<channel>
<title>Foo RSS-Feed</title>
<link>https://example.com</link>
<description>Foo übermitteln</description>
<language>de-de</language>
<copyright>Copyright 2019 Example.com</copyright>
<item>
<title>Lorem Ipsum</title>
<link>https://example.com/details.php?id=1234&hit=1</link>
<guid>1234</guid>
<category>Foo</category>
<pubDate>2019-08-09 10:12:49</pubDate>
<description>Größe<br></description>
</item>
</channel>
</rss>
Can anyone give me a hint on how to successfully encode/decode the response and parse it as XML?
At the moment I decode the stream manually with the following code:
$rssResponse = Invoke-WebRequest -UseBasicParsing -Method Get -Headers $defaultHeaders -Uri $uri
$rss = [System.Text.Encoding]::UTF8.GetString($rssResponse.RawContentStream.ToArray())
Response:
<?xml version="1.0" encoding="iso-8859-1" ?><rss version="2.0">
<channel>
<title>Foo RSS-Feed</title>
<link>https://example.com</link>
<description>Foo übermitteln</description>
<language>de-de</language>
<copyright>Copyright 2019 Example.com</copyright>
<item>
<title>Lorem Ipsum</title>
<link>https://example.com/details.php?id=1234&hit=1</link>
<guid>1234</guid>
<category>Foo</category>
<pubDate>2019-08-09 10:12:49</pubDate>
<description>Größe<br></description>
</item>
</channel>
</rss>
But it is still not right.
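There must be an easier way, but since your updated code already gives you almost the result you want, you only need to convert the HTML entities to plain text.
This should do it: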
Add-Type -AssemblyName System.Web
$rssResponse = Invoke-WebRequest -UseBasicParsing -Method Get -Headers $defaultHeaders -Uri $uri
$rss = [System.Web.HttpUtility]::HtmlDecode([System.Text.Encoding]::UTF8.GetString($rssResponse.RawContentStream.ToArray()))
Output:
<?xml version="1.0" encoding="iso-8859-1" ?><rss version="2.0">
<channel>
<title>Foo RSS-Feed</title>
<link>https://example.com</link>
<description>Foo übermitteln</description>
<language>de-de</language>
<copyright>Copyright 2019 Example.com</copyright>
<item>
<title>Lorem Ipsum</title>
<link>https://example.com/details.php?id=1234&hit=1</link>
<guid>1234</guid>
<category>Foo</category>
<pubDate>2019-08-09 10:12:49</pubDate>
<description>Größe<br></description>
</item>
</channel>
</rss>
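As a possible alternative, here is a minimal sketch that hands the raw bytes to the XML parser, so that the parser itself applies the encoding named in the XML declaration, and only then HTML-decodes individual text values. Running HtmlDecode over the whole document turns &lt;br&gt; into a bare <br>, which is no longer well-formed XML, so decoding after parsing may be safer. The sample bytes stand in for $rssResponse.RawContentStream.ToArray() and are made up for illustration; in particular, the double-escaped &amp;uuml; in the channel description is an assumption about how the feed delivers its entities, and the variable names are hypothetical.

```powershell
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')

# Hypothetical stand-in for $rssResponse.RawContentStream.ToArray().
# 0xF6 is "ö" and 0xDF is "ß" in ISO-8859-1; written as bytes so the
# example does not depend on the encoding of the script file itself.
$head = '<?xml version="1.0" encoding="iso-8859-1" ?><rss version="2.0"><channel>' +
        '<description>Foo &amp;uuml;bermitteln</description><item><description>Gr'
$tail = 'e&lt;br&gt;</description></item></channel></rss>'
[byte[]]$bytes = $latin1.GetBytes($head) + [byte[]](0xF6, 0xDF) + $latin1.GetBytes($tail)

# Decoding these bytes as UTF-8 replaces the two non-ASCII bytes with U+FFFD.
$asUtf8 = [System.Text.Encoding]::UTF8.GetString($bytes)

# Loading from a stream lets XmlDocument honor the declared iso-8859-1 encoding.
$xml = [System.Xml.XmlDocument]::new()
$xml.Load([System.IO.MemoryStream]::new($bytes))

# HTML-decode the text values after parsing, not the whole document before it.
$channelText = [System.Net.WebUtility]::HtmlDecode($xml.rss.channel.description)
$itemText    = [System.Net.WebUtility]::HtmlDecode($xml.rss.channel.item.description)

$channelText
$itemText
```

With a real response, $bytes would come from $rssResponse.RawContentStream.ToArray(). If the feed is not well-formed to begin with (for example an unescaped & in a link), Load will throw and the text would need to be repaired first.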