PowerShell RSS feed encoding problem (UTF-8 | ISO-8859-1)

I am trying to download and parse an RSS feed with PowerShell:

Invoke-WebRequest -Uri 'https://example.com/rss.php' -OutFile $file -UseBasicParsing -Headers @{"Content-Type"="text/html"; "charset"="utf-8"}

Below is the response/download, which comes back with the wrong encoding.

<?xml version="1.0" encoding="iso-8859-1" ?><rss version="2.0"> 
    <channel> 
        <title>Foo RSS-Feed</title> 
        <link>https://example.com</link> 
        <description>Foo übermitteln</description> 
        <language>de-de</language> 
        <copyright>Copyright 2019 Example.com</copyright> 

        <item> 
            <title>Lorem Ipsum</title>
            <link>https://example.com/details.php?id=1234&amp;hit=1</link> 
            <guid>1234</guid> 
            <category>Foo</category> 
            <pubDate>2019-08-09 10:12:49</pubDate> 
            <description>Gr&ouml;&#223;e<br></description>
        </item> 
    </channel>
</rss>

Can anyone give me a hint on how to encode/decode the response correctly and parse it as XML?


Update: at the moment I decode the stream manually with the following code:

$rssResponse = Invoke-WebRequest -UseBasicParsing -Method Get -Headers $defaultHeaders -Uri $uri
$rss = [System.Text.Encoding]::UTF8.GetString($rssResponse.RawContentStream.ToArray())

Response:

<?xml version="1.0" encoding="iso-8859-1" ?><rss version="2.0"> 
    <channel> 
        <title>Foo RSS-Feed</title> 
        <link>https://example.com</link> 
        <description>Foo übermitteln</description> 
        <language>de-de</language> 
        <copyright>Copyright 2019 Example.com</copyright> 

        <item> 
            <title>Lorem Ipsum</title>
            <link>https://example.com/details.php?id=1234&amp;hit=1</link> 
            <guid>1234</guid> 
            <category>Foo</category> 
            <pubDate>2019-08-09 10:12:49</pubDate> 
            <description>Gr&ouml;&#223;e<br></description>
        </item> 
    </channel>
</rss>

But there is still a problem: the HTML entities in the description are not decoded.

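There is probably an easier way, but since your updated code almost gives you the result you want, you only need to convert the HTML entities to plain text.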

This should do it:

Add-Type -AssemblyName System.Web

$rssResponse = Invoke-WebRequest -UseBasicParsing -Method Get -Headers $defaultHeaders -Uri $uri
$rss = [System.Web.HttpUtility]::HtmlDecode([System.Text.Encoding]::UTF8.GetString($rssResponse.RawContentStream.ToArray()))

Output:

<?xml version="1.0" encoding="iso-8859-1" ?><rss version="2.0"> 
    <channel> 
        <title>Foo RSS-Feed</title> 
        <link>https://example.com</link> 
        <description>Foo übermitteln</description> 
        <language>de-de</language> 
        <copyright>Copyright 2019 Example.com</copyright> 

        <item> 
            <title>Lorem Ipsum</title>
            <link>https://example.com/details.php?id=1234&hit=1</link> 
            <guid>1234</guid> 
            <category>Foo</category> 
            <pubDate>2019-08-09 10:12:49</pubDate> 
            <description>Größe<br></description>
        </item> 
    </channel>
</rss>
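
One caveat if the goal is to parse the result as XML: running HtmlDecode over the whole document also turns &amp;amp; into a bare &amp;, and the unclosed <br> is still there, so a cast to [xml] will fail on the output above. Also, since the feed declares iso-8859-1, decoding the bytes as UTF-8 can mangle characters such as ü. Below is a minimal sketch of an alternative, not a definitive fix. It decodes the bytes with the encoding named in the XML declaration, converts only the named HTML entities that XML does not define, and closes the <br> tag before parsing. The sample feed, the variable names and the two regex patterns are my own assumptions for illustration; [System.Net.WebUtility]::HtmlDecode is used instead of System.Web.HttpUtility so that no Add-Type call is needed.

```powershell
# Hypothetical sample standing in for the downloaded feed. [char]0xFC is the
# u-umlaut; building it this way keeps the script itself ASCII-only.
$sample = @"
<?xml version="1.0" encoding="iso-8859-1" ?><rss version="2.0">
    <channel>
        <description>Foo $([char]0xFC)bermitteln</description>
        <item>
            <link>https://example.com/details.php?id=1234&amp;hit=1</link>
            <description>Gr&ouml;&#223;e<br></description>
        </item>
    </channel>
</rss>
"@

# Simulate the raw bytes an ISO-8859-1 server would send.
$bytes = [System.Text.Encoding]::GetEncoding('iso-8859-1').GetBytes($sample)

# Read the encoding name from the XML declaration (ASCII-safe prefix),
# falling back to UTF-8 if no declaration is found.
$head = [System.Text.Encoding]::ASCII.GetString($bytes, 0, [Math]::Min(200, $bytes.Length))
if ($head -match 'encoding="([^"]+)"') { $name = $Matches[1] } else { $name = 'utf-8' }
$text = [System.Text.Encoding]::GetEncoding($name).GetString($bytes)

# Decode named HTML entities such as &ouml;, but leave the five entities XML
# defines itself (&amp; &lt; &gt; &quot; &apos;) and numeric references alone.
$pattern = '&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;'
$fixed = [regex]::Replace($text, $pattern,
    [System.Text.RegularExpressions.MatchEvaluator]{
        param($m) [System.Net.WebUtility]::HtmlDecode($m.Value)
    })

# Close the HTML-style <br> so the document is well-formed.
$fixed = $fixed -replace '<br\s*>', '<br/>'

# Parse and read values; the XML parser resolves &amp; and &#223; itself.
[xml]$doc = $fixed
$item = $doc.rss.channel.item
$doc.rss.channel.description
$item.link
$item.SelectSingleNode('description').InnerText
```

With the real feed you would replace the simulated $bytes with $rssResponse.RawContentStream.ToArray() from the code above. If the feed contains other unclosed HTML tags besides <br>, the last replace step would need to be extended accordingly.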