'Â' character being added to HTML response

Question

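I'm trying to extract content from a web page using Requests and Beautiful Soup.

While retrieving the page content with Requests, I ran into a rather thorny problem. As you can see in the screenshot (original page), Â characters appear to be inserted into the DOM (I have highlighted them to make this clearer).

Sample code: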

from bs4 import BeautifulSoup
import requests

url = 'https://technet.microsoft.com/en-us/sysinternals/bb963902'
r = requests.get(url=url)

with open('/Users/xxxx/test.html', 'wb') as f:
    f.write(r.content)

At first I thought this was because the encoding was not UTF-8, but that seems to be fine:

r.encoding
>> 'utf-8'

I also tried retrieving the same page with curl (curl 7.37.1 (x86_64-apple-darwin14.0) libcurl/7.37.1 SecureTransport zlib/1.2.5), but the same problem shows up in the output.

Answer 1

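You are receiving the file correctly. Because the HTML file itself carries no charset information, the browser guesses the wrong encoding (Western instead of Unicode) when you open the downloaded file.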
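To illustrate the effect, here is a minimal sketch of how a correct UTF-8 byte sequence turns into a stray 'Â' when it is read as a Western single-byte encoding. It assumes the affected character is a no-break space (U+00A0), which is a common cause of this symptom but is not confirmed for this particular page, and it uses cp1252 as a stand-in for the browser's "Western" guess.

```python
# Assumption: the page contains a no-break space (U+00A0).
nbsp = "\u00a0"

# In UTF-8 that single character is stored as two bytes.
utf8_bytes = nbsp.encode("utf-8")

# Read as a Western single-byte encoding, each byte becomes its own
# character: 0xC2 is 'Â' and 0xA0 is a no-break space.
misread = utf8_bytes.decode("cp1252")

# Decoding with the right codec recovers the original character.
roundtrip = utf8_bytes.decode("utf-8")

print(utf8_bytes)      # b'\xc2\xa0'
print(repr(misread))   # 'Â\xa0'
```

The bytes on disk are the same in both cases; only the interpretation differs, which is why curl shows the same output.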

The server sends the charset information in the Content-Type header, which is why the page renders correctly when you browse it online.
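One possible workaround, not something this answer itself proposes, is to make the saved file self-describing, since the Content-Type header is lost once the body is written to disk. The sketch below prepends a UTF-8 byte order mark, which browsers honor when detecting the encoding of a local HTML file. The helper name `with_utf8_bom` is hypothetical, and the sketch assumes the response body really is UTF-8, as `r.encoding` suggests.

```python
import codecs


def with_utf8_bom(content: bytes) -> bytes:
    """Return content prefixed with a UTF-8 BOM (hypothetical helper).

    Assumes content is already UTF-8 encoded; the bytes are not
    re-encoded, only marked so a browser does not have to guess.
    """
    if content.startswith(codecs.BOM_UTF8):
        return content  # already marked, leave unchanged
    return codecs.BOM_UTF8 + content


# Usage with the code from the question (needs network access):
#   r = requests.get(url)
#   with open('/Users/xxxx/test.html', 'wb') as f:
#       f.write(with_utf8_bom(r.content))
```

Adding a `<meta charset="utf-8">` element to the saved document's head would be an alternative with the same goal.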
