Question

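I am trying to build a Python crawler with the requests library. When I use the get method, the result I retrieve looks like this: THá» THAO. When I use curl, I get THỂ THAO, which is the result I expect. Here is my code: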

import requests
from bs4 import BeautifulSoup


def get_raw_channel():
    r = requests.get('http://vtv.vn/')
    raw_html = r.text
    soup = BeautifulSoup(raw_html)
    o_tags = soup.find_all("option")
    for o_tag in o_tags:
        print o_tag.text
        # raw_channel = RawChannel(o_tag.text.strip(), o_tag['value'])
        # channels_file.write(raw_channel.__str__() + '\n')

This is my curl command: curl http://vtv.vn/

Question: Why are the results different, and how can I get the same result as curl when using requests?

Answer 1

I tried your code, and in my case the detected encoding was 'ISO-8859-1'. Try encoding your data as UTF-8 before processing it with BS, something like:

...
raw_html = r.text.encode("utf-8")
soup = BeautifulSoup(raw_html)
...
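
To see why the encoding matters here, the mismatch can be reproduced without any network access. This is a minimal sketch, assuming the server sends UTF-8 bytes and the client decodes them as ISO-8859-1; the variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the garbled text by decoding UTF-8 bytes as ISO-8859-1.
# Assumption: the page body is UTF-8 on the wire.

original = u"THỂ THAO"

# Bytes as the server would send them.
wire_bytes = original.encode("utf-8")

# What a client sees when it decodes those bytes with the wrong codec.
wrong_text = wire_bytes.decode("iso-8859-1")

# Encoding the already wrong text as UTF-8 and decoding it again
# is a plain round trip, so the text stays garbled.
still_wrong = wrong_text.encode("utf-8").decode("utf-8")

# Undoing the wrong decode first, then decoding as UTF-8, recovers the text.
repaired = wrong_text.encode("iso-8859-1").decode("utf-8")

print(repr(wrong_text))
print(still_wrong == wrong_text)
print(repaired == original)
```

The garbled string begins with "THá»", matching the output in the question. The third character of the wrong decode is an invisible control character, which is why only two visible characters replace "Ể".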

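Update: I ran some more tests, and it turns out everything worked for me only because I had explicitly set the encoding on the response. Take a look: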

In [1]: import requests
In [2]: from BeautifulSoup import BeautifulSoup
In [3]: r = requests.get('http://vtv.vn/')
In [4]: r.encoding = "utf-8"
In [5]: raw_html = r.text
In [6]: soup = BeautifulSoup(raw_html)
In [7]: soup.findAll("option")
Out[7]: 
[<option value="1">
 VTV1</option>,
 ... stripped out some output ...

 VTVCab3 - Thể thao TV</option>,
 <option value="13">

 ... stripped out some output ...
]
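
As far as I know, requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset, which would explain the 'ISO-8859-1' seen above. The sketch below illustrates that decision with the standard library only. The function names `charset_from_content_type` and `decode_body` are hypothetical helpers, not part of requests, and the UTF-8 fallback is an assumption that suits this site rather than a general rule.

```python
# -*- coding: utf-8 -*-
# Sketch: choose a codec from a Content-Type header value, falling back to
# UTF-8 when the header names no charset. Helper names are hypothetical.


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type value, or None."""
    if not content_type:
        return None
    # Parameters follow the media type, separated by semicolons.
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset" and value:
            return value.strip("'\"")
    return None


def decode_body(body, content_type, fallback="utf-8"):
    """Decode raw response bytes, using the fallback if no charset is given."""
    encoding = charset_from_content_type(content_type) or fallback
    return body.decode(encoding, "replace")


# Example with made-up header values and a UTF-8 body.
body = u"VTVCab3 - Thể thao TV".encode("utf-8")
no_charset = decode_body(body, "text/html")
with_charset = decode_body(body, "text/html; charset=utf-8")
print(no_charset == with_charset)
```

With requests itself, the equivalent step is the one shown in the session above: assign r.encoding before reading r.text.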

Python - retrieved different result when using curl and requests library

curl

python-2.7

python-requests