Python BeautifulSoup read webpage

Question

Hi all... I am trying to read the "Most Popular" column on http://www.nydailynews.com/.

The markup as shown in Chrome looks like this:

And this is what I do:

import urllib2
from bs4 import BeautifulSoup

url = "http://www.nydailynews.com/"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

print soup.find_all(id='most-read-content')

But it returns nothing.

Is something wrong here? Is it because "Most Popular" is actually Flash or something like that?

Thanks.

Answer 1

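The problem starts earlier and has to do with downloading the actual text. With your code, page.read() returns a blank result.

The first line of the page source contains content="text/html; charset=utf-8", but either that is not true or the code is not set up to read utf-8.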
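One way to tell these cases apart is to look at the raw bytes before parsing them. The following is only a rough sketch, written for Python 3 rather than the Python 2 used in the question. `classify_payload` is a hypothetical helper, not something from the original answer. It reports whether a downloaded body is empty, starts with the gzip magic number, or decodes cleanly as UTF-8.

```python
import gzip


def classify_payload(raw):
    """Return a rough label for a downloaded body (hypothetical helper)."""
    if not raw:
        return "empty"
    # Gzip streams start with the two magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        return "gzip"
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"


if __name__ == "__main__":
    # Made-up sample markup, used only to exercise the helper.
    sample = "<div id='most-read-content'>caf\u00e9</div>".encode("utf-8")
    print(classify_payload(sample))
    print(classify_payload(gzip.compress(sample)))
```

Calling it on the result of page.read() should show quickly whether the declared charset is the real issue or whether the body is something other than plain text.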

Answer 2

"The problem is that the server returns the data compressed by Gzip."

See this reference:

encoding problem in Python when urlopen() a gbk page
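
A minimal sketch of how this could be handled, assuming Python 3 (so `urllib.request` stands in for the `urllib2` used in the question). The helper names `gunzip_if_needed` and `fetch_html` are hypothetical. The idea is to decompress the body when the response is marked `Content-Encoding: gzip`, or when it starts with the gzip magic bytes, before handing it to BeautifulSoup.

```python
import gzip
import urllib.request


def gunzip_if_needed(raw, content_encoding=None):
    """Decompress raw bytes if the header or the magic number says gzip."""
    declared = (content_encoding or "").lower()
    # 0x1f 0x8b is the gzip magic number; some servers compress the body
    # without the client having asked for it.
    if "gzip" in declared or raw[:2] == b"\x1f\x8b":
        return gzip.decompress(raw)
    return raw


def fetch_html(url):
    """Download url and return the decompressed bytes (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        encoding = response.headers.get("Content-Encoding")
    return gunzip_if_needed(raw, encoding)
```

The returned bytes can then be passed to the parser, for example `BeautifulSoup(fetch_html(url), "html.parser")`, and searched with `soup.find_all(id="most-read-content")`. Whether that id is actually present in the HTML the server sends is not something this sketch can confirm.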
