Download only the text from a webpage in Python

Question

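How can I download only text/html/javascript from a webpage in Python?

I am trying to gather some statistics on the text written by blog authors. I need only the text, and I want to speed up my program by not downloading images and similar content.

I am already able to separate the text from the HTML markup, so my goal is mainly to avoid downloading the additional content in a web page (images, .swf files, and so on).

So far I am using: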

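import urllib2

def fetch_html(url):
    # Return the page body only if the server reports it as HTML.
    user_agent = 'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3'
    headers = {'User-Agent': user_agent}
    req = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(req, timeout=60)
    content_type = response.info().getheader('Content-Type')
    # The header may be missing, so guard against None before testing it.
    if content_type and 'text/html' in content_type:
        return response.read()
    return None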

But I am not sure whether I am doing this right (that is, downloading only the text).

Answer 1

BeautifulSoup is one of the best libraries for parsing web pages in Python.

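import bs4
import urllib.request

# link is the URL of the page to fetch
with urllib.request.urlopen(link) as response:
    # Keep the raw bytes: str() on bytes would yield "b'...'" instead of HTML
    webpage = response.read()

# Name the parser explicitly so the result does not depend on what is installed
soup = bs4.BeautifulSoup(webpage, 'html.parser')

# get_text() drops the tags and returns the text
print(soup.get_text())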
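
On the question of whether only the text is downloaded: urllib requests just the document at the given URL (following redirects). It does not request the images, .swf files, or other resources the HTML refers to, because only a browser follows those references. The sketch below assumes Python 3 and uses only the standard library. It checks the Content-Type header before reading the body, then extracts the text while skipping script and style elements. The names fetch_html, TextExtractor, and visible_text are illustrative and do not come from the posts above.

```python
from html.parser import HTMLParser
import urllib.request


class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the text of an HTML string, fragments joined by one space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def fetch_html(url, timeout=60):
    """Return the decoded page if the server labels it text/html, else None."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        if response.headers.get_content_type() != "text/html":
            return None  # the body is not read for non-HTML responses
        # Assumption: fall back to UTF-8 when the server names no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With these helpers, visible_text(fetch_html(url)) would give the text to run statistics on. fetch_html returns None for non-HTML responses, so that case needs a check before the call.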

python

http

urllib2