How can I reliably web-scrape a largely unattached line?

Question

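Sorry if the title is vague. I'm trying to scrape the number of XKCD web comics on a regular basis. The front page at http://xkcd.com/ always shows the latest comic, and further down the page there is a line that reads: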

Permanent link to this comic: http://xkcd.com/1520/

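Here 1520 is the number of the latest comic on display. I want to scrape this number, but I can't find any good way to do it. Currently all of my attempts look like this: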

import urllib
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib.urlopen('http://xkcd.com/').read())
test = soup.find_all('div')[7].get_text().split()[20][-5:-1]

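I mean, it technically works, but it will break badly if anything on the site shifts even slightly. I know there has to be a better way to search a section of the front page for http://xkcd.com/####/ and then return just the ####, but I can't seem to find it. The Permanent link to this comic: http://xkcd.com/1520/ line seems to just float there, without any kind of tag, class, or ID. Can anyone help?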

Answer 1

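Usually I stick with an HTML parser. Here, since we are looking for specific text in the HTML (without checking any tags), it is fine to apply a regular expression search for: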

Permanent link to this comic: http://xkcd.com/(\d+)/

which saves the number in a capturing group.

Demo:

>>> import re
>>> import requests
>>> 
>>> 
>>> data = requests.get("http://xkcd.com/").content
>>> pattern = re.compile(r'Permanent link to this comic: http://xkcd.com/(\d+)/')
>>> print pattern.search(data).group(1)
1520
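The same idea can be wrapped in a small function that copes with a missing match, so a page change yields None instead of an AttributeError. This is only a sketch under some assumptions: the helper name latest_comic_number is hypothetical, the sample HTML fragment is made up for illustration, and accepting both http and https in the pattern is a guess about what the page might serve. It uses only the standard library and print() syntax that runs on Python 2.7 and Python 3.

```python
import re

# Same pattern as in the demo, with the dots escaped. Tolerating "https"
# is an assumption about the page, not something taken from the original.
PERMALINK_RE = re.compile(
    r'Permanent link to this comic: https?://xkcd\.com/(\d+)/'
)


def latest_comic_number(html):
    """Return the comic number from the permalink line, or None if absent."""
    match = PERMALINK_RE.search(html)
    if match is None:
        return None
    # group(1) is the text captured by (\d+)
    return int(match.group(1))


# Made-up fragment standing in for the downloaded page text.
sample = '<br />\nPermanent link to this comic: http://xkcd.com/1520/<br />'
print(latest_comic_number(sample))                 # 1520
print(latest_comic_number('<p>no link here</p>'))  # None
```

If the page is fetched with requests under Python 3, pass the response's .text rather than .content to such a function, because .content is bytes there and a str pattern cannot be used to search bytes.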

python

beautifulsoup

web-scraping

python-2.7