How do I scrape the link from an <a> string in one line?

Question

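I'm building a web scraper that has many different variables, so it matters to me that each variable stays on a single line. The variable I'm currently working on looks like this: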

<a href="http://website.com/example/123" target="_blank">Example</a>

Is there a simple way to pull out just the URL (http://website.com/example/123 in this case) in one line of code?

I'm currently using urllib, re, and BeautifulSoup, so any of those libraries is fine. I tried adding

.find('a', attrs={'href': re.compile("^http://")})

to the end of my line, but that made the output return nothing.

Answer 1

I believe all you have to do is index your variable with ['href']:

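from bs4 import BeautifulSoup

html = '''<a href="http://website.com/example/123" target="_blank">Example</a>'''

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

# href=True keeps only <a> tags that actually have an href attribute
for a in soup.find_all('a', href=True):
    print("Found the URL: " + a['href'])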

Found the URL: http://website.com/example/123
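
Since the question says re is also acceptable, here is a rough sketch of a regex-only one-liner. It assumes the tag is available as a plain string (for example via str() on the tag) and that the href value is wrapped in double quotes; the names html and link are only illustrative.

```python
import re

# Assumed input: the <a> tag as a plain string with a double-quoted href
html = '<a href="http://website.com/example/123" target="_blank">Example</a>'

# One line: capture everything between the quotes that follow href=
link = re.search(r'href="([^"]+)"', html).group(1)

print(link)
```

Note that re.search returns None when nothing matches, so .group(1) would raise AttributeError on a tag without a double-quoted href. As for the attempt in the question: .find() only searches a tag's descendants, so calling it on the <a> tag itself finds nothing, which is probably why the output came back empty. Indexing the tag with ['href'] avoids that.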


python

regex

beautifulsoup

python-2.7