BeautifulSoup returns shortened URLs for pages on the same website

Question

My code, for reference:

import httplib2
from bs4 import BeautifulSoup

h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')
soup = BeautifulSoup(content, "lxml")
urls = []
for tag in soup.findAll('a', href=True):
    urls.append(tag['href'])
responses = []
contents = []
for url in urls:
    try:
        response1, content1 = h.request(url)
        responses.append(response1)
        contents.append(content1)
    except:
        pass

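My idea is to fetch the page's payload and then scrape the hyperlinks from it. One of the links points to yahoo.com, and the other to "http://csb.stanford.edu/class/public/index.html".

But the result I get from BeautifulSoup is: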

>>> urls
['http://www.yahoo.com/', '../../index.html']

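This is a problem, because the second part of the script cannot make a request with the shortened second URL. Is there a way to make BeautifulSoup retrieve the full URL?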

Answer 1

That's because the link on the page really is written in that form. The HTML in the page is:

<p>Or let's just link to <a href=../../index.html>another page on this server</a></p>

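This is called a relative link: it is resolved against the URL of the page it appears on.

To convert it to an absolute link, you can use urljoin from the standard library.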

from urllib.parse import urljoin  # Python3

urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html',
        '../../index.html')
# returns http://csb.stanford.edu/class/public/index.html
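
Applied to the list from the question, this could look like the minimal sketch below. The helper name to_absolute and the variable base_url are made up for illustration; the only library call involved is urljoin. Absolute hrefs such as the yahoo.com one pass through unchanged, so every href can be resolved the same way.

```python
from urllib.parse import urljoin

def to_absolute(base_url, hrefs):
    """Resolve each href against the URL of the page it was found on.

    to_absolute is a hypothetical helper, not part of BeautifulSoup.
    """
    return [urljoin(base_url, href) for href in hrefs]

# URL of the page that was scraped in the question
base_url = 'http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html'

# The href values BeautifulSoup returned in the question
hrefs = ['http://www.yahoo.com/', '../../index.html']

urls = to_absolute(base_url, hrefs)
print(urls)
# ['http://www.yahoo.com/', 'http://csb.stanford.edu/class/public/index.html']
```

In the question's own loop, the same effect should come from appending urljoin(base_url, tag['href']) instead of tag['href'], so that the second loop only ever sees absolute URLs.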

beautifulsoup

httplib2

python-3.x