Verify if link starts with http when using lxml and xpath in Python

Question

I am trying to print all the links from multiple pages using the following XPath:

my_page = '//div[@class="product_info"]//table//tr[7]//td[2]//a/@href'

This works for most links, but in some cases I get something like this:

<a href="To follow">To follow</a>, which is not a link.

How can I skip these? What condition should I use in the loop below:

# some more code
EMPTY = ''
my_page = '//div[@class="product_info"]//table//tr[7]//td[2]//a/@href'

for part in dom1.xpath(my_page):
    FINAL_URL = urlparse.urljoin(url, part)

    if part == EMPTY:
        continue
    print part

Answer 1

To keep only the links that start with http:// or https://, add a condition inside the loop:

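# some more code
other_links = set()      # href values that are not http(s) URLs
processed_links = set()  # http(s) links that were already handled
my_page = '//div[@class="product_info"]//table//tr[7]//td[2]//a/@href'

for part in dom1.xpath(my_page):
    # startswith accepts a tuple, so both schemes are checked in one call
    if part.startswith(('http://', 'https://')):
        if part not in processed_links:
            processed_links.add(part)
            FINAL_URL = urlparse.urljoin(url, part)
    else:
        other_links.add(part)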

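I also added some code so that:

You collect all the other links that were not processed.
If the same (valid) link appears more than once on the page, you process it only once.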
