Using the startswith function to filter a list of URLs

Question

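I have the following code, which extracts all links from a page and puts them in a list (links=[]), which is then passed to the function filter_links(). I want to filter out every link that is not from the same domain as the starting link (i.e. the first link in the list). This is what I have: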

import requests
from bs4 import BeautifulSoup
import re

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])


def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
        return filtered_links


print(filter_links(links))

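I used the built-in startswith function, but it filters out everything except the starting URL. Eventually I want to pass several different start URLs through this program, so I need a generic way of filtering URLs that are within the same domain as the starting URL. I think I could use a regex, but this function should work too?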

Answer 1

Try this:

import requests
from bs4 import BeautifulSoup
import re
import tldextract

start_url = "http://www.enzymebiosystems.org/"
r = requests.get(start_url)
html_content = r.text
soup = BeautifulSoup(html_content, features='lxml')
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

def filter_links(links):
    ext = tldextract.extract(start_url)
    domain = ext.domain
    filtered_links = []
    for link in links:
        if domain in link:
            filtered_links.append(link)
    return filtered_links


print(filter_links(links))

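Notes:

You need to move the return statement out of the for loop. As written, it returns the result after iterating over just one element, so only the first item in the list is ever returned.
Use the tldextract module to extract the domain name from the URL more reliably. Whether you want to explicitly check that a link starts with links[0] is up to you.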

Output:

['http://enzymebiosystems.org', 'http://enzymebiosystems.org/', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/recent-developments/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/contact-us/', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/about', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/leadership/marketing-strategy', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/contact-us', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/view-sec-filings/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/unregistered-sale-of-equity-securities/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/enzymebiosystems-files-sec-form-8-k-change-in-directors-or-principal-officers/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/form-10-q-for-enzymebiosystems/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org', 'http://enzymebiosystems.org/leadership/about/', 'http://enzymebiosystems.org/leadership/directors-advisors/', 'http://enzymebiosystems.org/leadership/mission-values/', 'http://enzymebiosystems.org/leadership/marketing-strategy/', 
'http://enzymebiosystems.org/leadership/business-strategy/', 'http://enzymebiosystems.org/technology/research/', 'http://enzymebiosystems.org/technology/manufacturer/', 'http://enzymebiosystems.org/investors-media/news/', 'http://enzymebiosystems.org/investors-media/investor-relations/', 'http://enzymebiosystems.org/investors-media/press-releases/', 'http://enzymebiosystems.org/investors-media/stock-information/', 'http://enzymebiosystems.org/investors-media/presentations-downloads/', 'http://enzymebiosystems.org/contact-us']
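
One caveat with the substring test `domain in link`: it also keeps any link that merely mentions the domain somewhere in its path or query string, and it drops relative links such as `/contact-us/`. A stricter variant, sketched here with only the standard library's `urllib.parse` (swapped in for tldextract, so it compares hostnames rather than registered domains), might look like the following. The names `normalize_host` and `filter_same_domain` and the sample hrefs are hypothetical, chosen for illustration.

```python
from urllib.parse import urljoin, urlparse


def normalize_host(url):
    """Return the lowercase hostname of url without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_same_domain(links, start_url):
    """Keep links whose hostname matches start_url's hostname.

    Relative links are resolved against start_url before comparing.
    """
    start_host = normalize_host(start_url)
    kept = []
    for link in links:
        absolute = urljoin(start_url, link)  # '/about' becomes a full URL
        if normalize_host(absolute) == start_host:
            kept.append(absolute)
    return kept


# Hypothetical hrefs, for illustration only.
sample_links = [
    "http://enzymebiosystems.org/contact-us/",
    "/leadership/about/",
    "http://example.com/?ref=enzymebiosystems",
    "mailto:info@enzymebiosystems.org",
]
print(filter_same_domain(sample_links, "http://www.enzymebiosystems.org/"))
```

Because `start_url` is passed in as an argument, the same function can be reused for each start URL. Note that this treats `www.` as optional but would not match other subdomains; if those should count as the same domain, tldextract's registered-domain comparison is probably the better fit.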

Answer 2

Possible solution

What if you kept all links that contain the domain?

For example:

import pandas as pd

# `soup` is the BeautifulSoup object built in the question's code.
links = []
for tag in soup.find_all('a', href=True):
    links.append(tag['href'])

all_links = pd.DataFrame(links, columns=["Links"])
enzyme_df = all_links[all_links.Links.str.contains("enzymebiosystems")]

# results in a dataframe with links containing "enzymebiosystems".

If you want to search for multiple domains,
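
For several domains, one possible approach (an assumption here, not necessarily what this answer had in mind) is to join the domain names into a single regular expression and keep every link that matches any of them. The sketch below uses plain Python and `re` instead of pandas; the function name `filter_by_domains` and the sample data are hypothetical.

```python
import re


def filter_by_domains(links, domains):
    """Keep links that contain any of the given domain names as a substring."""
    if not domains:
        return []  # an empty pattern would otherwise match every link
    # re.escape protects '.' and other regex metacharacters in a domain name.
    pattern = re.compile("|".join(re.escape(d) for d in domains))
    return [link for link in links if pattern.search(link)]


# Hypothetical links, for illustration only.
sample_links = [
    "http://enzymebiosystems.org/contact-us/",
    "http://example.com/page",
    "http://other.example.net/",
]
print(filter_by_domains(sample_links, ["enzymebiosystems", "example.com"]))
```

Since pandas' `str.contains` interprets its pattern as a regular expression by default, the same joined pattern could presumably be passed to it in the DataFrame version above.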

Answer 3

OK, so you have an indentation error in filter_links(links). The function should look like this:

def filter_links(links):
    filtered_links = []
    for link in links:
        if link.startswith(links[0]):
            filtered_links.append(link)
    return filtered_links

Note that in your code the return statement is inside the for loop, so the loop runs once and then returns the list.
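
To make the effect of the indentation concrete, here is a small sketch that runs both versions on a hypothetical list of links. It also takes the start URLs as a parameter instead of relying on links[0], which is an assumption on my part to cover the "several different start URLs" case from the question; `str.startswith` accepts a tuple of prefixes, so one call can test all of them.

```python
def filter_links_buggy(links, start_url):
    """The question's version: return is indented inside the for loop."""
    filtered_links = []
    for link in links:
        if link.startswith(start_url):
            filtered_links.append(link)
        return filtered_links  # runs on the first iteration, ending the loop


def filter_links(links, start_urls):
    """Keep links that start with any of the given start URLs."""
    prefixes = tuple(start_urls)  # startswith needs a str or a tuple of str
    return [link for link in links if link.startswith(prefixes)]


# Hypothetical links, for illustration only.
links = [
    "http://a.example/",
    "http://a.example/about",
    "http://b.example/contact",
]

print(filter_links_buggy(links, "http://a.example/"))  # only the first link
print(filter_links(links, ["http://a.example/"]))      # both a.example links
print(filter_links(links, ["http://a.example/", "http://b.example/"]))
```

Keep in mind that a prefix test is sensitive to small differences such as `www.` or a trailing slash, so the start URLs need to be written exactly as they appear in the page's links.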

Hope this helps :)

