Find all URLs with a specific anchor text Python / BeautifulSoup4

I'm trying to get all the URLs that have the anchor text Personal Website. Here's an example of the HTML I'm talking about.

<a href="http://example.com" class>Personal website</a>

Right now I'm trying this:

for link in bio_link_list:
    site = soup.find_all("a", href = True, text = "Personal Website")
    site_list.append(site)

where bio_link_list is just the list of links I'm scraping. But this just returns an empty list. To clarify, I want a list of the URLs that have that specific anchor text.

You seem to be ignoring the link variable. Are you sure you don't want to make a request to link first and then scrape the resulting HTML? Either way, try this:

urls = [tag["href"] for tag in soup.find_all("a", href=True) if tag.getText() == "Personal website"]
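If bio_link_list holds profile URLs that still need to be fetched, a minimal sketch of that request-then-scrape loop could look like this (assuming each entry is a full, requestable URL):

import requests
from bs4 import BeautifulSoup

site_list = []
for link in bio_link_list:
    # fetch each profile page first, then search the resulting HTML for the anchor text
    page = BeautifulSoup(requests.get(link).content, "html.parser")
    site_list.extend(
        tag["href"] for tag in page.find_all("a", href=True)
        if tag.get_text(strip=True) == "Personal website"
    )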

The reason your code isn't working is that there is no anchor with the text Personal Website there. But after inspecting the source, you can easily grab all the a elements and filter them for the bio string.

Try this:

import requests
from bs4 import BeautifulSoup

url = "https://www.stern.nyu.edu/faculty/search_name_form"

soup = BeautifulSoup(requests.get(url).content, "html.parser").find_all("a")
bio_links = [a['href'] for a in soup if "bio" in a['href']]

print(f"Found {len(bio_links)} bio links:")
print(bio_links)

Output:

Found 465 bio links:
['https://www.stern.nyu.edu/faculty/bio/viral-acharya', 'https://www.stern.nyu.edu/faculty/bio/allen-adamson', 'https://www.stern.nyu.edu/faculty/bio/beril-afsar',...]

This yields 465 results, exactly the same value as the <div class="results">465 results</div> on the page.
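If you want to check that count programmatically rather than by eye, a small sketch (assuming the results div really is structured as quoted above) could be:

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.stern.nyu.edu/faculty/search_name_form"
page = BeautifulSoup(requests.get(url).content, "html.parser")
bio_links = [a["href"] for a in page.find_all("a", href=True) if "bio" in a["href"]]

# the page reports its own total, e.g. <div class="results">465 results</div>
reported = int(re.search(r"\d+", page.find("div", class_="results").get_text()).group())
print(reported == len(bio_links))  # True when the scrape picked up every result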

Edit: I originally misunderstood your question, but here's one way to get the personal websites from the bio urls.

import requests
from bs4 import BeautifulSoup

url = "https://www.stern.nyu.edu/faculty/search_name_form"

soup = BeautifulSoup(requests.get(url).content, "html.parser").find_all("a")
bio_links = [a['href'] for a in soup if "bio" in a['href']]

personal_sites = []
for link in bio_links:
    print(link)
    soup = BeautifulSoup(requests.get(link).content, "html.parser")
    personal_sites.extend(
        [
            a["href"] for a in soup.find_all("a") 
            if a.getText() == "Personal website"
        ]
    )

print(personal_sites)

Output:

['http://pages.stern.nyu.edu/~sternfin/vacharya/public_html/~vacharya.htm', 'http://brandsimple.com/about-allen-adamson/', 'http://pages.stern.nyu.edu/~sternfin/talbanese/', 'http://www.stern.nyu.edu/~ealtman', ...]

Finally, you can use the lxml module and XPath to speed up fetching the personal website links.

import requests

from bs4 import BeautifulSoup
from lxml import html

url = "https://www.stern.nyu.edu/faculty/search_name_form"
ps_xpath = '//*[@id="bio-details"]/div[2]/p[3]/a[2]/@href'


def get_page(url: str):
    return requests.get(url).content


def get_personal_site(url: str):
    ps = html.fromstring(get_page(url)).xpath(ps_xpath)
    return next(iter(ps), None)


def scrape_sites(bio_links: list):
    return [get_personal_site(link) for link in bio_links]


soup = BeautifulSoup(get_page(url), "html.parser").find_all("a")
links = [a['href'] for a in soup if "bio" in a['href']]

print(scrape_sites(links))
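Most of the time here goes to the 400-odd network requests rather than to parsing, so if it is still too slow, the same scrape_sites step could also be run concurrently. A sketch using a thread pool, reusing get_personal_site and links from the snippet above (the worker count of 16 is an arbitrary choice, not something from the original answer):

from concurrent.futures import ThreadPoolExecutor

def scrape_sites_concurrently(bio_links: list):
    # fetch the bio pages in parallel; results come back in the same order as bio_links
    with ThreadPoolExecutor(max_workers=16) as pool:
        return list(pool.map(get_personal_site, bio_links))

print(scrape_sites_concurrently(links))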

This answer assumes you already have the profile URLs in bio_link_list. If not, that information can be scraped first with a small change to the code below, before the profile pages are evaluated.
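If you don't have those profile URLs yet, one possible way to build List.txt for the script below (borrowing the bio-link filter from the earlier answer; the file name simply matches the one used further down) might be:

import requests
from bs4 import BeautifulSoup

search_url = "https://www.stern.nyu.edu/faculty/search_name_form"
anchors = BeautifulSoup(requests.get(search_url).content, "html.parser").find_all("a", href=True)

# write one profile URL per line so the script below can read them back in
with open("List.txt", "w", encoding="utf-8") as f:
    for a in anchors:
        if "bio" in a["href"]:
            f.write(a["href"] + "\n")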

You can try this solution. The text file List.txt is the source of the urls, one per line. It also writes all of the output to a csv file for later use.

import re
from datetime import datetime
from urllib.request import urlopen

from bs4 import BeautifulSoup

dataList = []

New_date_id = datetime.now().strftime("%Y%m%d-%H%M%S")

with open('List.txt') as urlList:
    # one URL per line; strip the trailing newline before requesting it
    urls = (line.strip() for line in urlList if line.strip())

    for url in urls:
        site = urlopen(url)
        soup = BeautifulSoup(site, "lxml")
        for item in soup.findAll('a', text=re.compile('Personal website')):
            dataList.append(item.get('href'))

# write every matched href to a timestamped csv file for later use
with open(New_date_id + "_" + 'data.csv', 'w', newline='', encoding='utf-8') as csv_file:
    for line in dataList:
        csv_file.write(f'{line}\n')
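Since each row holds a single value, the plain write above is enough, but if you later add more columns the csv module handles the quoting for you. A small variation on the same write step, reusing New_date_id and dataList from the script above:

import csv

with open(New_date_id + "_" + "data.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    for href in dataList:
        writer.writerow([href])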

I think a simple if condition will do the trick:

import requests
from bs4 import BeautifulSoup

links = []
for url in all_bio_links:
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for a in soup.find_all('a', href=True):
        if a.text == "Personal website":
            links.append(a['href'])
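
One last note: the question searched for Personal Website while the markup actually says Personal website, so a case-insensitive comparison avoids that mismatch. A sketch with a hypothetical personal_sites helper (not something from the answers above):

import requests
from bs4 import BeautifulSoup

def personal_sites(bio_urls, label="personal website"):
    # hypothetical helper: collect hrefs whose anchor text matches label, ignoring case
    links = []
    for url in bio_urls:
        soup = BeautifulSoup(requests.get(url).content, "html.parser")
        for a in soup.find_all("a", href=True):
            if a.text.strip().lower() == label:
                links.append(a["href"])
    return links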