How do I scrape data from URLs in a python-scraped list of URLs?
I am trying to use BeautifulSoup4 in Orange to scrape data from a list of URLs that were themselves scraped from the same website.
I have managed to scrape the data from a single page when I set the URL manually:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import csv
import re
url = "https://data.ushja.org/awards-standings/zone-points.aspx?year=2021&zone=1§ion=1901"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
rank = soup.find("table", class_="table-standings-body")
for child in rank.children:
    print(url, child)
And I have been able to scrape the list of URLs that I need:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import csv
import re
url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
rank = soup.find("table", class_="table-standings-body")
link = soup.find('div',class_='contentSection')
url_list = link.find('a').get('href')
for url_list in link.find_all('a'):
    print(url_list.get('href'))
But so far I have not been able to combine the two to scrape the data from the list of URLs. Can I only do this with nested for loops, and if so, how? Or how should I go about it?
I am sorry if this is a silly question, but I only started trying Python and web scraping yesterday, and I have not been able to work it out from similar topics.
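A nested loop is enough: take the hrefs from the second snippet and feed each one into the scrape from the first snippet. A minimal sketch of that combination (assuming the hrefs are absolute URLs, which the answer below relies on as well, and handling the "&sect" entity issue discussed there) might look like:

import requests
from bs4 import BeautifulSoup

list_url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
soup = BeautifulSoup(requests.get(list_url).text, "html.parser")

# outer loop: every link scraped from the overview page
for a in soup.find("div", class_="contentSection").find_all("a"):
    # html.parser decodes the literal "&section=" to "§ion=", so restore it
    page_url = a["href"].replace("§", "&sect")
    # inner scrape: same logic as the single-page snippet above
    page = BeautifulSoup(requests.get(page_url).text, "html.parser")
    rank = page.find("table", class_="table-standings-body")
    for child in rank.children:
        print(page_url, child)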
Try:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://data.ushja.org/awards-standings/zones.aspx?year=2021&zone=1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
# get all links
url_list = []
for a in soup.find("div", class_="contentSection").find_all("a"):
url_list.append(a["href"].replace("§", "§"))
# get all data from URLs
all_data = []
for url in url_list:
    print(url)
    req = requests.get(url)
    soup = BeautifulSoup(req.text, "html.parser")
    h2 = soup.h2
    sub = h2.find_next("p")
    for tr in soup.select("tr:has(td)"):
        all_data.append(
            [
                h2.get_text(strip=True),
                sub.get_text(strip=True),
                *[td.get_text(strip=True) for td in tr.select("td")],
            ]
        )
# save data to CSV
df = pd.DataFrame(
    all_data,
    columns=[
        "title",
        "sub_title",
        "Rank",
        "Horse / Owner",
        "Points",
        "Total Comps",
    ],
)
print(df)
df.to_csv("data.csv", index=None)
This loops over all the URLs and saves all of the data to data.csv
(screenshot from LibreOffice):
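A note on the replace("§", "&sect") call in the link-collection loop: Python's html.parser expands legacy named entities even without a trailing semicolon, so the literal "&section=" inside each href is decoded to "§ion=" when BeautifulSoup parses the page; the replace puts "&sect" back so the query string works again. A small demonstration, using the single-page URL from the question:

from bs4 import BeautifulSoup

html = '<a href="https://data.ushja.org/awards-standings/zone-points.aspx?year=2021&zone=1&section=1901">Zone 1</a>'
soup = BeautifulSoup(html, "html.parser")

# "&sect" is a legacy entity, so "&section=1901" is decoded to "§ion=1901"
print(soup.a["href"])
# restoring "&sect" rebuilds the original query string, "...&zone=1&section=1901"
print(soup.a["href"].replace("§", "&sect"))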