Scraping href using bs only returns the first link
I am trying to scrape a table using BeautifulSoup, and one of its columns can contain multiple links/hrefs, as in the example below.
<td class="column-6">
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> /
<a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>
I am using the code below to target them, but it only returns the first href, and nothing else for the rows that have multiple hrefs.
from time import sleep

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 3000)

# Scrape the smallcaps website for IPO information and save it into a dataframe
smallcaps_URL = "https://smallcaps.com.au/upcoming-ipos/"
service = Service(r"C:\Development\chromedriver_win32\chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.get(smallcaps_URL)
sleep(3)

# Dismiss the newsletter popup so it doesn't block the page
close_popup = driver.find_element(By.CLASS_NAME, "tve_ea_thrive_leads_form_close")
close_popup.click()

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
all_ipo_header = soup.find_all("th")
all_ipo_content = soup.find_all("td")

ipo_headers = []
ipo_contents = []
for header in all_ipo_header:
    ipo_headers.append(header.text.replace(" ", "_"))
for content in all_ipo_content:
    if content.a:
        a = content.find('a', href=True)
        ipo_contents.append(a['href'])
    else:
        ipo_contents.append(content.text)

# Prints the complete scraped dataframe from the SmallCaps website
df = pd.DataFrame(np.reshape(ipo_contents, (-1, 6)), columns=ipo_headers)
print(df)

# Next thing to do is scrape a few other websites for comparison and remove duplicates.
Current output:
Company_name ASX_code Issue_price Raise Focus Information
0 Allup Silica (TBA) APS $0.20 m Silica sand https://allupsilica.com/
1 Andean Mining (14 Feb) ADM $0.20 m Mineral exploration https://smallcaps.com.au/andean-mining-ipo-col...
2 Catalano Seafood (24 Feb) CSF $0.20 m Seafood https://www.catalanos.net.au/
3 Dragonfly Biosciences (TBA) DRF $0.20 m Cannabidiol oil https://dragonflybiosciences.com/
4 Equity Story Group (18 Mar) EQS $0.20 .5m Market advice & research https://equitystory.com.au/
5 Far East Gold (TBA) FEG $0.20 m Mineral exploration https://smallcaps.com.au/far-east-gold-asx-ipo...
6 Killi Resources (10 Feb) KLI $0.20 m Gold and copper https://www.killi.com.au/
7 Lukin Resources (TBA) LKN $0.20 .5m Mineral exploration https://smallcaps.com.au/lukin-resources-launc...
8 Many Peaks Gold (2 Mar) MPG $0.20 .5m Mineral exploration https://manypeaks.com.au/
9 Norfolk Metals (14 Mar) NFL $0.20 .5m Gold and uranium https://norfolkmetals.com.au/
10 Omnia Metals Group (21 Feb) OM1 $0.20 .5m Mineral exploration https://www.omniametals.com.au/
11 Pure Resources (16 Mar) PR1 $0.20 .6m Mineral exploration http://www.pureresources.com.au/
12 Pinnacle Minerals (11 Mar) PIM $0.20 .5m Kaolin - Haloysite https://pinnacleminerals.com.au/
13 Stelar Metals (7 Mar) SLB $0.20 m Copper and zinc https://stelarmetals.com.au/
14 Top End Energy (21 Mar) TEE $0.20 .4m Oil and gas http://www.topendenergy.com.au/
15 US Student Housing REIT (TBA) USQ .38 m US student accommodation https://usq-reit.com/
Process finished with exit code 0
The expected output should have three links/hrefs for some rows in the 'Information' column; however, it only returns the first link/href for all of them. Could someone please guide me in the right direction?
a = content.find('a', href=True)

This should be a find_all, since there can be more than one match:

a = content.find_all('a', href=True)
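To see the difference in isolation, here is a minimal, self-contained sketch using the <td> markup from the question (the variable names are just for illustration): find() stops at the first matching anchor, while find_all() returns every match.

from bs4 import BeautifulSoup

# The <td> sample from the question, with three links in one cell
html = '''<td class="column-6">
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> /
<a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>'''

td = BeautifulSoup(html, 'html.parser').td

print(td.find('a', href=True)['href'])                    # first link only
print([a['href'] for a in td.find_all('a', href=True)])   # all three links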
The following seems to work - it finds all the href items inside each cell, allowing for multiple hrefs where they are available.
for content in all_ipo_content:
    if content.a:
        # Collect every href in the cell rather than just the first one
        all_urls = [a.get("href") for a in content.find_all('a')]
        ipo_contents.append(all_urls)
    else:
        ipo_contents.append(content.text)
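One caveat worth noting (my addition, not part of the original answer): ipo_contents now mixes plain strings with lists, and depending on your numpy version, np.reshape may warn about or reject such mixed input. If you would rather keep one plain string per DataFrame cell, you could join the URLs before appending, for example inside the same if branch:

        # Hypothetical variation: one "/"-separated string per cell
        all_urls = " / ".join(a.get("href", "") for a in content.find_all('a'))
        ipo_contents.append(all_urls)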