Scraping href using bs only returns the first link
I am trying to scrape a table using BeautifulSoup, and one of its columns can contain multiple links/hrefs, as in the example below.
<td class="column-6">
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> /
<a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>
I am using the code below to target them, but it only returns the first href, and nothing else for the rows that have multiple hrefs.
from time import sleep

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 3000)

# Scrape the smallcaps website for IPO information and save it into a dataframe
smallcaps_URL = "https://smallcaps.com.au/upcoming-ipos/"
service = Service(r"C:\Development\chromedriver_win32\chromedriver.exe")
driver = webdriver.Chrome(service=service)
driver.get(smallcaps_URL)
sleep(3)

# Dismiss the newsletter popup so it doesn't block the page
close_popup = driver.find_element(By.CLASS_NAME, "tve_ea_thrive_leads_form_close")
close_popup.click()

html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
all_ipo_header = soup.find_all("th")
all_ipo_content = soup.find_all("td")

ipo_headers = []
ipo_contents = []
for header in all_ipo_header:
    ipo_headers.append(header.text.replace(" ", "_"))
for content in all_ipo_content:
    if content.a:
        a = content.find('a', href=True)
        ipo_contents.append(a['href'])
    else:
        ipo_contents.append(content.text)

# Prints the complete scraped dataframe from the SmallCaps website
df = pd.DataFrame(np.reshape(ipo_contents, (-1, 6)), columns=ipo_headers)
print(df)

# Next thing to do is scrape a few other websites for comparison and remove duplicates.
Current output:
Company_name ASX_code Issue_price Raise Focus Information
0 Allup Silica (TBA) APS $0.20 m Silica sand https://allupsilica.com/
1 Andean Mining (14 Feb) ADM $0.20 m Mineral exploration https://smallcaps.com.au/andean-mining-ipo-col...
2 Catalano Seafood (24 Feb) CSF $0.20 m Seafood https://www.catalanos.net.au/
3 Dragonfly Biosciences (TBA) DRF $0.20 m Cannabidiol oil https://dragonflybiosciences.com/
4 Equity Story Group (18 Mar) EQS $0.20 .5m Market advice & research https://equitystory.com.au/
5 Far East Gold (TBA) FEG $0.20 m Mineral exploration https://smallcaps.com.au/far-east-gold-asx-ipo...
6 Killi Resources (10 Feb) KLI $0.20 m Gold and copper https://www.killi.com.au/
7 Lukin Resources (TBA) LKN $0.20 .5m Mineral exploration https://smallcaps.com.au/lukin-resources-launc...
8 Many Peaks Gold (2 Mar) MPG $0.20 .5m Mineral exploration https://manypeaks.com.au/
9 Norfolk Metals (14 Mar) NFL $0.20 .5m Gold and uranium https://norfolkmetals.com.au/
10 Omnia Metals Group (21 Feb) OM1 $0.20 .5m Mineral exploration https://www.omniametals.com.au/
11 Pure Resources (16 Mar) PR1 $0.20 .6m Mineral exploration http://www.pureresources.com.au/
12 Pinnacle Minerals (11 Mar) PIM $0.20 .5m Kaolin - Haloysite https://pinnacleminerals.com.au/
13 Stelar Metals (7 Mar) SLB $0.20 m Copper and zinc https://stelarmetals.com.au/
14 Top End Energy (21 Mar) TEE $0.20 .4m Oil and gas http://www.topendenergy.com.au/
15 US Student Housing REIT (TBA) USQ .38 m US student accommodation https://usq-reit.com/
Process finished with exit code 0
The expected output should have three links/hrefs for some rows in the 'Information' column; however, it only returns the first link/href for all of them. Could someone please guide me in the right direction?
a = content.find('a', href=True)

This should be a find_all, since there can be more than one match:

a = content.find_all('a', href=True)
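To see the difference in isolation, here is a minimal, self-contained sketch using the <td> markup from the question (the variable names are just for illustration): find() stops at the first matching anchor, while find_all() returns every match.

from bs4 import BeautifulSoup

# The <td> sample from the question, with three links in one cell
html = '''<td class="column-6">
<a href="https://smallcaps.com.au/andean-mining-ipo-colombia-exploration-high-grade-copper-gold-target/" rel="noopener noreferrer" target="_blank">Article</a> /
<a href="https://www.youtube.com/watch?v=Kgew7tuLWCg" rel="noopener noreferrer" target="_blank">Video</a> /
<a href="https://andeanmining.com.au/" rel="noopener noreferrer" target="_blank">Website</a></td>'''

td = BeautifulSoup(html, 'html.parser').td

print(td.find('a', href=True)['href'])                    # first link only
print([a['href'] for a in td.find_all('a', href=True)])   # all three links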
The following seems to work - it finds all the href items inside each cell, allowing for multiple hrefs where they are available.
for content in all_ipo_content:
    if content.a:
        # Collect every href in the cell rather than just the first one
        all_urls = [a.get("href") for a in content.find_all('a')]
        ipo_contents.append(all_urls)
    else:
        ipo_contents.append(content.text)
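One caveat worth noting (my addition, not part of the original answer): ipo_contents now mixes plain strings with lists, and depending on your numpy version, np.reshape may warn about or reject such mixed input. If you would rather keep one plain string per DataFrame cell, you could join the URLs before appending, for example inside the same if branch:

        # Hypothetical variation: one "/"-separated string per cell
        all_urls = " / ".join(a.get("href", "") for a in content.find_all('a'))
        ipo_contents.append(all_urls)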