Problem in extracting values from HTML Page tables and saving in pandas dataframe using python
I am extracting information from here. I am able to extract the column names and all the values from the table; however, I am not sure how to save these values in a pandas DataFrame.
For example, for the first column 'SPAC' I get all the URLs together. These can be saved in a list, which in turn can be added to the DataFrame. The main issue/challenge I am facing is the remaining values and columns. I get output like this:
SPAC
Target
Ticker
Announced
Deadline
TEV ($M)
TEV/IPO
Sector
Geography
Premium
Common
Warrant
Aldel Financial
Hagerty ADF 8/18/2021 4/13/2023 3,134 2698% Financial US/Canada -0.99% .00 .56
InterPrivate III Financial
Aspiration IPVF 8/18/2021 3/9/2023 1,943 751% Financial US/Canada -1.80% .82 .00
Here the first 12 lines are the column names, and the rest (starting from Aldel Financial) are the column values. I do not want to save Aldel Financial itself in my DataFrame at all, since I have already extracted its link. The rest, i.e. from Hagerty through 1.56, are the values for the respective columns.
How do I save this into a pandas DataFrame, given that all of these elements share the same ID?
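For reference, a minimal sketch of what splitting that text dump back into rows could look like, assuming the printed output is held in a hypothetical variable table_text, with 12 header lines followed by pairs of lines (SPAC name, then the remaining values):
# Hypothetical parsing of the combined .text dump shown above.
# table_text is assumed to hold the printed output; the first 12 lines
# are headers, then each record is a pair of lines.
lines = table_text.splitlines()
headers, body = lines[:12], lines[12:]

records = []
for name_line, value_line in zip(body[::2], body[1::2]):
    # Skip the SPAC name line (its link is collected separately) and
    # split the value line on whitespace. Note this breaks for targets
    # with multi-word names, which is why parsing the td cells directly
    # (as in the answer below) is more robust.
    records.append(value_line.split())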
Here is my full code.
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()  # driver setup not shown in the original; plain Chrome assumed

def extraction():
    print("Program started Successfully!")
    websites = ["https://www.spacresearch.com/symbol?s=live-deal&sector=&geography="]
    data = []
    for live_deals in websites:
        browser.get(live_deals)
        wait = WebDriverWait(browser, 10)
        # Log in first; the table is only rendered for authenticated users.
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//input[@id='username'])[2]"))
        ).send_keys("kys")
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//input[@id='password'])[2]"))
        ).send_keys("pasd")
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//button[text()='Next'])[2]"))
        ).click()
        time.sleep(2)
        # Collect the link from each SPAC cell.
        spac = browser.find_elements_by_class_name("ellipsis")
        for all_spac in spac:
            links = all_spac.find_element_by_tag_name("a").get_attribute("href")
            print(links)
        # This prints the whole table's text as one blob.
        target = browser.find_elements_by_id("companies-table-deal-announced")
        for all_targets in target:
            print(all_targets.text)
    print("Program Ended successfully!")

extraction()
Please help me understand how to save the extracted columns and values in a pandas DataFrame. Thanks!
Edit: So I tried using pandas' read_html() and got the following output:
SPAC Target Ticker Announced Deadline TEV ($M) TEV/IPO Sector Geography Premium Common Warrant
0 CA Healthcare Acq LumiraDx CAHC 4/7/2021 1/29/2023 30000.0 NaN NaN NaN NaN NaN NaN
1 European Sustainable Growth Acq ADS-TEC Energy EUSG 8/11/2021 1/26/2023 356.0 NaN NaN NaN NaN NaN NaN
2 Longview Acq II HeartFlow LGV 7/15/2021 3/23/2023 2373.0 NaN NaN NaN NaN NaN NaN
3 D8 Holdings Vicarious Surgical DEH 4/15/2021 7/17/2022 1119.0 NaN NaN NaN NaN NaN NaN
4 Isos Acquisition Bowlero ISOS 7/1/2021 3/5/2023 2616.0 NaN NaN NaN NaN NaN NaN
While this gives me the values of the rows and columns correctly, it does not give the URLs from the first column, and many of the column values are NaN.
I implemented it like this:
table = pd.read_html(live_deals)
df = table[0]
print(df.head())
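One likely reason for the NaN values is that read_html() fetches the URL in a fresh, anonymous request, so it only sees whatever the page renders without a login. A sketch of a possible workaround, assuming the Selenium login above has already run: parse the authenticated page source instead, and attach the links collected separately (the one-link-per-row assumption comes from the table layout):
# Parse the table from the logged-in Selenium session's page source
# instead of re-fetching the URL anonymously.
table = pd.read_html(browser.page_source)
df = table[0]

# read_html() drops hrefs, so collect them separately and attach them.
links = [
    cell.find_element_by_tag_name("a").get_attribute("href")
    for cell in browser.find_elements_by_class_name("ellipsis")
]
df["SPAC_LINK"] = links  # assumes one "ellipsis" cell per table row
print(df.head())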
This should work:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome("D:/chromedriver/94/chromedriver.exe")
driver.get("https://www.spacresearch.com/symbol?s=live-deal&sector=&geography=")

wait = WebDriverWait(driver, 15)
wait.until(EC.element_to_be_clickable((By.XPATH, "(//input[@id='username'])[2]"))).send_keys("youremail@email.com")
wait.until(EC.element_to_be_clickable((By.XPATH, "(//input[@id='password'])[2]"))).send_keys("YourPassword0101")
wait.until(EC.element_to_be_clickable((By.XPATH, "(//button[text()='Next'])[2]"))).click()

# One list per column; each row appends one value to every list.
spac_data = {
    "SPAC": [],
    "SPAC_LINK": [],
    "Target": [],
    "Ticker": [],
    "Announced": [],
    "Deadline": [],
    "TEV ($M)": [],
    "TEV/IPO": [],
    "Sector": [],
    "Geography": [],
    "Premium": [],
    "Common": [],
    "Warrant": []
}

tbody = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'tbody')))
rows = tbody.find_elements(By.TAG_NAME, 'tr')
for row in rows:
    # Walk the cells of each row instead of reading the table's text in one blob.
    cols = row.find_elements(By.TAG_NAME, 'td')
    link = cols[0].find_element(By.TAG_NAME, 'div').find_element(By.TAG_NAME, 'a').get_attribute('href')
    spac_data["SPAC"].append(cols[0].text)
    spac_data["SPAC_LINK"].append(link)
    spac_data["Target"].append(cols[1].text)
    spac_data["Ticker"].append(cols[2].text)
    spac_data["Announced"].append(cols[3].text)
    spac_data["Deadline"].append(cols[4].text)
    spac_data["TEV ($M)"].append(cols[5].text)
    spac_data["TEV/IPO"].append(cols[6].text)
    spac_data["Sector"].append(cols[7].text)
    spac_data["Geography"].append(cols[8].text)
    spac_data["Premium"].append(cols[9].text)
    spac_data["Common"].append(cols[10].text)
    spac_data["Warrant"].append(cols[11].text)

df = pd.DataFrame.from_dict(spac_data)
print(df)
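Note that every value lands in the DataFrame as a string. If you need numeric columns, a possible follow-up step (the exact formats here are assumptions based on the sample output above):
# Optional post-processing: turn text columns into numbers.
# Assumes formats like "3,134", "2698%" and "-0.99%" as in the sample output.
df["TEV ($M)"] = pd.to_numeric(df["TEV ($M)"].str.replace(",", ""), errors="coerce")
for pct_col in ["TEV/IPO", "Premium"]:
    df[pct_col] = pd.to_numeric(df[pct_col].str.rstrip("%"), errors="coerce")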