Problem in extracting values from HTML Page tables and saving in pandas dataframe using python

I am extracting information from here. I am able to extract the column names and all the values from the table; however, I am not sure how to save these values in a pandas dataframe.

For example - for the first column, 'SPAC', I get all the URLs together. These can be saved in a list, which in turn can be added to the dataframe. The main issue/challenge I am facing is with the rest of the values and columns. I get output like this:
SPAC
Target
Ticker
Announced
Deadline
TEV ($M)
TEV/IPO
Sector
Geography
Premium
Common
Warrant
Aldel Financial
Hagerty ADF 8/18/2021 4/13/2023 3,134 2698% Financial US/Canada -0.99% .00 .56
InterPrivate III Financial
Aspiration IPVF 8/18/2021 3/9/2023 1,943 751% Financial US/Canada -1.80% .82 .00

Here the first 12 entries are the column names, and the rest (from Aldel Financial onward) are the column values. I don't want to save 'Aldel Financial' itself in my dataframe at all, since I have already extracted its link. The rest, i.e. from Hagerty .... 1.56, are the values for the respective columns.

How can I save all of this into a pandas dataframe, given that they all share the same ID?
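One way to picture the target shape: keep one list per column, append one cell value per scraped row, then hand the dict of lists to pandas. A minimal sketch, with made-up sample rows standing in for the scraped values:

```python
import pandas as pd

# One list per column; each scraped row appends one value to every list.
columns = {"SPAC": [], "Target": [], "Ticker": []}

# Illustrative rows, taken from the table shown above.
rows = [
    ("Aldel Financial", "Hagerty", "ADF"),
    ("InterPrivate III Financial", "Aspiration", "IPVF"),
]
for spac, target, ticker in rows:
    columns["SPAC"].append(spac)
    columns["Target"].append(target)
    columns["Ticker"].append(ticker)

df = pd.DataFrame(columns)
print(df)
```

As long as every list ends up the same length, `pd.DataFrame` lines the lists up column by column.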

Here is my complete code.

import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()


def extraction():
    print("Program started Successfully!")
    websites = ["https://www.spacresearch.com/symbol?s=live-deal&sector=&geography="]
    data = []
    for live_deals in websites:
        browser.get(live_deals)

        wait = WebDriverWait(browser, 10)
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//input[@id='username'])[2]"))
        ).send_keys("kys")
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//input[@id='password'])[2]"))
        ).send_keys("pasd")
        wait.until(
            EC.element_to_be_clickable((By.XPATH, "(//button[text()='Next'])[2]"))
        ).click()

        time.sleep(2)
        spac = browser.find_elements_by_class_name("ellipsis")
        for all_spac in spac:
            links = all_spac.find_element_by_tag_name("a").get_attribute("href")
            print(links)

        target = browser.find_elements_by_id("companies-table-deal-announced")
        for all_targets in target:
            print(all_targets.text)
        print("Program Ended successfully!")


extraction()

Please help me understand how to save the extracted columns and values in a pandas dataframe. Thanks!

Edit: So I tried pandas' read_html() and got the following output:

                              SPAC              Target Ticker  Announced   Deadline  TEV ($M)  TEV/IPO  Sector  Geography  Premium  Common  Warrant
0                CA Healthcare Acq            LumiraDx   CAHC   4/7/2021  1/29/2023   30000.0      NaN     NaN        NaN      NaN     NaN      NaN
1  European Sustainable Growth Acq      ADS-TEC Energy   EUSG  8/11/2021  1/26/2023     356.0      NaN     NaN        NaN      NaN     NaN      NaN
2                  Longview Acq II           HeartFlow    LGV  7/15/2021  3/23/2023    2373.0      NaN     NaN        NaN      NaN     NaN      NaN
3                      D8 Holdings  Vicarious Surgical    DEH  4/15/2021  7/17/2022    1119.0      NaN     NaN        NaN      NaN     NaN      NaN
4                 Isos Acquisition             Bowlero   ISOS   7/1/2021   3/5/2023    2616.0      NaN     NaN        NaN      NaN     NaN      NaN

While this gives me the correct row and column values, it does not give the URLs for the first column, and many of the column values are NaN.

This is how I implemented it:

table = pd.read_html(live_deals)
df = table[0]
print(df.head())
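One likely cause of the NaNs: `pd.read_html(live_deals)` makes pandas download the URL itself, outside the logged-in Selenium session, so it sees the page as an anonymous visitor. A hedged sketch of parsing the authenticated page source instead; a literal HTML snippet stands in here for `browser.page_source`:

```python
from io import StringIO

import pandas as pd

# Stand-in for the real page: in the actual script you would use
#     page_source = browser.page_source
# from the already-logged-in Selenium driver.
page_source = """
<table>
  <tr><th>SPAC</th><th>Target</th><th>Ticker</th></tr>
  <tr><td>Aldel Financial</td><td>Hagerty</td><td>ADF</td></tr>
</table>
"""

# read_html returns a list of every table it finds; take the first one.
df = pd.read_html(StringIO(page_source))[0]
print(df)
```

This still won't recover the link URLs (read_html keeps only cell text), so the anchors would need a separate pass, as the answer below does with Selenium.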

This should work:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome("D:/chromedriver/94/chromedriver.exe")  # adjust to your chromedriver path


driver.get("https://www.spacresearch.com/symbol?s=live-deal&sector=&geography=")
wait = WebDriverWait(driver,15)
wait.until(EC.element_to_be_clickable((By.XPATH, "(//input[@id='username'])[2]"))).send_keys("youremail@email.com")
wait.until(EC.element_to_be_clickable((By.XPATH, "(//input[@id='password'])[2]"))).send_keys("YourPassword0101")
wait.until(EC.element_to_be_clickable((By.XPATH, "(//button[text()='Next'])[2]"))).click()

# One list per column; each table row appends one value to every list.
spac_data = {
    "SPAC": [],
    "SPAC_LINK": [],
    "Target": [],
    "Ticker": [],
    "Announced": [],
    "Deadline": [],
    "TEV ($M)": [],
    "TEV/IPO": [],
    "Sector": [],
    "Geography": [],
    "Premium": [],
    "Common": [],
    "Warrant": []
}


# Wait for the results table body to load, then walk its rows.
tbody = wait.until(EC.presence_of_element_located((By.TAG_NAME, 'tbody')))
rows = tbody.find_elements(By.TAG_NAME, 'tr')

for row in rows:
    cols = row.find_elements(By.TAG_NAME, 'td')
    # The first cell holds the SPAC name; its nested <a> carries the link.
    link = cols[0].find_element(By.TAG_NAME, 'div').find_element(By.TAG_NAME, 'a').get_attribute('href')

    spac_data["SPAC"].append(cols[0].text)
    spac_data["SPAC_LINK"].append(link)
    spac_data["Target"].append(cols[1].text)
    spac_data["Ticker"].append(cols[2].text)
    spac_data["Announced"].append(cols[3].text)
    spac_data["Deadline"].append(cols[4].text)
    spac_data["TEV ($M)"].append(cols[5].text)
    spac_data["TEV/IPO"].append(cols[6].text)
    spac_data["Sector"].append(cols[7].text)
    spac_data["Geography"].append(cols[8].text)
    spac_data["Premium"].append(cols[9].text)
    spac_data["Common"].append(cols[10].text)
    spac_data["Warrant"].append(cols[11].text)

df = pd.DataFrame.from_dict(spac_data)
print(df)
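Note that every cell scraped this way comes back as a string ("3,134", "2698%", "-0.99%"), so numeric columns need cleaning before any arithmetic. A small sketch using the same column names as the answer's dataframe, with illustrative values:

```python
import pandas as pd

# Illustrative strings in the shape the scrape produces.
df = pd.DataFrame({
    "TEV ($M)": ["3,134", "1,943"],
    "TEV/IPO": ["2698%", "751%"],
    "Premium": ["-0.99%", "-1.80%"],
})

# Drop thousands separators, then convert to numbers.
df["TEV ($M)"] = pd.to_numeric(df["TEV ($M)"].str.replace(",", "", regex=False))

# Strip the trailing % and scale percentages to fractions.
for col in ("TEV/IPO", "Premium"):
    df[col] = pd.to_numeric(df[col].str.rstrip("%")) / 100

print(df.dtypes)
```

Doing this once after `pd.DataFrame.from_dict` is usually simpler than converting cell by cell inside the scraping loop.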