Getting the content of a table on a website with Selenium and Python

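When I go to the URL in my code, I don't get the content of the "Synonyms" section. The code selects the elements, but it returns them as a list of elements without outputting their text content.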

synonyms= []
driver= webdriver.Chrome()
url = "https://pubchem.ncbi.nlm.nih.gov/compound/71308229"
driver.get(url)
synonym = driver.find_elements_by_class_name("overflow-x-auto")
synonyms.append(synonym)
driver.close()

You need to explicitly get the text of each element:

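from selenium import webdriver
from selenium.webdriver.common.by import By

synonyms = []
driver = webdriver.Chrome()
url = "https://pubchem.ncbi.nlm.nih.gov/compound/71308229"
driver.get(url)
# find_elements_by_class_name was removed in Selenium 4; use find_elements(By.CLASS_NAME, ...)
synonym = driver.find_elements(By.CLASS_NAME, "overflow-x-auto")
# .text returns the visible text of each matched element
synonyms.append([s.text for s in synonym])
print(synonyms)
driver.close()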

Output:

[['Lanthanum boride\n12008-21-8\nLanthanum hexaboride\nMFCD00151350\nB6La\nMore...', 'Lanthanum boride\n12008-21-8\nLanthanum hexaboride\nMFCD00151350\nB6La\nLanthanum Hexaboride Nanoparticles\nLanthanum boride, 99.5% (REO)\nIron Boride (FeB) Sputtering Targets\nFT-0693450\nLanthanum hexaboride, powder, 10 mum, 99%\nY1387\nLanthanum hexaboride LaB6 GRADE A (H?gan?s)\nLanthanum hexaboride, powder, -325 mesh, 99.5% metals basis\nLanthanum boride, powder, -325 mesh, 99.5% trace metals basis\nLine position and line shape standard for powder diffraction, NIST SRM 660c, Lanthanum hexaboride powder', 'Property Name Property Value Reference\nMolecular Weight 203.8 Computed by PubChem 2.1 (PubChem release 2021.05.07)\nHydrogen Bond Donor Count 0 Computed by Cactvs 3.4.8.18 (PubChem release 2021.05.07)\nHydrogen Bond Acceptor Count 2 Computed by Cactvs 3.4.8.18 (PubChem release 2021.05.07)\nRotatable Bond Count 0 Computed by Cactvs 3.4.8.18 (PubChem release 2021.05.07)\nExact Mass 203.965826 Computed by PubChem 2.1 (PubChem release 2021.05.07)\nMonoisotopic Mass 204.962194 Computed by PubChem 2.1 (PubChem release 2021.05.07)\nTopological Polar Surface Area 0 Ų Computed by Cactvs 3.4.8.18 (PubChem release 2021.05.07)\nHeavy Atom Count 7 Computed by PubChem\nFormal Charge -2 Computed by PubChem\nComplexity 132 Computed by Cactvs 3.4.8.18 (PubChem release 2021.05.07)\nIsotope Atom Count 0 Computed by PubChem\nDefined Atom Stereocenter Count 0 Computed by PubChem\nUndefined Atom Stereocenter Count 0 Computed by PubChem\nDefined Bond Stereocenter Count 0 Computed by PubChem\nUndefined Bond Stereocenter Count 0 Computed by PubChem\nCovalently-Bonded Unit Count 2 Computed by PubChem\nCompound Is Canonicalized Yes Computed by PubChem (release 2021.05.07)', 'Mixtures, Components, and Neutralized Forms 2 Records\nSimilar Compounds 2 Records', 'Same 25 Records']]
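
Each string in the output above holds a whole block of the page, with the individual entries separated by newlines. A minimal sketch of splitting one such string into separate synonyms follows. The helper name split_synonyms is hypothetical, and dropping the trailing "More..." entry assumes it is a link label rather than a real synonym.

```python
def split_synonyms(block_text):
    """Split one element's text into individual entries, one per line."""
    entries = [line.strip() for line in block_text.split("\n")]
    # Drop empty lines and the "More..." entry (assumed to be a link label).
    return [e for e in entries if e and e != "More..."]


# Sample string copied from the first item of the output above.
raw = "Lanthanum boride\n12008-21-8\nLanthanum hexaboride\nMFCD00151350\nB6La\nMore..."
synonyms = split_synonyms(raw)
print(synonyms)
```

The same helper could be applied to each s.text in the list comprehension if a flat list of names is wanted instead of one long string per block.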

  1. You are missing a wait/delay.
  2. You have to extract the text from the web elements.
  3. You appear to be using the wrong locator.

I think this will give you what you want:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

synonyms = []
driver = webdriver.Chrome()
url = "https://pubchem.ncbi.nlm.nih.gov/compound/71308229"
driver.get(url)
# Wait up to 20 seconds for the first matching paragraph to become visible
wait = WebDriverWait(driver, 20)
wait.until(EC.visibility_of_element_located((By.XPATH, "//div[@class='overflow-x-auto']//p")))
# Short extra pause so the remaining paragraphs can finish rendering
time.sleep(0.1)
# Selenium 4 syntax; find_elements_by_xpath was removed
elements = driver.find_elements(By.XPATH, "//div[@class='overflow-x-auto']//p")
for el in elements:
    synonyms.append(el.text)
driver.close()
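
To illustrate why the wait matters: an explicit wait keeps re-checking a condition until it succeeds or a timeout elapses, instead of looking at the page once right after it starts loading. The following is a simplified pure-Python sketch of that polling idea, not Selenium's actual implementation. The names wait_until and fake_lookup and the timing values are hypothetical and chosen only for illustration.

```python
import time


def wait_until(condition, timeout=20.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page whose content only appears on the third check.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return ["Lanthanum boride"] if attempts["count"] >= 3 else []


found = wait_until(fake_lookup, timeout=2.0, poll_interval=0.01)
print(found)
```

A single immediate lookup corresponds to calling fake_lookup() once, which returns an empty list here. That is roughly the situation when elements are searched for before the page has rendered them.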

To extract the content from the Synonyms table, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategy:

  • Using XPATH:

    driver.get("https://pubchem.ncbi.nlm.nih.gov/compound/71308229")
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//th[text()='Synonyms']//following::td[1]//p")))])
    
  • Console output:

    ['Lanthanum boride', '12008-21-8', 'Lanthanum hexaboride', 'MFCD00151350', 'B6La']
    
  • Note: you have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC