How to distinguish two tables with the same relative XPATH with Selenium in Python

Question

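I'm trying to scrape some data from IMDb (with Selenium in Python), but I have a problem. For each movie I have to fetch the directors and the writers. Both are contained in two tables that have the same @class. I need to distinguish the two tables when scraping; otherwise the program sometimes takes a writer for a director, and vice versa.

I tried to use a relative XPATH to find all elements (tables) with that xpath and then loop over them, trying to distinguish them by the table title (which is an h4 element) and the preceding-sibling function. The code runs, but it doesn't find anything (it returns nan every time).

Here is my code: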

    counter = 1
    try:
        driver.get('https://www.imdb.com/title/' + tt + '/fullcredits/?ref_=tt_cl_sm')
        ssleep()
        tables = driver.find_elements(By.XPATH, '//table[@class="simpleTable simpleCreditsTable"]/tbody')
        counter = 1
        for table in tables:
            xpath_table = f'//table[@class="simpleTable simpleCreditsTable"]/tbody[{counter}]' 
            xpath_h4 = xpath_table + "/preceding-sibling::h4[1]/text()"
            table_title = driver.find_element(By.XPATH, xpath_h4).text
            if table_title == "Directed by":
                rows_director = table.find_elements(By.CSS_SELECTOR, 'tr')
                for row in rows_director:
                    director = row.find_elements(By.CSS_SELECTOR, 'a')
                    director = [x.text for x in director]
                    if len(director) == 1:
                        director = ''.join(map(str, director))
                    else:
                        director = ', '.join(map(str, director))
                        director_list.append(director)
        counter += 1

    except NoSuchElementException:
        # director = np.nan
        director_list.append(np.nan)

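Can any of you tell me why it doesn't work? Maybe there is a better solution. I hope you can help.

(Here you can find an example of the page I need to scrape: https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm)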

Answer 1

You can use the id attribute of the h4 tags for the Directors and Writers sections (director and writer) to extract the data.

Try it like this:

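# Imports Required
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Assumption: a local Chrome installation; any other WebDriver works the same way.
driver = webdriver.Chrome()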

links = ["https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm","https://www.imdb.com/title/tt10234724/fullcredits/?ref_=tt_cl_sm",
         "https://www.imdb.com/title/tt10872600/fullcredits?ref_=tt_cl_wr_sm","https://www.imdb.com/title/tt1160419/fullcredits?ref_=tt_cl_wr_sm"]

for link in links:
    driver.get(link)
    wait = WebDriverWait(driver,20)
    
    # Get the name of the movie
    name = wait.until(EC.presence_of_element_located((By.XPATH,"//h3[@itemprop='name']/a"))).text
    
    # Get the Directors
    directors = driver.find_elements(By.XPATH,"//h4[@id='director']/following-sibling::table[1]//tr")
    dir_list = []
    for director in directors:
        # Add the director names in the list. You can format the unwanted string using replace.
        dir_list.append(director.text)

    # Get the Writers
    writers = driver.find_elements(By.XPATH,"//h4[@id='writer']/following-sibling::table[1]//tr")
    wri_list = []
    for writer in writers:
        # Add the Writer names in the list. You can format the unwanted string using replace.
        wri_list.append(writer.text)

    # Print the data.
    print(f"Name of the movie: {name}")
    print(f"Directors : {dir_list}")
    print(f"Writers : {wri_list}")

Output:

Name of the movie: The Batman
Directors : ['Matt Reeves ... (directed by)']
Writers : ['Matt Reeves ... (written by) &', 'Peter Craig ... (written by)', ' ', 'Bill Finger ... (Batman created by) &', 'Bob Kane ... (Batman created by)']
Name of the movie: Moon Knight
Directors : ['Justin Benson ... (5 episodes, 2022)', 'Mohamed Diab ... (5 episodes, 2022)', 'Aaron Moorhead ... (5 episodes, 2022)']
Writers : ['Danielle Iman ... (staff writer) (6 episodes, 2022)', 'Doug Moench ... (characters) (6 episodes, 2022)', 'Doug Moench ... (creator) (6 episodes, 2022)', 'Don Perlin ... (characters) (6 episodes, 2022)', 'Jeremy Slater ... (created for television by) (6 episodes, 2022)', 'Jeremy Slater ... (6 episodes, 2022)', 'Peter Cameron ... (written by) (2 episodes, 2022)', 'Sabir Pirzada ... (written by) (2 episodes, 2022)', 'Beau DeMayo ... (written by) (1 episode, 2022)', 'Michael Kastelein ... (written by) (1 episode, 2022)', 'Alex Meenehan ... (written by) (1 episode, 2022)', 'Jack Kirby ... (Based on the Marvel comics by) (unknown episodes)', 'Stan Lee ... (Based on the Marvel comics by) (unknown episodes)']
Name of the movie: Spider-Man: No Way Home
Directors : ['Jon Watts']
Writers : ['Chris McKenna ... (written by) &', 'Erik Sommers ... (written by)', ' ', 'Stan Lee ... (based on the Marvel comic book by) and', 'Steve Ditko ... (based on the Marvel comic book by)']
Name of the movie: Dune
Directors : ['Denis Villeneuve ... (directed by)']
Writers : ['Jon Spaihts ... (screenplay by) and', 'Denis Villeneuve ... (screenplay by) and', 'Eric Roth ... (screenplay by)', ' ', 'Frank Herbert ... (based on the novel Dune written by)']
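The rows above still carry the role text and a few blank separator rows. As a follow-up to the "format the unwanted string" comments in the code, here is a minimal sketch of one way to reduce them to plain names. The helper name clean_credit_rows is hypothetical, and the sketch assumes every non-blank row looks like 'Name ... (role)' or just 'Name', as in the output above.

```python
def clean_credit_rows(rows):
    """Return the person names found in raw credit row texts.

    Assumes each non-blank row is 'Name ... (role)' or just 'Name'.
    """
    names = []
    for row in rows:
        # Keep only the part before the ' ... ' separator, if there is one.
        name = row.split(" ... ", 1)[0].strip()
        # Skip blank separator rows and names that were already collected.
        if name and name not in names:
            names.append(name)
    return names


writers = ['Matt Reeves ... (written by) &', 'Peter Craig ... (written by)', ' ',
           'Bill Finger ... (Batman created by) &', 'Bob Kane ... (Batman created by)']
print(clean_credit_rows(writers))
# ['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
```

Dropping repeated names is a choice made for this sketch (a person can be listed once per role, as in the Moon Knight rows); remove the `name not in names` check if every row should be kept.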

Answer 2

To extract the names of the directors and writers of each movie on imdb.com, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

Using CSS_SELECTOR:

driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#director +table > tbody tr > td > a")))])
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#writer +table > tbody tr > td > a")))])

Using XPATH:

driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='director']//following::table[1]/tbody//tr/td/a")))])
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='writer']//following::table[1]/tbody//tr/td/a")))])

Console output:

['Matt Reeves']
['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']

Note: you have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
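
As for why the loop in the question finds nothing: the locators above start at the h4 and step forward to the table, while the question's XPath starts at a tbody and looks backward. Two things likely go wrong there. A positional predicate such as tbody[2] counts among the siblings under the same parent, so if each table holds a single tbody it matches nothing. And preceding-sibling::h4 evaluated from a tbody cannot reach an h4 that sits next to the table rather than next to the tbody. Separately, a locator ending in /text() selects a text node, which Selenium cannot return as an element. The sketch below illustrates the first two points with the standard library's xml.etree.ElementTree, which supports only a subset of XPath; the markup is a simplified, hypothetical stand-in, and the real IMDb markup may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the credits markup: every h4 is
# followed by a table, and every table holds exactly one tbody.
PAGE = """
<div>
  <h4 id="director">Directed by</h4>
  <table class="simpleTable simpleCreditsTable">
    <tbody><tr><td><a>Matt Reeves</a></td></tr></tbody>
  </table>
  <h4 id="writer">Writing Credits</h4>
  <table class="simpleTable simpleCreditsTable">
    <tbody><tr><td><a>Peter Craig</a></td></tr></tbody>
  </table>
</div>
"""

root = ET.fromstring(PAGE.strip())
base = ".//table[@class='simpleTable simpleCreditsTable']/tbody"

# [n] is evaluated per parent: "the n-th tbody inside its own table".
first = root.findall(base + "[1]")   # one match per table
second = root.findall(base + "[2]")  # no table has a second tbody
print(len(first), len(second))

# The h4 headings are siblings of the tables, not of the tbody elements.
print([child.tag for child in root])
print([child.tag for child in root.find("table")])
```

In Selenium terms, this suggests locating the tables themselves (without the trailing /tbody) and asking each one for its own heading, for example with table.find_element(By.XPATH, "./preceding-sibling::h4[1]").text, instead of rebuilding an absolute XPath with a counter.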

Answer 3

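Since it is static page content, you don't even need Selenium. You can use the lightweight Python requests module and BeautifulSoup (bs4). It is just another approach.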

import requests
from bs4 import BeautifulSoup

res=requests.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm")
result=res.text
soup=BeautifulSoup(result, 'html.parser')
directors=[director.text.strip() for director in soup.select("h4#director+table tr td.name>a")]
writers=[writer.text.strip() for writer in soup.select("h4#writer+table tr td.name>a")]

print(directors)
print(writers)

Output:

['Matt Reeves']
['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
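
If installing bs4 is not an option, the same "table that follows a given h4" idea can be sketched with html.parser from the standard library. This is a swapped-in technique, not the selector approach above. The class name CreditsParser and the sample markup are hypothetical, and unlike the td.name>a selector it collects every link inside the table.

```python
from html.parser import HTMLParser


class CreditsParser(HTMLParser):
    """Collect link texts from the first table after each <h4 id=...>."""

    def __init__(self):
        super().__init__()
        self.credits = {}      # h4 id -> list of link texts
        self.section = None    # id of the most recently seen h4
        self.in_table = False
        self.in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self.section = dict(attrs).get("id")
        elif tag == "table" and self.section:
            self.in_table = True
            self.credits.setdefault(self.section, [])
        elif tag == "a" and self.in_table:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "table":
            # Only the first table after an h4 belongs to that heading.
            self.in_table = False
            self.in_link = False
            self.section = None
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_link and text:
            self.credits[self.section].append(text)


# Hypothetical sample shaped like the credits page.
SAMPLE = """
<h4 id="director">Directed by</h4>
<table><tbody><tr><td class="name"><a href="#"> Matt Reeves</a></td></tr></tbody></table>
<h4 id="writer">Writing Credits</h4>
<table><tbody>
<tr><td class="name"><a href="#"> Peter Craig</a></td></tr>
<tr><td class="name"><a href="#"> Bob Kane</a></td></tr>
</tbody></table>
"""

parser = CreditsParser()
parser.feed(SAMPLE)
parser.close()
print(parser.credits)
# {'director': ['Matt Reeves'], 'writer': ['Peter Craig', 'Bob Kane']}
```

To run it against a real page, the res.text value from the requests call above could be passed to parser.feed() in place of SAMPLE.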


python

selenium

imdb

web-scraping