Using following-sibling to access divs within following-sibling

I am trying to extract information from this URL:

https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event

I want to extract the text "Hot 8 Brass Band are a Grammy-nominated New Orleans based brass band, whose sound... "

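My approach: I want to extract the information without using explicit div names (because these tend to change). So I use a variable to identify "About Hot 8 Brass Band", and then I want to access the following siblings, child divs, and so on.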

Code:

url = "https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event"

driver.get(url)


#Get artist
try:
    artist = driver.find_elements_by_css_selector('a[href^="https://www.bandsintown.com/a/"] h1')
    artist = artist[0].text
    print(artist)
except (ElementNotVisibleException, NoSuchElementException, TimeoutException):
    print ("artist doesn't exist")



#Get Bio Info
try:
    readMoreBio = driver.find_element_by_xpath("//div[text()='Read More']").click()
    print("Read More Bio Clicked")
except (ElementNotVisibleException, NoSuchElementException, TimeoutException):
    pass



#Once read more clicked, get full bio info
try:
    artistBioDiv = driver.find_elements_by_xpath("(//div[text()='About " + artist + "'])[0]/following-sibling/following-sibling::div")
    print("artistBioDiv is: ", artistBioDiv)

except (ElementNotVisibleException, NoSuchElementException, TimeoutException):
    print ("artist bio doesn't exist")

This seems to access an empty element, i.e. it does not find the bio paragraph.

This is the HTML structure:

I think the problem is with the XPath you are using to find the bio.
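
Two details of the XPath in the question are worth spelling out. XPath positions start at 1, so `[0]` never matches anything. An axis also has to be written as `axis::node-test`, so a bare `/following-sibling/` step is read as a child element literally named `following-sibling`. The sketch below only builds a corrected expression as a string; the helper names `xpath_literal` and `bio_xpath` are made up for illustration, and the steps needed after the sibling div depend on the page's actual markup.

```python
def xpath_literal(value: str) -> str:
    """Return value as an XPath 1.0 string literal, whichever quotes it contains."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types present: join the single-quote-free pieces with concat().
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{p}'" for p in parts) + ")"


def bio_xpath(artist: str) -> str:
    """Build an XPath for the first div sibling after the 'About <artist>' div."""
    heading = xpath_literal(f"About {artist}")
    # [1] selects the first match (XPath counts from 1), and the axis is
    # written with '::' followed by a node test.
    return f"(//div[normalize-space()={heading}])[1]/following-sibling::div[1]"


print(bio_xpath("Hot 8 Brass Band"))
```

Quoting the artist name through a helper also keeps the expression valid when the name contains an apostrophe, which plain string concatenation would not.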

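Some things you can consider for future projects:

  • Use driver.find_element(By.CSS_SELECTOR, 'CSS_SELECTOR_GOES_HERE') or driver.find_element(By.XPATH, 'XPATH_GOES_HERE'), because find_elements_by_xpath and find_elements_by_css_selector are deprecated
  • Use WebDriverWait to give elements enough time to load
  • You can also use normalize-space() when matching text in an XPath, since it takes care of leading and trailing whitespace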
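
To illustrate the last point, here is a small pure-Python approximation of what normalize-space() does to a string. It is not part of Selenium; the function name `normalize_space` and the sample string are assumptions chosen for the demonstration.

```python
import re


def normalize_space(text: str) -> str:
    """Approximate XPath normalize-space(): collapse whitespace runs, then trim."""
    # XPath 1.0 treats space, tab, carriage return and line feed as whitespace.
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


# Text as it might appear in the page source, with indentation around it.
raw = "\n      Read More  \n"

print(raw == "Read More")                   # False: exact comparison fails
print(normalize_space(raw) == "Read More")  # True: comparison after normalizing
```

An exact match such as text()='Read More' behaves like the first comparison, which is why it can miss an element whose markup is indented.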

This code should work for you:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.chrome.options import Options
from time import sleep


options = Options()
options.add_argument("--disable-notifications")

driver = webdriver.Chrome(executable_path='D://chromedriver/100/chromedriver.exe', options=options)
wait = WebDriverWait(driver, 20)

url = "https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event"

driver.get(url)

try:
    # with xpath
    # artist = wait.until(EC.presence_of_element_located((By.XPATH, '//h1[contains(@href, "https://www.bandsintown.com/a")]'))).text
    artist = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'h1[href^="https://www.bandsintown.com/a/"]'))).text
    
    # read more
    wait.until(EC.presence_of_element_located((By.XPATH, '//div[normalize-space()="Read More"]'))).click()
    
    # bio
    bio = wait.until(EC.presence_of_element_located((By.XPATH, f'//div[normalize-space()="About {artist}"]/following-sibling::div/div[2]/div'))).text
    print(f'Artist: {artist}\nBio:\n{bio}')
except Exception as ex:
    print(f"Error: {ex}")
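
To see what a path shaped like `following-sibling::div/div[2]/div` selects, the walk can be reproduced offline. The sketch below uses only the standard library on a hypothetical fragment; the markup is invented for illustration and is not copied from the real page. ElementTree does not support the following-sibling axis, so `following_siblings` is a hand-written stand-in for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, NOT the real page structure.
SNIPPET = """
<section>
  <div>Lineup</div>
  <div>About Hot 8 Brass Band</div>
  <div><div>Genres</div><div><div>Hot 8 Brass Band are a brass band.</div></div></div>
  <div>Footer</div>
</section>
"""


def following_siblings(root, node, tag):
    """Return the siblings that come after node and have the given tag."""
    for parent in root.iter():
        children = list(parent)
        if node in children:
            return [c for c in children[children.index(node) + 1:] if c.tag == tag]
    return []


root = ET.fromstring(SNIPPET)
# Locate the heading div by its trimmed text.
heading = next(
    d for d in root.iter("div") if (d.text or "").strip() == "About Hot 8 Brass Band"
)
siblings = following_siblings(root, heading, "div")
# From the first following sibling, take the second child div, then its child div.
bio = siblings[0].find("div[2]/div")
print(len(siblings), bio.text)
```

Only divs that share a parent with the heading and come after it are candidates, so the path stays valid as long as the bio container remains a later sibling of the "About ..." div.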

To extract the text "Hot 8 Brass Band are a Grammy-nominated New Orleans based brass band, whose sound..." you can use the following:

  • Using XPATH and the text attribute:

    driver.get("https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event")
    print(driver.find_element(By.XPATH, "//div[@id='main']//div[text()='About Hot 8 Brass Band']//following-sibling::div[1]//div/div[contains(., 'Hot 8 Brass Band')]").text)
    

Ideally, you need to induce WebDriverWait for visibility_of_element_located(), and you can use the following:

  • Using XPATH and get_attribute("innerHTML"):

    driver.get("https://www.bandsintown.com/e/1024477910-hot-8-brass-band-at-the-howlin'-wolf?came_from=253&utm_medium=web&utm_source=city_page&utm_campaign=event")
    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@id='main']//div[text()='About Hot 8 Brass Band']//following-sibling::div[1]//div/div[contains(., 'Hot 8 Brass Band')]"))).get_attribute("innerHTML"))
    
  • Console output:

    Hot 8 Brass Band are a Grammy-nominated New Orleans based brass band, whose sound draws on the traditional jazz heritage of New Orleans, alongside more modern styles incl...
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
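
One difference between the two variants: the text attribute returns the element's rendered text, while get_attribute("innerHTML") returns the markup inside the element as a string, including any nested tags. If the bio ever contains inline tags, the innerHTML value can be reduced to plain text with the standard library. This is a minimal sketch; the class `TextExtractor`, the function `inner_html_to_text` and the sample string are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def inner_html_to_text(fragment: str) -> str:
    """Strip tags from an innerHTML string and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical innerHTML value containing an inline tag.
sample = "Hot 8 Brass Band are a <b>Grammy-nominated</b> brass band."
print(inner_html_to_text(sample))
```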
    

You can find a relevant discussion in