Selenium/BeautifulSoup - WebScrape this field

Question

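My code runs fine and prints the titles of all rows, except for the rows that have a dropdown.

For example, row 4 opens a dropdown when clicked. I implemented a 'try' which, in theory, clicks the dropdown and then pulls the titles.

But when I call click() and then try to print, nothing is printed for the rows that have these dropdowns.

Expected output: print all titles, including the ones inside the dropdowns.

A user submitted an answer to this problem (link), but his answer uses a different format, and with his method I don't know how to add fields such as date, time, and chairs, or the field at the top that reads "On-demand".

Any approach would be greatly appreciated. I would like to put the result into a dataframe. Thanks.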

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome()
actions = ActionChains(driver)

driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')

new_titles = set()

productlist=driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    actions.move_to_element_with_offset(property,0,0).perform()
    time.sleep(4.5)
    sessiontitle=property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    #print(sessiontitle)
    ifDropdown=property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if(ifDropdown):
        ifDropdown[0].click()
        time.sleep(4)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if(title not in new_titles):
                print(title)
                new_titles.add(title)

Answer 1

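Your problem is with the driver.find_elements_by_class_name('item-expand-action expand') command. The locator find_elements_by_class_name('item-expand-action expand') is wrong: these web elements have multiple class names, and a class-name locator accepts only a single class name. To locate such elements, you can use a css_selector or XPath.
Also, since there are several elements with dropdowns, you have to iterate over them to click each one. You cannot call .click() on a list of web elements.
So your code should look like this: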

ifDropdown=driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)

Instead of the css_selector above, you can also use XPath:

ifDropdown=driver.find_elements_by_xpath('//a[@class="item-expand-action expand"]')
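
To illustrate why the compound class name has to be rewritten, here is a minimal sketch that converts a space-separated class attribute into the equivalent CSS selector. The helper name css_from_classes is hypothetical and is not part of Selenium. In Selenium 4 the resulting string would be passed to driver.find_elements(By.CSS_SELECTOR, ...), since the find_elements_by_* helpers were deprecated there.

```python
def css_from_classes(class_attr, tag=""):
    """Build a CSS selector matching elements that carry every given class.

    Hypothetical helper for illustration only; this is not a Selenium API.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("class_attr must contain at least one class name")
    # Each class becomes ".name"; chaining them requires all classes at once.
    return tag + "".join("." + name for name in classes)


selector = css_from_classes("item-expand-action expand")
print(selector)  # .item-expand-action.expand
```

Note that the CSS form matches the classes in any order and tolerates extra classes on the element, whereas the XPath @class="..." comparison requires the attribute string to match exactly.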

UPD
If you want to print the newly added titles, you can do it like this:

ifDropdown=driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)
newTitles=driver.find_elements_by_class_name('card-title')
for new_title in newTitles:
    print(new_title.text)

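Here, after expanding all the dropdown elements, I get all the titles and then iterate over that list, printing the text of each element.
driver.find_elements_by_class_name returns a list of web elements. You cannot apply .text to a list; you have to iterate over the list and read the text of each element individually.

UPD2
The complete code that opens the dropdowns and prints the titles inside them could look like this.
I did it with Selenium only, without mixing in bs4.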

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

import time
driver = webdriver.Chrome()
actions = ActionChains(driver)

driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)  # crude wait for the page to finish loading

new_titles = set()

productlist=driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    actions.move_to_element(property).perform()
    time.sleep(0.5)
    sessiontitle=property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    print(sessiontitle)
    ifDropdown=property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if(ifDropdown):
        ifDropdown[0].click()
        time.sleep(4)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if(title not in new_titles):
                print(title)
                new_titles.add(title)

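Here I check whether the row has a dropdown. If it does, I open it and then get all the titles that are currently open. For each of those titles, I check whether it is new or was already seen earlier. If the title is new, that is, not yet in the set, I print it and add it to the set.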
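
The set-based check described above can be isolated from the browser code. The following is a small sketch under that assumption; unseen_titles is a hypothetical helper name and the sample titles are placeholders, not data from the site. It also skips empty strings, which Selenium may return as .text for elements that are not currently visible.

```python
def unseen_titles(titles, seen):
    """Return the titles not yet in `seen`, in order, and record them in `seen`."""
    fresh = []
    for title in titles:
        title = title.strip()
        # Ignore blank text and anything already printed in an earlier pass.
        if title and title not in seen:
            seen.add(title)
            fresh.append(title)
    return fresh


seen = set()
# First pass: only the session titles are visible.
print(unseen_titles(["Session A", "Session B"], seen))
# Second pass: a dropdown was opened, so the same list now has extra entries.
print(unseen_titles(["Session A", "Session B", "Talk B1", "Talk B2"], seen))
```

Returning a list instead of printing directly makes it easy to collect the new titles into rows for a dataframe later.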

Answer 2

To get all the data, including date, time, and chairs, you can use requests/BeautifulSoup alone. Selenium is not needed.

import requests
import pandas as pd
from bs4 import BeautifulSoup


data = []
url = "https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p={}"

for page in range(1, 5):  # <-- Increase number of pages here
    with requests.Session() as session:
        soup = BeautifulSoup(session.get(url.format(page)).content, "html.parser")
        for card in soup.select("div.card-block"):
            title = card.find(class_="session-title card-title").get_text()
            date = card.select_one(".internal_date div.property").get_text(strip=True)
            time = card.select_one(".internal_time div.property").get_text()
            try:
                chairs = card.select_one(".persons").get_text(strip=True)
            except AttributeError:
                chairs = "N/A"

            data.append({"title": title, "date": date, "time": time, "chairs": chairs})

df = pd.DataFrame(data)
print(df.to_string())

Output (truncated):

                                                                                                                                                                         title             date           time                                                                    chairs
0                                                                                                                                                Educational sessions on-demand  Thu, 16.09.2021  08:30 - 09:40                                                                       N/A
1                                                                                                                                                    Special Symposia on-demand  Thu, 16.09.2021  12:30 - 13:40                                                                       N/A
2                                                                                                                                          Multidisciplinary sessions on-demand  Thu, 16.09.2021  16:30 - 17:40                                                                       N/A
3                                                                                                                    MSD - Homologous Recombination Deficiency: BRCA and beyond  Fri, 17.09.2021  08:45 - 09:55                       Frederique Penault-Llorca(Clermont-Ferrand, France)
4                                                                                                          Servier - The clinical value of IDH inhibition in cholangiocarcinoma  Fri, 17.09.2021  08:45 - 10:15  Arndt Vogel(Hannover, Germany)Angela Lamarca(Manchester, United Kingdom)
5                                                                                                                   AstraZeneca - Redefining Breast Cancer – Biology to Therapy  Fri, 17.09.2021  08:45 - 10:15                                Ian Krop(Boston, United States of America)
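
In the output above, multiple chairs run together in one string and the date and time are plain text. As a possible post-processing step, the sketch below splits the chairs and parses the time slot using only the standard library. The helper names split_chairs and parse_slot are hypothetical, and the formats are assumed from the rows shown above ("Name(Place)" for chairs, "Fri, 17.09.2021" for dates, "08:45 - 10:15" for times).

```python
import re
from datetime import datetime

# One chair looks like "Name(City, Country)"; entries follow each other directly.
CHAIR_RE = re.compile(r"\s*([^()]+?)\s*\(([^()]*)\)")


def split_chairs(chairs):
    """Split 'Name(Place)Name(Place)' into a list of (name, place) tuples."""
    return CHAIR_RE.findall(chairs)  # "N/A" has no parentheses, so it yields []


def parse_slot(date_text, time_text):
    """Turn 'Fri, 17.09.2021' and '08:45 - 10:15' into (start, end) datetimes."""
    day = date_text.split(",", 1)[1].strip()  # drop the weekday prefix
    start, end = (part.strip() for part in time_text.split("-", 1))
    fmt = "%d.%m.%Y %H:%M"
    return (datetime.strptime(f"{day} {start}", fmt),
            datetime.strptime(f"{day} {end}", fmt))


print(split_chairs("Arndt Vogel(Hannover, Germany)Angela Lamarca(Manchester, United Kingdom)"))
start, end = parse_slot("Fri, 17.09.2021", "08:45 - 10:15")
print(start.isoformat(), end.isoformat())
```

If the site uses other date or time layouts on some rows, strptime raises ValueError, so those rows would need separate handling.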
