Selenium/BeautifulSoup - WebScrape this field
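My code runs fine and prints the title for every row except the rows that have a dropdown.
For example, row 4 shows a dropdown when clicked. I implemented a 'try' that, in theory, clicks the dropdown and then pulls the titles.
But when I call click() and try to print, nothing is printed for the rows with these dropdowns.
Expected output: print all titles, including the ones inside the dropdowns.
A user submitted an answer to this question (link), but his answer uses a different format, and I don't know how to add fields such as date, time, chairs, or the field at the top that says "On-Demand" with his method.
Any approach would be appreciated; I would like to end up with the data in a data frame. Thanks.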
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome()
actions = ActionChains(driver)
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')
new_titles = set()
productlist=driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    actions.move_to_element_with_offset(property,0,0).perform()
    time.sleep(4.5)
    sessiontitle=property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    #print(sessiontitle)
    ifDropdown=property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if(ifDropdown):
        ifDropdown[0].click()
        time.sleep(4)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if(title not in new_titles):
                print(title)
                new_titles.add(title)
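Your problem is with the driver.find_elements_by_class_name('item-expand-action expand') command.
The find_elements_by_class_name('item-expand-action expand') locator is wrong: these web elements have multiple class names, and a class-name locator accepts only a single class. To locate such elements you can use either a css_selector or an XPath.
Also, since several elements have dropdowns, you have to iterate over them to click each one. You can't call .click() on a list of web elements.
So your code should look like this: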
ifDropdown=driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)
As an alternative to the css_selector above, you can also use an XPath:
ifDropdown=driver.find_elements_by_xpath('//a[@class="item-expand-action expand"]')
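The mapping from a compound `class` attribute to a CSS selector can be sketched as follows. The helper `class_attr_to_css` is hypothetical (it is not part of Selenium) and only illustrates that each space-separated class name becomes its own `.name` segment in the selector.

```python
def class_attr_to_css(class_attr: str) -> str:
    """Turn a space-separated class attribute into a chained CSS class selector.

    Hypothetical helper for illustration only; not a Selenium API.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("expected at least one class name")
    # Each class name gets its own leading dot, with no spaces in between.
    return "".join("." + name for name in classes)


selector = class_attr_to_css("item-expand-action expand")
print(selector)  # .item-expand-action.expand
```

The resulting string is what the css_selector locator above expects. A chained class selector matches elements that carry both classes in any order, whereas the XPath form `@class="item-expand-action expand"` compares the whole attribute string exactly.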
UPD
If you want to print the newly added titles, you can do it like this:
ifDropdown=driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)
newTitles=driver.find_elements_by_class_name('card-title')
for new_title in newTitles:
    print(new_title.text)
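Here, after expanding all the dropdown elements, I get all the new titles and then iterate over that list, printing each element's text.
driver.find_elements_by_class_name returns a list of web elements. You can't apply .text to a list; you have to iterate over the list and read the text of each element individually.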
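As a minimal sketch of that point, the snippet below reads `.text` element by element. The `SimpleNamespace` objects are stand-ins for Selenium web elements (only the `.text` attribute is modelled), and `collect_texts` is a hypothetical helper name, so the example runs without a browser.

```python
from types import SimpleNamespace


def collect_texts(elements):
    """Return the .text of every element in a list, skipping empty strings."""
    return [el.text for el in elements if el.text]


# Stand-ins for web elements; a real list would come from a find_elements call.
fake_elements = [
    SimpleNamespace(text="Educational sessions on-demand"),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Special Symposia on-demand"),
]

# The list itself has no .text attribute; only its items do.
print(hasattr(fake_elements, "text"))

titles = collect_texts(fake_elements)
print(titles)
```

Empty strings are skipped here on the assumption that elements which are not currently displayed yield no visible text.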
UPD2
The complete code that opens the dropdowns and prints the titles inside them can look like this.
I did it with Selenium rather than mixing it with bs4.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
import time
# Note: the find_element(s)_by_* helpers used below were removed in Selenium 4.3;
# on newer versions use driver.find_elements(By.XPATH, ...) and similar.
driver = webdriver.Chrome()
actions = ActionChains(driver)
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')
new_titles = set()  # titles that have already been printed
productlist=driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    # Scroll the session card into view before interacting with it
    actions.move_to_element(property).perform()
    time.sleep(0.5)
    sessiontitle=property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    print(sessiontitle)
    # find_elements returns an empty list when the card has no dropdown
    ifDropdown=property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if(ifDropdown):
        ifDropdown[0].click()
        time.sleep(4)
        # Collect every title currently on the page and print only the unseen ones
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if(title not in new_titles):
                print(title)
                new_titles.add(title)
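Here I check whether the row has a dropdown. If it does, I open it and then get all the titles that are currently open. For each of those titles I check whether it is new or was already opened before. If the title is new, that is, not yet in the set, I print it and add it to the set.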
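The set-based check described above can be isolated into a small function. This is only a sketch with plain strings standing in for element texts; `take_new_titles` is a hypothetical name, not part of the answer's code.

```python
def take_new_titles(titles, seen):
    """Return the titles not seen before, in order, and record them in `seen`."""
    fresh = []
    for title in titles:
        if title and title not in seen:
            seen.add(title)
            fresh.append(title)
    return fresh


seen = set()
# First scan: nothing expanded yet, so only the top-level titles are present.
first_pass = take_new_titles(["A", "B"], seen)
# Second scan after opening a dropdown: the old titles appear again plus two new ones.
second_pass = take_new_titles(["A", "B", "B.1", "B.2"], seen)
print(first_pass)   # ['A', 'B']
print(second_pass)  # ['B.1', 'B.2']
```

Because the whole page is rescanned after every click, the set is what keeps titles from earlier rows from being printed again.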
To get all the data, including date, time, and chairs, you can use just requests/BeautifulSoup. Selenium is not needed.
import requests
import pandas as pd
from bs4 import BeautifulSoup
data = []
url = "https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p={}"
# One session is reused for every page request
with requests.Session() as session:
    for page in range(1, 5):  # <-- Increase number of pages here
        soup = BeautifulSoup(session.get(url.format(page)).content, "html.parser")
        # Each session is rendered as one "card-block" element
        for card in soup.select("div.card-block"):
            title = card.find(class_="session-title card-title").get_text()
            date = card.select_one(".internal_date div.property").get_text(strip=True)
            time = card.select_one(".internal_time div.property").get_text()
            try:
                chairs = card.select_one(".persons").get_text(strip=True)
            except AttributeError:
                # select_one returned None: this card lists no chairs
                chairs = "N/A"
            data.append({"title": title, "date": date, "time": time, "chairs": chairs})
df = pd.DataFrame(data)
print(df.to_string())
Output (truncated):
   title                                                                 date             time           chairs
0  Educational sessions on-demand                                        Thu, 16.09.2021  08:30 - 09:40  N/A
1  Special Symposia on-demand                                            Thu, 16.09.2021  12:30 - 13:40  N/A
2  Multidisciplinary sessions on-demand                                  Thu, 16.09.2021  16:30 - 17:40  N/A
3  MSD - Homologous Recombination Deficiency: BRCA and beyond            Fri, 17.09.2021  08:45 - 09:55  Frederique Penault-Llorca(Clermont-Ferrand, France)
4  Servier - The clinical value of IDH inhibition in cholangiocarcinoma  Fri, 17.09.2021  08:45 - 10:15  Arndt Vogel(Hannover, Germany)Angela Lamarca(Manchester, United Kingdom)
5  AstraZeneca - Redefining Breast Cancer – Biology to Therapy           Fri, 17.09.2021  08:45 - 10:15  Ian Krop(Boston, United States of America)
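Instead of hard-coding the page count, the loop could stop at the first page that yields no rows. The sketch below assumes that requesting a page past the end returns no session cards, which the thread does not confirm. `scrape_all_pages`, `fetch_rows`, and `max_pages` are hypothetical names, and a fake in-memory site is used so the loop can run without network access.

```python
def scrape_all_pages(fetch_rows, max_pages=100):
    """Collect rows page by page until a page yields nothing.

    fetch_rows(page) is a hypothetical callable returning a list of dicts
    for one page; max_pages is a safety cap against an endless loop.
    """
    rows = []
    for page in range(1, max_pages + 1):
        page_rows = fetch_rows(page)
        if not page_rows:
            # Assumption: an empty page means the listing has ended.
            break
        rows.extend(page_rows)
    return rows


# Fake three-page site used only to exercise the loop.
fake_site = {
    1: [{"title": "Session A"}, {"title": "Session B"}],
    2: [{"title": "Session C"}],
    3: [],
}

all_rows = scrape_all_pages(lambda page: fake_site.get(page, []))
print(len(all_rows))  # 3
```

In real use, `fetch_rows` would wrap the per-page request and card parsing from the answer and return that page's list of dicts.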