How to extract product titles from a website using Selenium in Python
I am trying to scrape the titles from a website, but it only returns 1 title. How can I get all of the titles?
Below is one of the elements I am trying to fetch using an XPath (starts-with):
<div id="post-4550574" class="post-box " data-permalink="https://hypebeast.com/2019/4/undercover-nike-sfb-mountain-sneaker-release-info" data-title="The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date"><div class="post-box-image-container fixed-ratio-3-2">
Here is my current code:
from selenium import webdriver
import requests
from bs4 import BeautifulSoup as bs
driver = webdriver.Chrome('/Users/Documents/python/Selenium/bin/chromedriver')
driver.get('https://hypebeast.com/search?s=nike+undercover')
element = driver.find_element_by_xpath(".//*[starts-with(@id, 'post-')]")
print(element.get_attribute('data-title'))
Output:
The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date
I was expecting more titles, but only one result is returned.
If the locator finds more than one element, find_element returns the first one. find_elements returns a list of all elements the locator finds.
You can then iterate over the list and get every element.
If all of the elements you are looking for have the class post-box, you can locate them by that class name.
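The first-match vs. all-matches distinction described above is the same one the standard library's ElementTree makes with find vs. findall; as a self-contained sketch (using stand-in markup shaped like the question's post-box divs, not the live page):

```python
import xml.etree.ElementTree as ET

# Stand-in markup mirroring the question's post-box divs
html = '''<root>
<div id="post-1" class="post-box" data-title="Title One"/>
<div id="post-2" class="post-box" data-title="Title Two"/>
</root>'''
root = ET.fromstring(html)

# find (like find_element) returns only the first match
print(root.find('div').get('data-title'))  # Title One

# findall (like find_elements) returns a list of all matches
titles = [d.get('data-title') for d in root.findall('div')]
print(titles)  # ['Title One', 'Title Two']
```

The same pattern applies in Selenium: swap find_element for find_elements and loop over the result.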
To extract the product titles from the website, since the desired elements are JavaScript-rendered you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:
XPATH:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('https://hypebeast.com/search?s=nike+undercover')
print([element.text for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//h2/span")))])
CSS_SELECTOR:
driver.get('https://hypebeast.com/search?s=nike+undercover')
print([element.text for element in WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h2>span")))])
Console output:
['The UNDERCOVER x Nike SFB Mountain Pack Gets a Release Date', 'The UNDERCOVER x Nike SFB Mountain Surfaces in "Dark Obsidian/University Red"', 'A First Look at UNDERCOVER’s Nike SFB Mountain Collaboration', "Here's Where to Buy the UNDERCOVER x Gyakusou Nike Running Models", 'Take Another Look at the Upcoming UNDERCOVER x Nike Daybreak', "Take an Official Look at GYAKUSOU's SS19 Footwear and Apparel Range", 'UNDERCOVER x Nike Daybreak Expected to Hit Shelves This Summer', "The 10 Best Sneakers From Paris Fashion Week's FW19 Runways", "UNDERCOVER FW19 Debuts 'A Clockwork Orange' Theme, Nike & Valentino Collabs", 'These Are the Best Sneakers of 2018']
You don't need Selenium. You can use the faster requests library and target the data-title attribute:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://hypebeast.com/search?s=nike+undercover')
soup = bs(r.content, 'lxml')
titles = [item['data-title'] for item in soup.select('[data-title]')]
print(titles)
If you do want Selenium, the matching syntax is:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://hypebeast.com/search?s=nike+undercover')
titles = [item.get_attribute('data-title') for item in driver.find_elements_by_css_selector('[data-title]')]
print(titles)
Just sharing my experience and what I have used; it may help someone. Simply use:
element.get_attribute('ATTRIBUTE-NAME')
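get_attribute itself needs a live WebDriver, but the idea it relies on — reading any HTML attribute off a matched element — can be shown offline with the standard library's html.parser. The AttrCollector class below is a hypothetical helper for illustration, not part of Selenium:

```python
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collect a chosen attribute from every tag that carries it."""
    def __init__(self, name):
        super().__init__()
        self.name = name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        d = dict(attrs)
        if self.name in d:
            self.values.append(d[self.name])

html = ('<div data-title="Title One"></div>'
        '<div data-title="Title Two"></div>')
p = AttrCollector('data-title')
p.feed(html)
print(p.values)  # ['Title One', 'Title Two']
```

With a live driver, element.get_attribute('data-title') plays the same role for each matched element.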