Python Web Scraping Dynamic Content

Question

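I have been trying to scrape the kith.com search results, but all I get back is the framework's template code instead of the rendered results. I have tried scrapy, requests-html, and selenium, but I could not get any of them to work.

My current code is: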

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://kith.com/pages/search-results-page?q=nike&tab=products&sort_by=created")

r.html.render()
print(r)

As far as I know, render() should give me the HTML as the browser sees it, but I still get the same "raw" code.

P.S.: kith.com is a Shopify store.

Answer 1

Selenium is well suited to this kind of job:

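from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("-headless")  # run Firefox without a visible window
driver = webdriver.Firefox(options=options)
try:
    driver.get('https://kith.com/pages/search-results-page?q=nike&tab=products&sort_by=created')

    # The results are injected by JavaScript, so wait until one is present.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "snize-title"))
    )
    item_titles = driver.find_elements(By.CLASS_NAME, "snize-title")

    print(item_titles[0].text)
    # NIKE WMNS SHOX TL - NOVA WHITE / TEAM ORANGE / SPRUCE AURA
finally:
    driver.quit()  # always close the browser, even on errors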

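Edit:

If you want to capture all of the item information, the div elements with the class snize-overhidden are what you want to grab. You can then iterate over them and their child elements.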
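
The iteration described in the edit could look roughly like the sketch below. It is not a tested scraper for the live site. The helper names collect_items and scrape are hypothetical, and the code assumes that each snize-overhidden element contains its own snize-title element, which the answer does not state. The locator strings are written out as plain constants (they are the same strings as selenium's By.CSS_SELECTOR and By.XPATH) so that collect_items works with any object that offers a Selenium-style find_elements(by, value) method.

```python
# Locator strategy strings; equal to selenium's By.CSS_SELECTOR and By.XPATH.
CSS_SELECTOR = "css selector"
XPATH = "xpath"


def collect_items(root):
    """Return one dict per result card found under `root`.

    `root` is anything with a Selenium-style find_elements(by, value)
    method, for example a WebDriver instance.
    """
    items = []
    for card in root.find_elements(CSS_SELECTOR, ".snize-overhidden"):
        # Assumption: the title element is nested inside the card.
        titles = card.find_elements(CSS_SELECTOR, ".snize-title")
        # "./*" selects the direct child elements of the card.
        children = card.find_elements(XPATH, "./*")
        items.append({
            "title": titles[0].text if titles else None,
            "children": [child.text for child in children if child.text],
        })
    return items


def scrape(url):
    """Open `url` in headless Firefox and return the collected items."""
    # Imported here so collect_items stays usable without a browser set up.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("-headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # The cards are injected by JavaScript; wait until at least one exists.
        WebDriverWait(driver, 10).until(
            lambda d: d.find_elements(CSS_SELECTOR, ".snize-overhidden")
        )
        return collect_items(driver)
    finally:
        driver.quit()
```

Calling scrape() with the search URL from the question should return a list of dictionaries, one per result card, provided the class names on the page still match.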
