BeautifulSoup can't find div with specific class
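For context, I've been trying to learn web scraping so I can collect images for a machine learning project involving a CNN. I've been trying to scrape images from a website (HTML on the left, my code on the right), but without success: my code ends up printing an empty list. Am I doing something wrong?
For what it's worth, I tried finding other div tags that have an 'id' instead of a 'class', and that worked, but for some reason it can't find the ones I'm looking for.
Edit: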
import requests
import urllib3
from bs4 import BeautifulSoup
http = urllib3.PoolManager()
url = 'https://www.grailed.com/shop/EkpEBRw4rw'
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')
img_div = soup.findAll('div', {'class': "listing-cover-photo "})
print(img_div)
Edit 2:
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'https://www.grailed.com/shop/EkpEBRw4rw'
driver = webdriver.Chrome(executable_path='chromedriver.exe')
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
listing = soup.select('.listing-cover-photo ')
for item in listing:
    print(item.select('img'))
Output:
[<img alt="Off-White Off White Caravaggio Hoodie" src="https://process.fs.grailed.com/AJdAgnqCST4iPtnUxiGtTz/cache=expiry:max/rotate=deg:exif/resize=width:480,height:640,fit:crop/output=format:webp,quality:70/compress/https://cdn.fs.grailed.com/api/file/yX8vvvBsTaugadX0jssT"/>]
(...a few more of these...)
[<img alt="Off-White Off-White Arrows Hoodie Black" src="https://process.fs.grailed.com/AJdAgnqCST4iPtnUxiGtTz/cache=expiry:max/rotate=deg:exif/resize=width:480,height:640,fit:crop/output=format:webp,quality:70/compress/https://cdn.fs.grailed.com/api/file/9CMvJoQIRaqgtK0u9ov0"/>]
[]
[]
[]
[]
(...many more empty lists...)
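It looks like the site builds its content with JavaScript, so the listing divs are not in the HTML that a plain HTTP request returns. Try loading the page with Selenium and then parsing the rendered source with Beautiful Soup.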
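from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.grailed.com/shop/EkpEBRw4rw"
# executable_path is the Selenium 3 style; Selenium 4 deprecates it in favor of a Service object.
browser = webdriver.Chrome(executable_path="/path/to/chromedriver.exe")
browser.get(url)  # the browser runs the page's JavaScript
# Parse the rendered HTML rather than the raw HTTP response.
soup = BeautifulSoup(browser.page_source, "html.parser")
items = soup.select(".listing-cover-photo")
print(items)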
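The empty lists in the question's output suggest that some listing blocks had no img child when the page source was captured. Why that happens is not confirmed in this thread; lazy loading is one possible cause. Below is a minimal sketch of collecting the image URLs while skipping those blocks. The helper name `extract_image_urls` and the `SAMPLE_HTML` markup are hypothetical and only stand in for `browser.page_source`.

```python
from bs4 import BeautifulSoup


def extract_image_urls(page_source):
    """Return the src of the first <img> inside each .listing-cover-photo block."""
    soup = BeautifulSoup(page_source, "html.parser")
    urls = []
    for block in soup.select(".listing-cover-photo"):
        img = block.select_one("img")
        # Skip blocks that have no <img> child or whose <img> has no src yet.
        if img is not None and img.get("src"):
            urls.append(img["src"])
    return urls


# Hypothetical, hand-written markup standing in for browser.page_source.
SAMPLE_HTML = """
<div class="listing-cover-photo "><img alt="a" src="https://example.com/a.webp"/></div>
<div class="listing-cover-photo "></div>
<div class="listing-cover-photo "><img alt="b" src="https://example.com/b.webp"/></div>
"""

print(extract_image_urls(SAMPLE_HTML))
```

With Selenium, the same helper could be called as `extract_image_urls(browser.page_source)`. If too many blocks come back without an image, waiting for the elements or scrolling the page before reading the source may help, though that depends on how the site loads its images.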