Can't seem to scrape specific information from webpage?

Question

I'm trying to scrape some information for each item shown on the following page: https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New+Spirits&categoryType=Spirits&top_category=25208&sortBy=0&searchSource=E&pageView=&beginIndex=0#facet:&productBeginIndex:0&orderBy:&pageView:&minPrice:&maxPrice:&pageSize:&

However, I can't seem to access the item information. What I need is each product's name and link. For example, the first item is contained in:

<a class="catalog_item_name" aria-hidden="true" tabindex="-1" id="WC_CatalogEntryDBThumbnailDisplayJSPF_3074457345616901168_link_9b" href="/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&storeId=10051&productId=3074457345616901168&langId=-1&partNumber=000086630prod&errorViewName=ProductDisplayErrorView&categoryId=1351370&top_category=25208&parent_category_rn=25208&urlLangId=&variety=New+Spirits&categoryType=Spirits&fromURL=%2fwebapp%2fwcs%2fstores%2fservlet%2fCatalogSearchResultView%3fstoreId%3d10051%26catalogId%3d10051%26langId%3d-1%26categoryId%3d1351370%26variety%3dNew%2bSpirits%26categoryType%3dSpirits%26top_category%3d25208%26parent_category_rn%3d%26sortBy%3d0%26searchSource%3dE%26pageView%3d%26beginIndex%3d0">Woodford Reserve Master Collection Five Malt Stouted Mash</a>

So the information I want to scrape is:

Woodford Reserve Master Collection Five Malt Stouted Mash

and

/webapp/wcs/stores/servlet/ProductDisplay?catalogId=10051&storeId=10051&productId=3074457345616901168&langId=-1&partNumber=000086630prod&errorViewName=ProductDisplayErrorView&categoryId=1351370&top_category=25208&parent_category_rn=25208&urlLangId=&variety=New+Spirits&categoryType=Spirits&fromURL=%2fwebapp%2fwcs%2fstores%2fservlet%2fCatalogSearchResultView%3fstoreId%3d10051%26catalogId%3d10051%26langId%3d-1%26categoryId%3d1351370%26variety%3dNew%2bSpirits%26categoryType%3dSpirits%26top_category%3d25208%26parent_category_rn%3d%26sortBy%3d0%26searchSource%3dE%26pageView%3d%26beginIndex%3d0

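I'm trying to repeat this for every item on the page. I'm definitely connecting to the page, but for some reason the loop for product in soup.select(...) doesn't pick up anything. Below is a simplified version of my script, which I've been using to try to collect the information from the catalog_item_name elements shown above: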

import random

import requests
from bs4 import BeautifulSoup


user_agent_list = [
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
]
# Pick a random user agent and send it with the request
user_agent = random.choice(user_agent_list)
headers = {'User-Agent': user_agent}

# Note: everything after '#' is a fragment and is not sent to the server
url = 'https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CatalogSearchResultView?storeId=10051&catalogId=10051&langId=-1&categoryId=1351370&variety=New+Spirits&categoryType=Spirits&top_category=25208&sortBy=0&searchSource=E&pageView=&beginIndex=0#facet:&productBeginIndex:0&orderBy:&pageView:&minPrice:&maxPrice:&pageSize:&'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,features="html.parser")
link = []

for product in soup.select('a.catalog_item_name'):
    link.append(product)

print(link)

Any help would be greatly appreciated!

Edit: I tested the script on two other websites and it works fine. Something about this site must be tripping it up?

Answer 1

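I think the best approach here is to inspect the network traffic and query the API directly. The product list does not appear to be part of the initial HTML: for the URL above, the browser sends several POST requests to the API at https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/CategoryProductsListingView, which would explain why soup.select finds nothing in the response to a plain requests.get.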
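One hint that the listing is filled in by client-side script is the #facet:... part of the URL in the question. A URL fragment is handled by the browser and is never sent to the server, so requests.get only asks for the part before the #. The following is a minimal sketch using only the standard library; the URL is a shortened version of the one in the question, and the variable names are just for illustration.

```python
from urllib.parse import parse_qs, urldefrag, urlsplit

# Shortened version of the URL from the question (some query parameters omitted)
url = (
    'https://www.finewineandgoodspirits.com/webapp/wcs/stores/servlet/'
    'CatalogSearchResultView?storeId=10051&catalogId=10051'
    '&categoryId=1351370&beginIndex=0'
    '#facet:&productBeginIndex:0&orderBy:&pageSize:&'
)

# Split off the fragment: only the first part is sent in the HTTP request
sent_to_server, fragment = urldefrag(url)

# The query string is what the server actually sees
query = parse_qs(urlsplit(sent_to_server).query)

print('Requested from the server:', sent_to_server)
print('Handled in the browser only:', fragment)
print('categoryId seen by the server:', query['categoryId'])
```

The values the server does receive (storeId, categoryId) are the same ones that show up as parameters of the API call, so they are a good starting point when reading the network tab.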

I was able to use that endpoint to fetch the product list:

from bs4 import BeautifulSoup
import requests
import urllib.parse

base_url = 'https://www.finewineandgoodspirits.com'
path = '/webapp/wcs/stores/servlet/CategoryProductsListingView?sType=SimpleSearch&resultsPerPage=15&sortBy=0&disableProductCompare=false&ajaxStoreImageDir=%2fwcsstore%2fWineandSpirits%2f&variety=New+Spirits&categoryType=Spirits&ddkey=ProductListingView'
params = {
    'storeId': '10051',
    'categoryId': '1351370',
    'searchType': '1002'
}

headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'some super fancy browser',
}

request_url = base_url + path + '&' + urllib.parse.urlencode(params)
response = requests.post(request_url, headers=headers)
response.raise_for_status()  # fail early if the request was rejected
soup = BeautifulSoup(response.text, 'html.parser')

# Now extract the content from the soup, e.g. as you did above
product_links: list[str] = [base_url + a['href'] for a in soup.select('a.catalog_item_name')]
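Since the question asks for both the name and the link of each product, the same selector can return the two together. This is only a sketch: extract_products is a hypothetical helper, and sample_html is a made-up fragment standing in for response.text, assuming the API returns anchors shaped like the one quoted in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.finewineandgoodspirits.com'


def extract_products(html, base_url=BASE_URL):
    """Return (name, absolute_link) pairs for every a.catalog_item_name with an href."""
    soup = BeautifulSoup(html, 'html.parser')
    return [
        (a.get_text(strip=True), urljoin(base_url, a['href']))
        for a in soup.select('a.catalog_item_name[href]')
    ]


# Made-up sample markup (not real site output), used in place of response.text
sample_html = '''
<div><a class="catalog_item_name" href="/webapp/wcs/stores/servlet/ProductDisplay?productId=1&amp;langId=-1"> Example Bourbon </a></div>
<div><a class="catalog_item_name" href="/webapp/wcs/stores/servlet/ProductDisplay?productId=2&amp;langId=-1">Example Rye</a></div>
<a class="other" href="/ignored">Not a product</a>
'''

products = extract_products(sample_html)
for name, link in products:
    print(name, link)
```

With a real response, calling extract_products(response.text) should give one pair per product, provided the markup matches this assumption.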

python

beautifulsoup

web-scraping