BeautifulSoup not returning results of a search on a website

Question

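I am trying to get the links to the individual search results on a website (the National Gallery of Art). However, the search URL does not load the search results. Here is how I tried to do it: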

import requests
from bs4 import BeautifulSoup

url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

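I expected the links to the individual results to show up in soup.findAll('a'), but they do not appear. Instead, the last link in the output points to an empty search result page: https://www.nga.gov/content/ngaweb/collection-search-result.html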

How can I get a list of links in which the first is the first search result (https://www.nga.gov/collection/art-object-page.52389.html), the second is the second search result (https://www.nga.gov/collection/art-object-page.52085.html), and so on?

Answer 1

This seems to work for me:


from bs4 import BeautifulSoup
import requests

url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# href=True skips <a> tags without an href, which would otherwise raise KeyError
for a in soup.find_all('a', href=True):
    print(a['href'])

It prints the href of every link in the returned HTML.

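The links to the search results, specifically, are loaded via AJAX, so they are not in the HTML that requests receives. To get them this way you need something that executes the page's JavaScript, such as headless Chrome. One way to do this is described here, and it fits your use case well: http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/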

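If you want to ask how to render JavaScript from Python and then parse the result, you should close this question and open a new one, because this one is not scoped correctly for that.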
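The headless-browser route mentioned above could look roughly like the sketch below. It is not taken from the linked article: it assumes Selenium 4 with a local Chrome installation, and it assumes that result links can be recognized by "art-object-page" in their href (based on the URLs in the question). The helper names filter_object_links and collect_rendered_hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nga.gov"


def filter_object_links(hrefs, base_url=BASE_URL):
    """Keep only art-object links, make them absolute, and drop duplicates."""
    links = []
    for href in hrefs:
        # Assumption: result links contain "art-object-page" in their href
        if href and "art-object-page" in href:
            absolute = urljoin(base_url, href)
            if absolute not in links:
                links.append(absolute)
    return links


def collect_rendered_hrefs(url, timeout=15):
    """Load the page in headless Chrome and return every href after rendering."""
    # Selenium is third-party; it is imported here so that
    # filter_object_links can be used without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the AJAX call has inserted at least one result link
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(
                (By.CSS_SELECTOR, "a[href*='art-object-page']")
            )
        )
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return [a.get_attribute("href") for a in anchors]
    finally:
        driver.quit()


if __name__ == "__main__":
    search_url = (
        "https://www.nga.gov/collection-search-result.html"
        "?artist=C%C3%A9zanne%2C%20Paul"
    )
    for link in filter_object_links(collect_rendered_hrefs(search_url)):
        print(link)
```

If you would rather keep parsing with BeautifulSoup, driver.page_source holds the rendered HTML and can be passed to BeautifulSoup in place of response.text.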

Answer 2

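Actually, the data comes from the JSON response of an API call. The code below requests that endpoint directly and prints the desired list of links.

Code: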

import requests

# JSON endpoint that the search page calls via AJAX (page 1, 30 results per page)
url = 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'
r = requests.get(url)

for item in r.json()['results']:
    # Each result carries a site-relative path; prefix the host to make it absolute
    abs_url = f"https://www.nga.gov{item['url']}"
    print(abs_url)

Output:

https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html 
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
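The request above returns only the first 30 results. A possible way to walk through further pages is sketched below. It rests on several assumptions that the answer does not confirm: that pageSize__ and pageNumber__ in the URL can be substituted freely, that the trailing _ parameter is only a cache-busting timestamp and can be left out, and that a page with no results marks the end. The function names are hypothetical.

```python
from urllib.parse import quote, urlencode, urljoin

BASE = "https://www.nga.gov"
# Path copied from the URL in the answer; treating the page size and page
# number as substitutable parts of the path is an assumption.
ENDPOINT = (
    "/collection-search-result/jcr:content/parmain/facetcomponent/parList/"
    "collectionsearchresu.pageSize__{size}.pageNumber__{page}.json"
)


def build_results_url(artist, page, size=30):
    """Build the JSON endpoint URL for one page of search results."""
    # quote_via=quote encodes spaces as %20, matching the original URL
    query = urlencode({"artist": artist}, quote_via=quote)
    return f"{BASE}{ENDPOINT.format(size=size, page=page)}?{query}"


def links_from_payload(payload):
    """Turn the site-relative 'url' of each result into an absolute link."""
    return [urljoin(BASE, item["url"]) for item in payload.get("results", [])]


def fetch_all_links(artist, max_pages=5):
    """Request pages one by one until a page comes back empty."""
    # requests is third-party; it is imported here so that the two helpers
    # above can be used without it.
    import requests

    links = []
    for page in range(1, max_pages + 1):
        response = requests.get(build_results_url(artist, page), timeout=30)
        response.raise_for_status()
        page_links = links_from_payload(response.json())
        if not page_links:  # assumption: an empty page means no more results
            break
        links.extend(page_links)
    return links


if __name__ == "__main__":
    for link in fetch_all_links("Cézanne, Paul"):
        print(link)
```

The max_pages cap is only a safety limit so the loop cannot run indefinitely if the endpoint behaves differently than assumed.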

python

beautifulsoup

web-search