How do I get all hrefs in a ul on a page with a scrollbar?
I want to get all the hrefs from the li elements in this ul:
[screenshot]
This is what I have written so far:
import bs4, requests, re

product_pages = []

def get_product_pages(openurl):
    global product_pages
    url = 'https://www.ah.nl/producten/aardappel-groente-fruit'
    res = requests.get(url)
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    for li in soup.findAll('li', attrs={'class': 'taxonomy-sub-selector_root__3rtWx'}):
        for a in li.findAll('a', href=True):
            print(a.attrs['href'])

get_product_pages('')
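But it only gives me the hrefs from the first three li elements. Why only the first three, and how can I get all eight?
The page has a scrollbar; could that be the problem?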
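The taxonomy and all other page data are stored as a JavaScript object inside a <script> tag in the page, so BeautifulSoup does not find them as HTML elements. To get all sub-taxonomies of the current category, you can use the next example (it parses the <script> tag with re/json):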
import re
import json
import requests

base_url = "https://www.ah.nl/producten"
url = base_url + "/aardappel-groente-fruit/fruit"

html_doc = requests.get(url).text

# The page state is a JavaScript object literal assigned inside a <script> tag.
data = re.search(r"window\.__INITIAL_STATE__= ({.*})", html_doc)
# JSON has no `undefined`, so replace it with null before parsing.
data = data.group(1).replace("undefined", "null")
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# Index every taxonomy node by its id.
taxonomies = {t["id"]: t for t in data["taxonomy"]["topLevel"]}
for t in data["taxonomy"]["taxonomies"]:
    taxonomies[t["id"]] = t


def get_taxonomy(t, current, dupl=None):
    # Yield the URL of node t, then recurse into children not visited yet.
    if dupl is None:
        dupl = set()
    tmp = current + "/" + t["slugifiedName"]
    yield tmp
    for c in t["children"]:
        if c in taxonomies and c not in dupl:
            dupl.add(c)
            yield from get_taxonomy(taxonomies[c], tmp, dupl)


for t in taxonomies.values():
    if t["parents"] == [0]:  # start from nodes whose parent list is [0]
        for link in get_taxonomy(t, base_url):
            if url in link:  # print only URLs from the current category
                print(link)
Prints:
https://www.ah.nl/producten/aardappel-groente-fruit/fruit
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/appels
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/appels/groente-en-fruitbox
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/bananen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/sinaasappels-mandarijnen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/peren
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/ananas-mango-kiwi
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/aardbeien-frambozen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/druiven-kersen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/bramen-bessen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/abrikozen-pruimen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/abrikozen-pruimen/exotisch-fruit
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/perziken-nectarines
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/meloen-kokosnoot
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/grapefruit-minneola
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/citroen-limoen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/fruit-spread
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/vijgen
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/kaki-papaya-cherimoya
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/granaatappel-passiefruit
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/fruitsalade-mix
https://www.ah.nl/producten/aardappel-groente-fruit/fruit/gedroogd-fruit
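To try the same approach without hitting the live site, here is a minimal offline sketch. It assumes the page embeds its state on a single line as `window.__INITIAL_STATE__= {...}`, as in the code above. The `SAMPLE_HTML` string, its ids and slugs, and the helper names `extract_state`, `index_taxonomies`, `walk` and `category_urls` are made up for illustration and are not taken from the real page.

```python
import json
import re

# Hypothetical stand-in for the downloaded page: the state object sits in a
# <script> tag on one line and contains a JavaScript `undefined` literal.
SAMPLE_HTML = """
<html><body>
<ul><li><a href="/producten/fruit">Fruit</a></li></ul>
<script>
window.__INITIAL_STATE__= {"user": undefined, "taxonomy": {"topLevel": [{"id": 1, "slugifiedName": "fruit", "parents": [0], "children": [2, 3]}], "taxonomies": [{"id": 2, "slugifiedName": "appels", "parents": [1], "children": [4]}, {"id": 3, "slugifiedName": "bananen", "parents": [1], "children": []}, {"id": 4, "slugifiedName": "groente-en-fruitbox", "parents": [2], "children": []}]}}
</script>
</body></html>
"""


def extract_state(html):
    """Pull the embedded state object out of the HTML and parse it as JSON."""
    match = re.search(r"window\.__INITIAL_STATE__= (\{.*\})", html)
    if match is None:
        raise ValueError("embedded state not found")
    # Blunt replacement: it would also change string values containing
    # the word "undefined", which is acceptable for this sketch.
    return json.loads(match.group(1).replace("undefined", "null"))


def index_taxonomies(state):
    """Map every taxonomy id to its node, top-level nodes included."""
    nodes = {t["id"]: t for t in state["taxonomy"]["topLevel"]}
    for t in state["taxonomy"]["taxonomies"]:
        nodes[t["id"]] = t
    return nodes


def walk(node, prefix, nodes, seen=None):
    """Yield the URL of `node`, then the URLs of its unvisited descendants."""
    if seen is None:
        seen = set()
    path = prefix + "/" + node["slugifiedName"]
    yield path
    for child_id in node["children"]:
        if child_id in nodes and child_id not in seen:
            seen.add(child_id)
            yield from walk(nodes[child_id], path, nodes, seen)


def category_urls(html, base_url):
    """Collect the URLs of all taxonomy nodes reachable from top-level nodes."""
    nodes = index_taxonomies(extract_state(html))
    urls = []
    for node in nodes.values():
        if node["parents"] == [0]:
            urls.extend(walk(node, base_url, nodes))
    return urls


urls = category_urls(SAMPLE_HTML, "https://www.ah.nl/producten")
for u in urls:
    print(u)
```

The `ValueError` is raised when the regular expression finds nothing, which would happen if the site renames the variable or changes how the state is embedded.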