How to scrape dynamic content with beautifulsoup?

This is my script:

import warnings
warnings.filterwarnings("ignore")

import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']


TypeVendor = []
NameVendor = []
Marques = []
Brands = []
Refs = []
Prices = []
#Carts = []
#Links = []
Links = []


#df = pd.read_csv('testlink4.csv')

n=1

for url in URLs:

    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")

    TypeVendor.append('Distributeur')

    NameVendor.append('Frayssinet')

    Marques.append('Longines')

    Brands.append(soup.find('span', class_ = 'main-detail__name').text)

    Refs.append(soup.find('span', class_ = 'main-detail__ref').text)

    Prices.append(soup.find('span', class_ = 'prix').text)

    Links.append(url)

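I understand why it doesn't work: .text doesn't work for dynamic content. But I can't figure out how to scrape this kind of content. I know that if you find where the JSON data is stored, you can adapt the request and scrape the data from there.

But I checked Google's developer tools, and I didn't find anything in the Network tab.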

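Send a User-Agent header with your request, and store your information in a more structured way, for example as a list of dicts.

Example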

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']

data = []
for url in URLs:
    # Send a browser-like User-Agent with the request
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    # Collect one dict per product page
    data.append({
        'name': soup.find('span', class_='main-detail__name').get_text(strip=True),
        'brand': soup.find('span', class_='main-detail__marque').get_text(strip=True),
        'ref': soup.find('span', class_='main-detail__ref').get_text(strip=True),
        # The price is read from the content attribute of the itemprop="price" span
        'price': soup.find('span', {'itemprop': 'price'}).get('content'),
        'url': url
    })

# One row per URL
pd.DataFrame(data)

Output

name | brand | ref | price | url
Montre The Longines Legend Diver L3.774.4.30.2 | Longines | Référence : L3.774.4.30.2 | 2240 | https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2
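
One caveat with the approach above: when a selector does not match anything, soup.find returns None, and calling .text or .get_text on it raises AttributeError. A minimal sketch of a guard follows. The helper names text_or_default and parse_product are hypothetical, and sample_html is made-up markup for illustration only, not the real page:

```python
from bs4 import BeautifulSoup


def text_or_default(tag, default=None):
    """Return the stripped text of a tag, or `default` if find() returned None."""
    return tag.get_text(strip=True) if tag is not None else default


def parse_product(html):
    """Extract product fields from one page; fields that are missing become None."""
    soup = BeautifulSoup(html, "html.parser")
    price_tag = soup.find("span", {"itemprop": "price"})
    return {
        "name": text_or_default(soup.find("span", class_="main-detail__name")),
        "ref": text_or_default(soup.find("span", class_="main-detail__ref")),
        "price": price_tag.get("content") if price_tag is not None else None,
    }


# Hypothetical markup (assumption): the ref span is deliberately left out
# to show that a missing element yields None instead of an exception.
sample_html = """
<span class="main-detail__name">Montre The Longines Legend Diver</span>
<span itemprop="price" content="2240">2 240 EUR</span>
"""

product = parse_product(sample_html)
print(product)
```

In the loop above, parse_product(results.text) could stand in for the inline soup.find calls, so that a page with a missing field produces a None value in the DataFrame rather than stopping the whole run.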