How to scrape dynamic content with beautifulsoup?

Question

Here is my script:

import warnings
warnings.filterwarnings("ignore")

import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']


TypeVendor = []
NameVendor = []
Marques = []
Brands = []
Refs = []
Prices = []
#Carts = []
#Links = []
Links = []


#df = pd.read_csv('testlink4.csv')

n=1

for url in URLs:

    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")

    TypeVendor.append('Distributeur')

    NameVendor.append('Frayssinet')

    Marques.append('Longines')

    Brands.append(soup.find('span', class_ = 'main-detail__name').text)

    Refs.append(soup.find('span', class_ = 'main-detail__ref').text)

    Prices.append(soup.find('span', class_ = 'prix').text)

    Links.append(url)

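I understand why it doesn't work: .text doesn't apply to dynamic content. But I can't figure out how to scrape this kind of content. I know that if you find where the JSON data is stored, you can work with it and scrape the data from there.

But I checked the Network tab in Google's developer tools and didn't find anything.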

Answer 1

Set headers on your request (for example a User-Agent) and store your information in a more structured way.

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']

data = []
for url in URLs:
    # Send a browser-like User-Agent with the request
    results = requests.get(url, headers=headers)
    soup = BeautifulSoup(results.text, "html.parser")
    # Collect one dict per product page so the rows map directly onto DataFrame columns
    data.append({
        'name': soup.find('span', class_='main-detail__name').get_text(strip=True),
        'brand': soup.find('span', class_='main-detail__marque').get_text(strip=True),
        'ref': soup.find('span', class_='main-detail__ref').get_text(strip=True),
        # The price is read from the content attribute of the itemprop="price" span
        'price': soup.find('span', {'itemprop': 'price'}).get('content'),
        'url': url
    })

pd.DataFrame(data)

Output

name	brand	ref	price	url
Montre The Longines Legend Diver L3.774.4.30.2	Longines	Référence : L3.774.4.30.2	2240	https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2
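
One thing to watch for with this pattern: soup.find returns None when nothing matches, so chaining .get_text() or .get() onto it raises AttributeError if the server sends back a page without the expected markup. The following is only a sketch of one way to guard against that. The helper names (text_or_none, attr_or_none, parse_product) and the sample HTML are assumptions made up for illustration, modelled on the class names used above; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used above; the real page may differ.
SAMPLE_HTML = """
<div>
  <span class="main-detail__name"> Montre The Longines Legend Diver L3.774.4.30.2 </span>
  <span class="main-detail__marque">Longines</span>
  <span itemprop="price" content="2240">2 240 EUR</span>
</div>
"""


def text_or_none(soup, name, **kwargs):
    """Return the stripped text of the first matching tag, or None if nothing matches."""
    tag = soup.find(name, **kwargs)
    return tag.get_text(strip=True) if tag is not None else None


def attr_or_none(soup, name, attrs, attribute):
    """Return one attribute of the first matching tag, or None if nothing matches."""
    tag = soup.find(name, attrs=attrs)
    return tag.get(attribute) if tag is not None else None


def parse_product(html):
    """Build one record per page; missing fields become None instead of raising."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": text_or_none(soup, "span", class_="main-detail__name"),
        "brand": text_or_none(soup, "span", class_="main-detail__marque"),
        # The sample markup deliberately has no reference span, so this is None.
        "ref": text_or_none(soup, "span", class_="main-detail__ref"),
        "price": attr_or_none(soup, "span", {"itemprop": "price"}, "content"),
    }


product = parse_product(SAMPLE_HTML)
print(product)
```

In the loop above, parse_product(results.text) could stand in for the inline dict, and a record full of None values would then be a hint that the response did not contain the expected page.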
