如何使用 beautifulsoup 抓取动态内容?
How to scrape dynamic content with beautifulsoup?
这是我的脚本:
import warnings
warnings.filterwarnings("ignore")
import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']
TypeVendor = []
NameVendor = []
Marques = []
Brands = []
Refs = []
Prices = []
#Carts = []
#Links = []
Links = []
#df = pd.read_csv('testlink4.csv')
n=1
for url in URLs:
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
TypeVendor.append('Distributeur')
NameVendor.append('Frayssinet')
Marques.append('Longines')
Brands.append(soup.find('span', class_ = 'main-detail__name').text)
Refs.append(soup.find('span', class_ = 'main-detail__ref').text)
Prices.append(soup.find('span', class_ = 'prix').text)
Links.append(url)
我明白为什么它不起作用,text
不适用于动态内容。但我无法弄清楚如何抓取此类内容。我知道如果你找到 json 数据的存储位置,你可以调整它并抓取数据。
但我检查了 google 开发人员工具,在网络选项卡上,我没有找到任何东西。
将headers
设置为您的请求并以更结构化的方式存储您的信息。
例子
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']
data = []
for url in URLs:
results = requests.get(url,headers=headers)
soup = BeautifulSoup(results.text, "html.parser")
data.append({
'name': soup.find('span', class_ = 'main-detail__name').get_text(strip=True),
'brand': soup.find('span', class_ = 'main-detail__marque').get_text(strip=True),
'ref':soup.find('span', class_ = 'main-detail__ref').get_text(strip=True),
'price':soup.find('span', {'itemprop':'price'}).get('content'),
'url':url
})
pd.DataFrame(data)
输出
name
brand
ref
price
url
Montre The Longines Legend Diver L3.774.4.30.2
Longines
Référence : L3.774.4.30.2
2240
https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2
这是我的脚本:
import warnings
warnings.filterwarnings("ignore")
import re
import json
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']
TypeVendor = []
NameVendor = []
Marques = []
Brands = []
Refs = []
Prices = []
#Carts = []
#Links = []
Links = []
#df = pd.read_csv('testlink4.csv')
n=1
for url in URLs:
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
TypeVendor.append('Distributeur')
NameVendor.append('Frayssinet')
Marques.append('Longines')
Brands.append(soup.find('span', class_ = 'main-detail__name').text)
Refs.append(soup.find('span', class_ = 'main-detail__ref').text)
Prices.append(soup.find('span', class_ = 'prix').text)
Links.append(url)
我明白为什么它不起作用,text
不适用于动态内容。但我无法弄清楚如何抓取此类内容。我知道如果你找到 json 数据的存储位置,你可以调整它并抓取数据。
但我检查了 google 开发人员工具,在网络选项卡上,我没有找到任何东西。
将headers
设置为您的请求并以更结构化的方式存储您的信息。
例子
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0'}
URLs = ['https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2']
data = []
for url in URLs:
results = requests.get(url,headers=headers)
soup = BeautifulSoup(results.text, "html.parser")
data.append({
'name': soup.find('span', class_ = 'main-detail__name').get_text(strip=True),
'brand': soup.find('span', class_ = 'main-detail__marque').get_text(strip=True),
'ref':soup.find('span', class_ = 'main-detail__ref').get_text(strip=True),
'price':soup.find('span', {'itemprop':'price'}).get('content'),
'url':url
})
pd.DataFrame(data)
输出
name | brand | ref | price | url |
---|---|---|---|---|
Montre The Longines Legend Diver L3.774.4.30.2 | Longines | Référence : L3.774.4.30.2 | 2240 | https://www.frayssinet-joaillier.fr/fr/p/montre-the-longines-legend-diver-l37744302-bdc2 |