How to extract all values from dynamic HTML with Python?
I can extract some of the values I need from the HTML, but not all of them. How can I get the complete set of values with Python?
# Notebook cells: install the parsing dependencies first
!pip install beautifulsoup4
!pip install lxml

import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
output = []
url = "https://m.pcone.com.tw/store/0670386?ref=d_item_store"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, "lxml")

for product in soup.find_all("a", class_='product-list-item'):
    productname = product.find("div", class_='name limit-2-line').get_text(strip=True)
    # select_one takes a CSS selector, so the class belongs in the selector string
    productprice = product.select_one("span.symbol-price").string
    # Drop the 3-character suffix from the order count text when the span exists
    count_tag = product.find('span', class_="order_count")
    ordercount = count_tag.string[:-3] if count_tag else None
    print(f'{productname}:{productprice}:{ordercount}')
    output.append([productname, productprice, ordercount])

df = pd.DataFrame(output, columns=['商品名稱', '價格', '購買人數'])
df.to_excel('松果-瑞昌.xlsx', index=False)
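The data is actually loaded dynamically by JavaScript from an API call that returns JSON, which is why BeautifulSoup cannot get it from the static HTML. A minimal working solution that queries the API using only requests is as follows: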
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36'}
params = {
    'items_per_page': '20',
    'null': '',
    'page': '1',
    'sortBy': 'default',
    'sortDir': 'desc',
    'store_id': '0670386'
}
output = []
#url = "https://m.pcone.com.tw/store/0670386?ref=d_item_store"
api_url = 'https://www.pcone.com.tw/api/filterSearchTP'

# Request pages 1 to 13 of the store listing
for i in range(1, 14):
    # Advance the 'page' query parameter (assumed to be the one the API pages on)
    params['page'] = i
    resp = requests.get(api_url, headers=headers, params=params).json()
    for item in resp['products']:
        productname = item['name']
        productprice = item['msrp']
        ordercount = item['order_count']
        output.append([productname, productprice, ordercount])

df = pd.DataFrame(output, columns=['商品名稱', '價格', '購買人數'])
df.to_excel('松果-瑞昌.xlsx', index=False)
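If you would rather not hard-code the number of pages, one option is to keep requesting pages until one comes back empty. The following is only a sketch using the standard library: the helper names fetch_page and collect_products are hypothetical, and it assumes that 'page' is the paging parameter and that the API returns an empty 'products' list past the last page, neither of which I have verified against the site.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.pcone.com.tw/api/filterSearchTP"
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_page(page, store_id="0670386", items_per_page=20):
    """Fetch one page of the store listing and return its 'products' list."""
    query = urlencode({
        "items_per_page": items_per_page,
        "null": "",
        "page": page,  # assumption: this parameter selects the page
        "sortBy": "default",
        "sortDir": "desc",
        "store_id": store_id,
    })
    request = Request(f"{API_URL}?{query}", headers=HEADERS)
    with urlopen(request, timeout=10) as response:
        payload = json.load(response)
    # A missing 'products' key is treated the same as an empty page
    return payload.get("products", [])


def collect_products(fetch=fetch_page, max_pages=100):
    """Request pages 1, 2, 3, ... and stop at the first empty page."""
    rows = []
    for page in range(1, max_pages + 1):
        products = fetch(page)
        if not products:  # assumption: an empty page means there is no more data
            break
        for item in products:
            rows.append([item["name"], item["msrp"], item["order_count"]])
    return rows
```

Calling rows = collect_products() gives a list you can pass to pd.DataFrame with the same column names as above. The max_pages cap is there only so the loop cannot run forever if the API never returns an empty page.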