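Scraping product URLs under a specific ZIP code
I am trying to scrape product links under ZIP code 08041. I have written code that scrapes products without a ZIP code, but I don't know how to build and send the request for products under 08041.
Here is my code: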
import requests
import time
from bs4 import BeautifulSoup


def helloworld(url):
    r = requests.get(url)
    print('Status', r.status_code)
    # time.sleep(8)
    soup = BeautifulSoup(r.content, 'html.parser')
    # Select <a> tags whose CSS class is "name".
    post = soup.find_all('a', "name")
    for link in post:
        href = link.get('href')
        # Keep hrefs whose second character is "p"; skip missing or too-short hrefs.
        if href and len(href) > 1 and href[1] == 'p':
            print(href)


def page_counter():
    url1 = "https://soysuper.com/c/aperitivos#products"
    print(url1, '\n')
    helloworld(url1)


page_counter()
You can use the backend endpoint to simulate a request with a given ZIP code.
Note: the cookie is hard-coded, but it is valid for a year.
Here's how:
import requests
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.105 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
"Cookie": "soysuper=eyJjYXJ0IjoiNjA2NWNkMzg5ZDI5YzkwNDU1NjI3MzYzIiwiZXhwaXJlcyI6MTY0ODg0MTMzOSwib3JpZCI6IkM2NzgwOUYyLTkyRUYtMTFFQi04NjNELTgzMTBCMUUwMTM2NiIsInNtIjoiIiwidXVpZCI6IkIwQjYxQzRFLTkyRUYtMTFFQi05MjRCLTA5MTFCMUUwMTM2NiIsIndoIjpbIjU0MDQ5MjEwMDk1Y2ZhNTQ2YzAwMDAwMCIsIjRmZjMwZTZhNTgzMmU0OGIwMjAwMDAwMCIsIjU5Y2JhZmE2OWRkNGU0M2JmMzIwODM0MiIsIjRmMzEyNzU4ZTNjNmIzMDAzMjAwMDAwMCIsIjVhMTZmNjdhMjUwOGMxNGFiMzE0OTY4MyIsIjYwMjQxNTEzNzIyZDZhNTZkNDZlMjhmNyIsIjRmZjMwZTJkYzI3ZTk1NTkwMjAwMDAwMSIsIjU5ZjcxYTZlNjI4YWIwN2UyYjJjZmJhMSIsIjU5Y2JhZjNjOWRkNGU0M2JmMzIwODM0MSIsIjVhMGU0NDFhNTNjOTdiM2UxNDYyOGEzNiIsIjRmMmJiZmI3ZWJjYjU1OGM3YjAwMDAwMCIsIjYwNDExZjJlNzIyZDZhMTEyZDVjYTNlYiIsIjViMWZmZjAyNzI1YTYxNzBjOTIxMjc0MSIsIjVlNzk2NWUwZDc5MTg3MGU0NTA1MGMwMCIsIjVkMTI0NDQ2OWRkNGU0NGFkMDU3MmMxMSJdLCJ6aXAiOiIwODA0MSJ9--166849121eece159a6fdb0c0fe8341032321d9b1;"
}
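with requests.Session() as connection:
    # Select the supermarket for the ZIP code; the response carries a
    # Next-Request-Id header that the follow-up request sends back as Request-Id.
    r = connection.get("https://soysuper.com/supermarket?zipcode=08041", headers=headers)
    headers["Request-Id"] = r.headers["Next-Request-Id"]
    headers["Referer"] = "https://soysuper.com/c/aperitivos"
    # Request the first page of the category as JSON.
    products_data = connection.get("https://soysuper.com/c/aperitivos?products=1&page=1", headers=headers).json()
    print(products_data["products"]["total"])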
Output (the total number of products for ZIP code 08041):
2923
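What you actually get back is JSON containing all the product data for the given page. You can see the same response in the Network tab of your browser's developer tools.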
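The exact layout of that JSON is not shown here, so one way to find your bearings is to print a skeleton of its keys before writing any extraction code. The helper below is only a sketch: the name `outline` and the sample dictionary (including its `items` key) are made up for illustration and are not taken from the site's real response.

```python
def outline(data, depth=2, prefix=""):
    """Return dotted key paths found in nested dicts/lists, up to `depth` levels."""
    paths = []
    if depth == 0:
        return paths
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.append(f"{path} ({type(value).__name__})")
            paths.extend(outline(value, depth - 1, path))
    elif isinstance(data, list) and data:
        # Only inspect the first element; list items usually share one shape.
        paths.extend(outline(data[0], depth, f"{prefix}[0]"))
    return paths


# Hypothetical sample; the real response will have different contents.
sample = {"products": {"total": 2923, "pager": {}, "items": [{"name": "x"}]}}
for line in outline(sample, depth=3):
    print(line)
```

Calling `outline(products_data)` on the real response should show where the product list and the paging information live.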
Note the pager key. Use it to "paginate" the API and fetch more product data.
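The shape of `pager` is not shown above, so the loop below keeps that detail behind a small callback. It is only a sketch: `iter_pages`, `fetch_page`, and `has_next` are illustrative names, and the `current`/`last` fields in the demo are an assumed layout that should be replaced with whatever the real `pager` object contains.

```python
def iter_pages(fetch_page, has_next, start=1, max_pages=500):
    """Yield page payloads until has_next(payload) is false or max_pages is reached."""
    for page in range(start, start + max_pages):
        payload = fetch_page(page)
        yield payload
        if not has_next(payload):
            break


# Demo with fake data standing in for the JSON responses (assumed pager layout).
def fake_fetch(page):
    return {"products": {"pager": {"current": page, "last": 3}}}


def pager_has_next(payload):
    pager = payload["products"]["pager"]
    return pager["current"] < pager["last"]


pages = list(iter_pages(fake_fetch, pager_has_next))
print(len(pages))
```

With the session from the answer, `fetch_page` could be a function that calls `connection.get(...)` with the page number substituted into the `page=` query parameter and returns `.json()`. The `max_pages` cap is there so a wrong `has_next` cannot loop forever, and a short `time.sleep` between calls keeps the load on the site reasonable.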