How can I scrape Amazon data based on location?
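Every time I try to scrape amazon.com, I fail, because the product information on amazon.com changes depending on the location.
The information that changes is:
- Price
- Shipping fee
- Customs fee
- Shipping status
Changing the location with Selenium is easy, but it is very slow. That is why I need to scrape with Scrapy or Requests.
However, even though I mimic the browser's cookies and headers, amazon.com does not let me change the location.
There are two big problems.
- There is a value named "ubid-main" that I cannot reproduce. Without this value, Amazon does not allow the location to be changed.
- Although I send the same header data, the outgoing data still differs. Example: I use exactly the same headers as the browser, but in the browser the Content-Type is json, whereas in the code I wrote it goes out as text/html;charset=UTF-8.
Strangely, there is no information about this topic. You cannot do location-based scraping of the world's number one shopping site.
Could anyone who knows the answer please help?
A solution using Scrapy or Requests would be enough.
Seriously, I have been unable to solve this problem for a year.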
import requests
from lxml import etree
from random import choice
from urllib3.exceptions import InsecureRequestWarning
import urllib.parse
import urllib3.request

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)


def location():
    headersdelivery = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
        'content-type': 'application/x-www-form-urlencoded',
        'accept': 'text/html,*/*',
        'x-requested-with': 'XMLHttpRequest',
        'contenttype': 'application/x-www-form-urlencoded;charset=utf-8',
        'origin': 'https://www.amazon.com',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'cors',
        'sec-fetch-dest': 'empty',
        'referer': 'https://www.amazon.com/',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7'
    }
    payload = {
        'locationType': 'LOCATION_INPUT',
        'zipCode': '34249',
        'storeContext': 'generic',
        'deviceType': 'web',
        'pageType': 'Gateway',
        'actionSource': 'glow',
        'almBrandId': 'undefined'}
    sessionid = requests.session()
    url = "https://www.amazon.com/gp/delivery/ajax/address-change.html"
    ulkesecmereq = sessionid.post(url, headers=headersdelivery, data=payload, verify=False)
    return sessionid


def response(locationsession):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'TE': 'Trailers'}
    postdata = {
        'storeContext': 'generic',
        'pageType': 'Gateway'
    }
    req = locationsession.post("https://www.amazon.com/gp/glow/get-location-label.html", headers=headers, data=postdata, verify=False)
    print(req.content)


locationsession = location()
response(locationsession)
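I see a CSRF token (anti-csrftoken-a2z) in the headers. You missed it in the location request, and you also missed another location-related request (https://www.amazon.co.uk/gp/glow/get-address-selections.html?deviceType=desktop&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal). You should implement all of the requests exactly as the browser makes them.
A simple example in Chrome:
Chrome -> DevTools -> Network -> XHR
Copy as cURL
Paste it here and convert it to code for the requests library (https://curl.trillworks.com/).
First, get the anti-csrftoken-a2z token from the Amazon base page:
1. Make a request to www.amazon.com with a specific User-Agent: Mozilla ...
Get the JSON data with the XPath selector:
//span[@id='nav-global-location-data-modal-action']/@data-a-modal
Example of the JSON returned by this selector: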
{
    "width": 375,
    "closeButton": "false",
    "popoverLabel": "Choose your location",
    "ajaxHeaders": {
        "anti-csrftoken-a2z": "ajaxHeaders >> anti-csrftoken-a2z"
    },
    "name": "glow-modal",
    "url": "/gp/glow/get-address-selections.html?deviceType=desktop&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal",
    "footer": "<span class=\"a-declarative\" data-action=\"a-popover-close\" data-a-popover-close=\"{}\"><span class=\"a-button a-button-primary\"><span class=\"a-button-inner\"><button name=\"glowDoneButton\" class=\"a-button-text\" type=\"button\">Done</button></span></span></span>",
    "header": "Choose your location"
}
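As a rough illustration of this extraction step, the sketch below pulls the token out of the data-a-modal attribute using only the standard library instead of an XPath engine. The helper names (ModalAttributeParser, extract_ajax_token) and the sample HTML fragment are assumptions made up for the example; the real page markup may differ.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class ModalAttributeParser(HTMLParser):
    """Hypothetical helper: remember the data-a-modal attribute of the location span."""

    def __init__(self) -> None:
        super().__init__()
        self.modal_json: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "span" and attributes.get("id") == "nav-global-location-data-modal-action":
            # HTMLParser has already unescaped entities such as &quot; here.
            self.modal_json = attributes.get("data-a-modal")


def extract_ajax_token(html: str) -> str:
    """Return ajaxHeaders -> anti-csrftoken-a2z from the span's JSON attribute."""
    parser = ModalAttributeParser()
    parser.feed(html)
    if not parser.modal_json:
        raise ValueError("location span not found")
    return json.loads(parser.modal_json)["ajaxHeaders"]["anti-csrftoken-a2z"]


# Made-up page fragment for illustration only; the token is a placeholder.
SAMPLE_HTML = (
    '<span id="nav-global-location-data-modal-action" '
    'data-a-modal="{&quot;name&quot;:&quot;glow-modal&quot;,'
    '&quot;ajaxHeaders&quot;:{&quot;anti-csrftoken-a2z&quot;:&quot;TOKEN-PLACEHOLDER&quot;}}">'
    "</span>"
)

print(extract_ajax_token(SAMPLE_HTML))
```

With parsel or Scrapy available, the XPath selector shown earlier does the same job in one line; the standard-library version is only meant to make the two stages (read the attribute, then parse it as JSON) explicit.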
2. Build the headers for the next request:
headers = {
    "anti-csrftoken-a2z": "gMDCYRgjYFVWvjfmU70/qMURqYh7kAko11WlenYAAAAMAAAAAGGokFZyYXcAAAAA",
    "user-agent": "Mozilla ..."
}
3. Make a request to https://www.amazon.com/gp/glow/get-address-selections.html?deviceType=desktop&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal using the headers from step 2 and the response cookies from step 1.
4. Extract CSRF_TOKEN from the response with the regular expression:
'CSRF_TOKEN : "(.+?)"'
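To make the regular-expression step concrete, here is a minimal sketch that applies the same pattern with Python's re module. The function name extract_csrf_token and the sample body are assumptions for illustration; the real response is a larger HTML/JavaScript document.

```python
import re

# Same pattern as in the step above: capture whatever sits between the quotes.
CSRF_TOKEN_PATTERN = re.compile(r'CSRF_TOKEN : "(.+?)"')


def extract_csrf_token(body: str) -> str:
    """Return the first CSRF_TOKEN value found in the response body."""
    match = CSRF_TOKEN_PATTERN.search(body)
    if match is None:
        raise ValueError("CSRF token not found")
    return match.group(1)


# Made-up snippet standing in for the script content of the real response.
SAMPLE_BODY = 'var config = { CSRF_TOKEN : "abc123/XYZ==", other : "value" };'

print(extract_csrf_token(SAMPLE_BODY))
```

The non-greedy `(.+?)` matters: it stops at the first closing quote, so later quoted values in the same script are not swallowed into the token.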
5. Build the headers for the next request:
headers = {
    "anti-csrftoken-a2z": "CSRF token from step 4",
    "user-agent": "Mozilla ..."
}
6. Make a POST request to https://www.amazon.com/gp/delivery/ajax/address-change.html with the form data:
{
    "locationType": "LOCATION_INPUT",
    "zipCode": "zip-code",
    "storeContext": "generic",
    "deviceType": "web",
    "pageType": "Gateway",
    "actionSource": "glow",
    "almBrandId": "undefined",
}
using the headers from step 5 and the response cookies from step 3.
If everything is fine, you should get a response like this:
{
    'isValidAddress': 1,
    'isTransitOutOfAis': 0,
    'address': {
        'locationType': 'LOCATION_INPUT', 'district': None,
        'zipCode': '30322', 'addressId': None, 'isDefaultShippingAddress': 'false',
        'obfuscatedId': None, 'isAccountAddress': 'false', 'state': 'GA',
        'countryCode': 'US', 'addressLabel': None,
        'city': 'ATLANTA', 'addressLine1': None,
    },
    'sembuUpdated': 1
}
7. Save the response cookies from step 6 and use them for further requests.
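How the cookies are stored is up to you. As one possible approach, the sketch below keeps them in a JSON file and rebuilds a Cookie header from it. The helper names, the file name and the cookie values are all placeholders invented for this example.

```python
import json
import tempfile
from pathlib import Path


def save_cookies(cookies: dict, path: Path) -> None:
    """Write the cookie name/value pairs to a JSON file."""
    path.write_text(json.dumps(cookies), encoding="utf-8")


def load_cookies(path: Path) -> dict:
    """Read the cookie name/value pairs back from the JSON file."""
    return json.loads(path.read_text(encoding="utf-8"))


def cookie_header(cookies: dict) -> str:
    """Format the pairs as the value of a Cookie request header."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Placeholder values; real ones come from the address-change response.
SAMPLE_COOKIES = {"session-id": "000-0000000-0000000", "ubid-main": "placeholder"}

with tempfile.TemporaryDirectory() as folder:
    cookie_file = Path(folder) / "amazon_cookies.json"
    save_cookies(SAMPLE_COOKIES, cookie_file)
    restored = load_cookies(cookie_file)

print(cookie_header(restored))
```

A dictionary loaded this way can also be passed directly as the `cookies=` argument of `requests.get`, which avoids building the header by hand.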
A Python script with all of the logic:
import json
from typing import Optional

import requests
from parsel import Selector

AMAZON_US_URL = "https://www.amazon.com/"
AMAZON_ADDRESS_CHANGE_URL = (
    "https://www.amazon.com/gp/delivery/ajax/address-change.html"
)
AMAZON_CSRF_TOKEN_URL = (
    "https://www.amazon.com/gp/glow/get-address-selections.html?deviceType=desktop"
    "&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal"
)
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 ..."
)
DEFAULT_REQUEST_HEADERS = {"Accept-Language": "en", "User-Agent": DEFAULT_USER_AGENT}


def get_amazon_content(start_url: str, cookies: Optional[dict] = None) -> tuple:
    # Fetch a page and return its parsed content together with the response cookies.
    response = requests.get(
        url=start_url, headers=DEFAULT_REQUEST_HEADERS, cookies=cookies
    )
    response.raise_for_status()
    return Selector(text=response.text), response.cookies


def get_ajax_token(content: Selector):
    # Step 1: the token is stored as JSON in the data-a-modal attribute.
    data = content.xpath(
        "//span[@id='nav-global-location-data-modal-action']/@data-a-modal"
    ).get()
    if not data:
        raise ValueError("Invalid page content")
    json_data = json.loads(data)
    return json_data["ajaxHeaders"]["anti-csrftoken-a2z"]


def get_session_id(content: Selector):
    session_id = content.re_first(r'session: \{id: "(.+?)"')
    if not session_id:
        raise ValueError("Session id not found")
    return session_id


def get_token(content: Selector):
    # Step 4: extract CSRF_TOKEN from the address-selections response.
    csrf_token = content.re_first(r'CSRF_TOKEN : "(.+?)"')
    if not csrf_token:
        raise ValueError("CSRF token not found")
    return csrf_token


def send_change_location_request(zip_code: str, headers: dict, cookies: dict):
    # Step 6: POST the new ZIP code together with the CSRF token header.
    response = requests.post(
        url=AMAZON_ADDRESS_CHANGE_URL,
        data={
            "locationType": "LOCATION_INPUT",
            "zipCode": zip_code,
            "storeContext": "generic",
            "deviceType": "web",
            "pageType": "Gateway",
            "actionSource": "glow",
            "almBrandId": "undefined",
        },
        headers=headers,
        cookies=cookies,
    )
    assert response.json()["isValidAddress"], "Invalid change response"
    return response.cookies


def get_session_cookies(zip_code: str):
    # Steps 1-2: load the base page and build headers with the ajax token.
    response = requests.get(url=AMAZON_US_URL, headers=DEFAULT_REQUEST_HEADERS)
    content = Selector(text=response.text)
    headers = {
        "anti-csrftoken-a2z": get_ajax_token(content=content),
        "user-agent": DEFAULT_USER_AGENT,
    }

    # Steps 3-5: request the address-selections page and build headers with the CSRF token.
    response = requests.get(
        url=AMAZON_CSRF_TOKEN_URL, headers=headers, cookies=response.cookies
    )
    content = Selector(text=response.text)
    headers = {
        "anti-csrftoken-a2z": get_token(content=content),
        "user-agent": DEFAULT_USER_AGENT,
    }

    # Step 6: change the location and keep the cookies of that response.
    location_cookies = send_change_location_request(
        zip_code=zip_code, headers=headers, cookies=dict(response.cookies)
    )

    # Verify that location changed correctly.
    response = requests.get(
        url=AMAZON_US_URL, headers=DEFAULT_REQUEST_HEADERS, cookies=response.cookies
    )
    content = Selector(text=response.text)
    location_label = content.css("span#glow-ingress-line2::text").get().strip()
    assert zip_code in location_label

    # Step 7: return the step 6 cookies so they can be reused for further requests.
    return location_cookies


if __name__ == "__main__":
    get_session_cookies(zip_code="30322")
Also, similar logic using the Scrapy framework:
from http.cookies import SimpleCookie

from scrapy import FormRequest, Request, Spider
from scrapy.http import HtmlResponse


class AmazonSessionSpider(Spider):
    """
    Amazon spider for extracting location cookies.
    """

    name = "amazon.com:location-session"
    address_change_endpoint = "/gp/delivery/ajax/address-change.html"
    csrf_token_endpoint = (
        "/gp/glow/get-address-selections.html?deviceType=desktop"
        "&pageType=Gateway&storeContext=NoStoreName&actionSource=desktop-modal"
    )
    countries_base_urls = {
        "US": "https://www.amazon.com",
        "GB": "https://www.amazon.co.uk",
        "DE": "https://www.amazon.de",
        "ES": "https://www.amazon.es",
    }
    default_headers = {
        "sec-fetch-site": "none",
        "sec-fetch-dest": "document",
        "accept-language": "ru-RU,ru;q=0.9",
        "connection": "close",
    }

    def __init__(self, country: str, zip_code: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.country = country
        self.zip_code = zip_code

    def start_requests(self):
        """
        Make start request to main Amazon country page.
        """
        request = Request(
            url=self.countries_base_urls[self.country],
            headers=self.default_headers,
            callback=self.parse_ajax_token,
        )
        yield request

    def parse_ajax_token(self, response: HtmlResponse):
        """
        Parse ajax token from response.
        """
        yield response.request.replace(
            url=self.countries_base_urls[self.country] + self.csrf_token_endpoint,
            headers={
                "anti-csrftoken-a2z": self.get_ajax_token(response=response),
                **self.default_headers,
            },
            callback=self.parse_csrf_token,
        )

    def parse_csrf_token(self, response: HtmlResponse):
        """
        Parse CSRF token from response and make request to change Amazon location.
        """
        yield FormRequest(
            method="POST",
            url=self.countries_base_urls[self.country] + self.address_change_endpoint,
            formdata={
                "locationType": "LOCATION_INPUT",
                "zipCode": self.zip_code,
                "storeContext": "generic",
                "deviceType": "web",
                "pageType": "Gateway",
                "actionSource": "glow",
                "almBrandId": "undefined",
            },
            headers={
                "anti-csrftoken-a2z": self.get_csrf_token(response=response),
                **self.default_headers,
            },
            callback=self.parse_session_cookies,
        )

    def parse_session_cookies(self, response: HtmlResponse) -> dict:
        """
        Return cookies dict if location changed successfully.
        """
        json_data = response.json()
        if not json_data.get("isValidAddress"):
            return {}
        return self.extract_response_cookies(response=response)

    @staticmethod
    def get_ajax_token(response: HtmlResponse) -> str:
        """
        Extract ajax token from response.
        """
        data = response.xpath("//input[@id='glowValidationToken']/@value").get()
        if not data:
            raise ValueError("Invalid page content")
        return data

    @staticmethod
    def get_csrf_token(response: HtmlResponse) -> str:
        """
        Extract CSRF token from response.
        """
        csrf_token = response.css("script").re_first(r'CSRF_TOKEN : "(.+?)"')
        if not csrf_token:
            raise ValueError("CSRF token not found")
        return csrf_token

    @staticmethod
    def extract_response_cookies(response: HtmlResponse) -> dict:
        """
        Extract cookies from response object
        and return it in valid format.
        """
        cookies = {}
        cookie_headers = response.headers.getlist("Set-Cookie", [])
        for cookie_str in cookie_headers:
            cookie = SimpleCookie()
            cookie.load(cookie_str.decode("utf-8"))
            for key, raw_value in cookie.items():
                cookies[key] = raw_value.value
        return cookies
Shell command:
scrapy crawl amazon.com:location-session -a country=US -a zip_code=30332
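The extract_response_cookies helper in the spider relies on http.cookies.SimpleCookie to turn raw Set-Cookie header values into name/value pairs. The standalone sketch below shows that parsing step without Scrapy; the function name and the header values are placeholders invented for the example.

```python
from http.cookies import SimpleCookie


def cookies_from_set_cookie_headers(raw_headers: list) -> dict:
    """Parse raw Set-Cookie header values (bytes) into a name -> value dict."""
    cookies = {}
    for raw in raw_headers:
        jar = SimpleCookie()
        jar.load(raw.decode("utf-8"))
        # Attributes such as Domain or Path become morsel metadata, not cookies.
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies


# Placeholder headers shaped like the byte strings Scrapy's getlist() returns.
SAMPLE_HEADERS = [
    b"session-id=000-0000000-0000000; Domain=.amazon.com; Path=/; Secure",
    b"ubid-main=placeholder-value; Domain=.amazon.com; Path=/; Secure",
]

print(cookies_from_set_cookie_headers(SAMPLE_HEADERS))
```

Each header gets its own SimpleCookie instance so that one malformed header cannot affect how the others are parsed.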