Scraping store locations from a website that has no clear way of organizing its URLs
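I have been scraping location data for different retailers across Canada for some research I am doing. The work is meant to help understand how some industries have been affected by COVID. So far the store locator pages have been quite simple: there is a single link, and I can substitute variables such as lng and lat in Python. However, I have run into a website where I cannot figure out how location data is requested. The retailer is LCBO. The store locator is a small option at the bottom of every page, and when a location is entered the results appear as an overlay on the original page. Here is the LCBO link: https://www.lcbo.com/webapp/wcs/stores/servlet/en/lcbo
If anyone has suggestions on how I can manipulate the link that is used for locations, as seen in the Chrome Network tab, that would be great. This seems to be the hardest store locator I have come across after doing several large retailers, so any advice would be welcome.
What I have tried:
I used Postman and sent the cURL request copied for this link: https://www.lcbo.com/webapp/wcs/stores/servlet/AjaxStoreLocatorResultsView?catalogId=10051&langId=-1&storeId=10203&orderId=
In Postman I tried playing with langId and storeId. However, my edited requests did not work, and even when I changed the numbers in the parameters I did not get any new information. Maybe I did not enter sensible numbers, but when I added 1 to everything, nothing new happened. Maybe my cURL link is a bad one and I am missing a URL that is a better way to request locations?
Side note: I hope to write a big post here on how to scrape store locations, since I am a GIS student and have found plenty of examples of other kinds of scraping but few that are location-based. But one question at a time.
Thanks!
The code below is how I have approached this problem before!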
import csv

import requests

rows = []
# Request result pages 0-99 for a search centred on Quebec City.
for page in range(0, 100):
    url = f"https://www.couche-tard.com/stores_new.php?lat=46.8257&lng=-71.2349&services=&region=quebec&page={page}"
    headers = {
        'Connection': 'keep-alive',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'Accept': '*/*',
        'X-Requested-With': 'XMLHttpRequest',
        'sec-ch-ua-mobile': '?0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Dest': 'empty',
        'Referer': 'https://www.couche-tard.com/trouvez-votre-magasin?address=Qu%C3%A9bec,Quebec,Canada&lat=46.8257&lng=-71.2349',
        'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
        'dnt': '1'
    }
    response = requests.get(url, headers=headers)
    stores = response.json()
    # Each page returns its stores as a mapping under the 'stores' key.
    for store in stores['stores'].values():
        rows.append([store["address"], store["city"], store["display_brand"]])

# newline='' stops the csv module from writing blank rows on Windows.
with open('couche.csv', mode='w', encoding='utf-8', newline='') as csv_file:
    writer = csv.writer(csv_file, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow([
        "address",
        "city",
        "display_brand",
    ])
    writer.writerows(rows)
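As a small illustration of the "variables such as lng and lat" idea, the URL used in the loop above can be assembled from parameters instead of being hard-coded. This is only a sketch: the helper name `stores_url` is made up for this example, and whether the endpoint returns useful results for other coordinates or regions is an assumption that has not been checked here.

```python
from urllib.parse import urlencode

STORES_URL = "https://www.couche-tard.com/stores_new.php"


def stores_url(lat, lng, region, page):
    """Return the store-list URL for one search centre and result page.

    Hypothetical helper: the parameter names are taken from the URL in the
    question; the function itself is not part of any library.
    """
    params = {
        "lat": lat,
        "lng": lng,
        "services": "",  # left empty, as in the original URL
        "region": region,
        "page": page,
    }
    return f"{STORES_URL}?{urlencode(params)}"


print(stores_url(46.8257, -71.2349, "quebec", 0))
```

Building the query string with `urlencode` also takes care of escaping, which helps if a region or service name ever contains spaces or accented characters.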
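I was able to get a response in Postman by following these steps:
- Use the store locator on the website with the Network tab open
- You will see an AJAX call when you search for a location
- Right-click the request and choose Copy --> Copy as cURL
- Open Postman and click Import
- Select Raw text and paste the cURL code
- Run the request in Postman
- Success
More information on copying and pasting requests can be found here: https://dev.to/stuartcreed/how-to-copy-a-http-request-to-from-the-network-taboo-postman-5835
cURL for a random request: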
curl 'https://www.lcbo.com/webapp/wcs/stores/servlet/AjaxStoreLocatorResultsView?catalogId=10051&langId=-1&storeId=10203&orderId=' \
-H 'authority: www.lcbo.com' \
-H 'sec-ch-ua: " Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"' \
-H 'accept: */*' \
-H 'x-requested-with: XMLHttpRequest' \
-H 'sec-ch-ua-mobile: ?1' \
-H 'user-agent: Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Mobile Safari/537.36' \
-H 'content-type: application/x-www-form-urlencoded; charset=UTF-8' \
-H 'origin: https://www.lcbo.com' \
-H 'sec-fetch-site: same-origin' \
-H 'sec-fetch-mode: cors' \
-H 'sec-fetch-dest: empty' \
-H 'referer: https://www.lcbo.com/webapp/wcs/stores/servlet/en/lcbo' \
-H 'accept-language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7,es;q=0.6,de;q=0.5,zh-TW;q=0.4,zh;q=0.3,fr;q=0.2' \
-H 'cookie: JSESSIONID=0000vBbIDBjsUM58mEmmzaCYUqn:ecomapp02; bv_sha=2019082000; BVBRANDID=50d350fa-9bf3-450a-8c14-c3d17414d578; BVBRANDSID=5034fc5b-874c-4b47-b20f-5fbb0efc4d8b; s_fid=5A14CCAB963E6B8B-26111870A9508B76; gpv_pagename=homepage; s_cc=true; kampyle_userid=4c09-32a1-9cad-0c04-c9a7-725a-8f95-67b0; __z_a=748269966287106654528710; languagepopupshown=true; lang=en; WC_latitude=52.3501568; WC_longitude=4.908646399999999; WC_SESSION_ESTABLISHED=true; WC_PERSISTENT=XgKKFMQkD5auLS4bGLnHygfywCvZNo4pPATxjd9d1dc%3D%3B2021-07-15+02%3A00%3A28.398_1626328799364-193523_10203_-1002%2C-1%2CCAD%2CYKP40yGPqEEbpHLRIgbiN7BNgljfv5jFfuiUZ5cHWuJ0yqBdGfRFN3lxsaSNTA106UezjIyXERst27mEguaTKA%3D%3D_10203; WC_AUTHENTICATION_-1002=-1002%2CetYHJunN2fPHnqWYb6iXk2V4thGGKMCP7yRvypmsoac%3D; WC_ACTIVEPOINTER=-1%2C10203; WC_stCity=toronto; WC_USERACTIVITY_-1002=-1002%2C10203%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2C677837368%2CwuVMSWYcMPzSz0F5IG1h%2BK7SE3Ia5bb%2FmnWOhA1lwPFp78OQHUHJt3TMpAt%2BwkUbTggLMzFB75302vvSErzuY9QSsvEEUY%2FpKGkgAoIvKsy1KOxeh2J1Pxvx%2FURbA8RXNwI4KUUPl%2FyqCRJ6tkc0K%2F7mQ9I%2FT77nJ2cgeI9SYTKqp2xIopLduH5RLIkhCkSBxLLBGagA6EzayVj7%2FFkfDEYTl7Q84F4iiuTzLpld5HxiMDm32qKiizMSJdJDmcE1; 
WC_GENERIC_ACTIVITYDATA=[5248871728%3Atrue%3Afalse%3A0%3ABUQiX0jAISFBFLIbw5p%2FJHNVr%2FS4bfEq%2BkV%2FUnDS1NY%3D][com.ibm.commerce.context.ExternalCartContext|null][com.ibm.commerce.context.entitlement.EntitlementContext|10502%2610502%26null%26-2000%26null%26null%26null][com.ibm.commerce.store.facade.server.context.StoreGeoCodeContext|null%26null%26null%26null%26null%26null][com.ibm.commerce.catalog.businesscontext.CatalogContext|10051%26null%26false%26false%26false][CTXSETNAME|Store][com.ibm.commerce.context.base.BaseContext|10203%26-1002%26-1002%26-1][com.ibm.commerce.context.audit.AuditContext|1626328799364-193523][com.lcbo.lco.commerce.context.LCOContext|false][com.ibm.commerce.context.experiment.ExperimentContext|null][com.ibm.commerce.giftcenter.context.GiftCenterContext|null%26null%26null][com.ibm.commerce.context.globalization.GlobalizationContext|-1%26CAD%26-1%26CAD]; __zjc2339=5109266816; QueueITAccepted-SDFrts345E-V3_2020pandemic=EventId%3D2020pandemic%26QueueId%3D00000000-0000-0000-0000-000000000000%26RedirectType%3Ddisabled%26IssueTime%3D1626330141%26Hash%3D7657f416295433996cce6c372ea650f6be4cecb9b71bee059038766af32fa1b7; WC_CartOrderId_10203=; kampyleUserSession=1626330141718; kampyleUserSessionsCount=2; kampyleSessionPageCounter=1; kampyleUserPercentile=25.946945797403643; cd_user_id=17aa8d494936a8-0c00c250f0a3b5-34657601-232800-17aa8d49494866; pageLoadAverage=5%3A32; s_getNewRepeat=1626330158348-New; s_sq=lcboprod%3D%2526pid%253Dhomepage%2526pidt%253D1%2526oid%253D%25250A%252509%252509%252509%252509%252509%25250A%252509%252509%252509%252509%2526oidt%253D3%2526ot%253DSUBMIT' \
--data-raw 'fromPage=CategoryPage&features=&citypostalcode=toronto&latitude=&longitude=&productId=&ageChecked=%5Bobject+SubmitEvent%5D&isFavoriteStore=false&requesttype=ajax' \
--compressed
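The copied command shows that the search term travels in the POST body (`citypostalcode=toronto`) rather than in the query string, which may be why changing langId and storeId returned nothing new. Below is a minimal Python sketch of the same request. It uses only the standard library so that it runs without extra packages; the same fields could be passed to `requests.post(url, data=form, headers=headers)`. The function name `build_locator_request` is made up for this example. Leaving out the cookie header is an untested assumption, so the server may still require a valid session.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://www.lcbo.com/webapp/wcs/stores/servlet/AjaxStoreLocatorResultsView"

# Query-string parameters copied unchanged from the captured request.
QUERY = {"catalogId": "10051", "langId": "-1", "storeId": "10203", "orderId": ""}


def build_locator_request(city_or_postal_code):
    """Build (but do not send) the store-locator POST for one search term."""
    # Form fields copied from --data-raw; only citypostalcode is varied.
    form = {
        "fromPage": "CategoryPage",
        "features": "",
        "citypostalcode": city_or_postal_code,
        "latitude": "",
        "longitude": "",
        "productId": "",
        "ageChecked": "[object SubmitEvent]",
        "isFavoriteStore": "false",
        "requesttype": "ajax",
    }
    headers = {
        "Accept": "*/*",
        "X-Requested-With": "XMLHttpRequest",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "Origin": "https://www.lcbo.com",
        "Referer": "https://www.lcbo.com/webapp/wcs/stores/servlet/en/lcbo",
        # Assumption: cookies are omitted here; add them if the server rejects the call.
    }
    return Request(
        f"{BASE_URL}?{urlencode(QUERY)}",
        data=urlencode(form).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_locator_request("ottawa")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually send it (network access required, response format not verified):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     body = response.read().decode("utf-8")
```

Looping over a list of city names or postal codes and calling `build_locator_request` for each would then mirror the page loop used for the other retailers.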