Why can't I get the complete 'href' shown in the browser from noon.com?
This is what I am doing:
import requests
from requests.adapters import HTTPAdapter
from bs4 import BeautifulSoup
HEADERS = {
'authority': 'www.noon.com',
'pragma': 'no-cache',
'cache-control': 'no-cache',
'dnt': '1',
'upgrade-insecure-requests': '1',
'accept': '*/*',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
'sec-fetch-site': 'none',
'sec-fetch-mode': 'navigate',
'sec-fetch-dest': 'document'
}
response = requests.get('https://www.noon.com/uae-en/electronics-and-mobiles/mobiles-and-accessories/mobiles-20905',headers=HEADERS,stream=True)
soup = BeautifulSoup(response.content,'lxml')
results = soup.find_all("div", {"class" : "productContainer"})
result = results[0]
print("https://www.noon.com" + result.a.get('href'))
Output:
https://www.noon.com/uae-en
But the expected output is 'https://www.noon.com/uae-en/product/N35521717A/p?o=f885efe0b6534e9f'.
In the browser I can see:
<div class="productContainer"><a class="sc-7vj7do-0 ftlAjW" href="/uae-en/product/N35521717A/p?o=f885efe0b6534e9f" id="productBox-N35521717A"><div class="kcs0h5-0 diNcmV grid" title="Samsung Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE "><div class="e3js0d-1 efqIDW"><div class="productImage" data-qa-id="productImagePLP_Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE "><div class="lazyload-wrapper"><div class="puv25r-0 hfEfTS"><div class="puv25r-2 hJKuPa"><img alt="Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE " src="https://a.nooncdn.com/t_desktop-pdp-v1/v1605814225/N35521717A_1.jpg"/></div></div></div></div><div class="e3js0d-2 dqjnoR"><div class="tagContainer"></div></div></div><div class="e3js0d-6 iKEZJh"><div class="e3js0d-7 jULUCI"><div class="e3js0d-10 cyUANN"><span class="e3js0d-11 gXshOX">Samsung</span>Galaxy M31 Dual SIM Blue 6GB RAM 128GB 4G LTE </div></div><div class="e3js0d-8 jtiosv"><div class="sc-3751lm-0 hSumnU"><div class="sc-3751lm-1 eUJkVt large"><span class="currency">AED</span><strong>819.00</strong></div><div class="sc-3751lm-2 kWnsOk"><span class="oldPrice">AED<!-- --> <!-- -->859</span></div></div></div><div class="e3js0d-9 kDpjlW"><div class="e3js0d-12 gMFqig"><div class="u8zs36-0 kRPdZJ"><img alt="noon-express" height="20px" src="https://a.nooncdn.com/s/app/com/noon/images/fulfilment_express-en.png" width="80px"/></div></div></div></div></div></a></div>
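What happens and how to reproduce it
The website appears to generate this content dynamically.
1. Open the website in a browser.
2. Open the page source (Ctrl+U).
3. Search for class="productContainer".
4. You will see that the href of the <a> only contains /uae-en; this is what requests gets.
5. Open the inspector (Ctrl+Shift+I).
6. Inspect the same <a>.
7. You will find the rest of the href, which is added dynamically and which selenium can retrieve.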
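The manual comparison in the steps above can also be scripted. The following is only a sketch using the standard library's html.parser; the class name ProductLinkParser, the helper product_hrefs, and the two trimmed-down HTML strings are made up for illustration (the strings stand in for the page source and for the rendered DOM copied from the inspector).

```python
from html.parser import HTMLParser


class ProductLinkParser(HTMLParser):
    """Collect the href of every <a> found inside a div.productContainer."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # <div> nesting depth inside a productContainer (0 = outside)
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Start counting at the container, then track every nested <div>.
            if self.depth or "productContainer" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def product_hrefs(html):
    """Return the product link hrefs found in an HTML string."""
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.hrefs


# Hypothetical, shortened snippets: what requests receives vs. what the inspector shows.
page_source = '<div class="productContainer"><a href="/uae-en"><div class="grid"></div></a></div>'
rendered_dom = (
    '<div class="productContainer">'
    '<a href="/uae-en/product/N35521717A/p?o=f885efe0b6534e9f"><div class="grid"></div></a>'
    '</div>'
)

print(product_hrefs(page_source))
print(product_hrefs(rendered_dom))
```

If the first list only contains /uae-en while the second contains the full product path, the missing part is being filled in by JavaScript after the page loads.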
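Minimal example with selenium:
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

# Raw string so the backslashes in the Windows path are not read as escapes.
# Selenium 4 style; with Selenium 3, pass the path directly to webdriver.Chrome().
browser = webdriver.Chrome(service=Service(r'C:\Program Files\ChromeDriver\chromedriver.exe'))
browser.get('https://www.noon.com/uae-en/electronics-and-mobiles/mobiles-and-accessories/mobiles-20905')
time.sleep(3)  # crude wait for the JavaScript to fill in the product links
# find_elements returns every product link, one per line of output
for element in browser.find_elements(By.XPATH, "//div[contains(@class, 'productContainer')]/a"):
    ActionChains(browser).move_to_element(element).perform()
    print(element.get_attribute('href'))
browser.quit()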
Output:
https://www.noon.com/uae-en/product/N35521717A/p?o=f885efe0b6534e9f
https://www.noon.com/uae-en/product/N41247213A/p?o=ca38c8921770ea2a
https://www.noon.com/uae-en/product/N41247235A/p?o=c97b8bfdc0114cba
https://www.noon.com/uae-en/product/N39790555A/p?o=d7354e20a0bb00ad
https://www.noon.com/uae-en/product/N32046052A/p?o=faea2e69f38bbf6a
...
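Edit
You won't get that information by scraping the page source with requests, but there is another way.
You can use the API together with requests and build the link yourself (a simple example that you can customize):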
import requests

# JSON endpoint for the same category listing
url = "https://www.noon.com/_svc/catalog/api/u/electronics-and-mobiles/mobiles-and-accessories/mobiles-20905"
headers = {
    "user-agent": "Mozilla/5.0"
}
response = requests.get(url, headers=headers)
response.raise_for_status()
records = response.json()["hits"]
for record in records:
    offer_code = record["offer_code"]
    sku = record["sku"]
    slug = record["url"]  # renamed so the request URL above is not overwritten
    print(f"https://www.noon.com/uae-en/{slug}/{sku}/p?o={offer_code}")
Output:
https://www.noon.com/uae-en/galaxy-m31-dual-sim-blue-6gb-ram-128gb-4g-lte/N35521717A/p?o=f885efe0b6534e9f
https://www.noon.com/uae-en/iphone-12-pro-max-with-facetime-128gb-pacific-blue-5g-international-specs/N41247213A/p?o=ca38c8921770ea2a
https://www.noon.com/uae-en/iphone-12-pro-with-facetime-256gb-pacific-blue-5g-international-specs/N41247235A/p?o=cfab59c09cab747b
...
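If you want to reuse the link building, it can be pulled into a small helper. This is a sketch, not part of the site's API: the function name build_product_url and its short flag are my own, and the sample record only contains the three keys used above, with values taken from the first line of each output listing. The short form mirrors the /product/<sku>/p links that the browser shows.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.noon.com/uae-en"


def build_product_url(record, short=False):
    """Build a product link from one entry of the API's "hits" list.

    short=True produces the /product/<sku>/p form seen in the browser;
    otherwise the record's "url" slug is used, as in the example above.
    """
    slug = "product" if short else quote(record["url"])
    query = urlencode({"o": record["offer_code"]})
    return f"{BASE_URL}/{slug}/{quote(record['sku'])}/p?{query}"


# Sample record assembled from the values shown in the outputs above;
# real records from the API carry more fields than these three.
sample = {
    "url": "galaxy-m31-dual-sim-blue-6gb-ram-128gb-4g-lte",
    "sku": "N35521717A",
    "offer_code": "f885efe0b6534e9f",
}

print(build_product_url(sample))
print(build_product_url(sample, short=True))
```

Using quote and urlencode keeps the link valid even if a slug or offer code ever contains characters that need escaping.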