Code not extracting all product URLs on a page

Question

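The code below is meant to extract all product URLs from each page in a list of pages. The site I am scraping is JavaScript-driven. My code works perfectly on all of the site's other product category pages.

However, on this page it extracts only 36 products, which is the number of products loaded onto the page. The pages variable is a list because I tried to extract the product URLs by iterating over all the pages, like this: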

pages = ['https://www.mrphome.com/en_za/shop/kitchen-dining/shop-dining/table-linen', 'https://www.mrphome.com/en_za/shop/kitchen-dining/shop-dining/table-linen?p=2-', 'https://www.mrphome.com/en_za/shop/kitchen-dining/shop-dining/table-linen?p=3-', 'https://www.mrphome.com/en_za/shop/kitchen-dining/shop-dining/table-linen?p=4-', 'https://www.mrphome.com/en_za/shop/kitchen-dining/shop-dining/table-linen?p=5-']

However, if I run the code like this, it still returns a list with 36 items.

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time

baseurl = "https://www.mrphome.com/"

headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}

produrlslug = []

pages = ['https://www.mrphome.com/en_za/shop/kitchen-dining/shop-dining/table-linen']
for page in pages:
    content = requests.get(page, headers=headers)
    soup = BeautifulSoup(content.content, "lxml")
    url = soup.findAll('a', class_='product-image quickview-enabled')

for item in url:
    produrlslug.append(item['href'])
print(len(produrlslug))

Any help would be appreciated.

Answer 1

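Indenting the second for loop, the one that iterates over url, solves the problem. Left outside the page loop, it runs only once, after the last page has been fetched, so only that page's links are appended.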

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import time

baseurl = "https://www.mrphome.com/"

headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
}

produrlslug = []

pages = ['https://www.mrphome.com/en_za/shop/kitchen-dining/shop-dining/table-linen']
for page in pages:
    content = requests.get(page, headers=headers)
    soup = BeautifulSoup(content.content, "lxml")
    url = soup.findAll('a', class_='product-image quickview-enabled')
    # this loop was not indented in the question; nested here, it appends the links of every page
    for item in url:
        produrlslug.append(item['href'])
print(len(produrlslug))
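
To see the effect of the indentation without hitting the site, here is a minimal sketch that swaps the HTTP requests for stand-in data. The names FAKE_PAGES, fetch_links, collect_unindented and collect_indented are hypothetical and exist only for this illustration; fetch_links plays the role of requests.get followed by soup.findAll.

```python
# Stand-in data (hypothetical): hrefs per page, used instead of live HTTP requests.
FAKE_PAGES = {
    "page-1": ["/p/a", "/p/b", "/p/c"],
    "page-2": ["/p/d", "/p/e", "/p/f"],
    "page-3": ["/p/g", "/p/h"],
}


def fetch_links(page):
    """Hypothetical stand-in for requests.get + soup.findAll on one page."""
    return FAKE_PAGES[page]


def collect_unindented(pages):
    """Mirrors the question: the append loop runs once, after the page loop."""
    collected = []
    links = []
    for page in pages:
        links = fetch_links(page)  # overwritten on every iteration
    for link in links:  # only sees the links of the last page
        collected.append(link)
    return collected


def collect_indented(pages):
    """Mirrors the answer: the append loop runs once per page."""
    collected = []
    for page in pages:
        links = fetch_links(page)
        for link in links:
            collected.append(link)
    return collected


pages = list(FAKE_PAGES)
print(len(collect_unindented(pages)))  # 2, the last page only
print(len(collect_indented(pages)))    # 8, every page
```

With the real scraper, the same pattern applies: whatever is assigned to url inside the page loop is replaced on the next iteration, so anything that should happen for every page has to sit inside that loop.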

python

beautifulsoup

request

web-scraping