I can't get values page by page with a for-in loop
As the title says, I can only get values from the first page; I can't step through the pages with a for-in loop.
I've checked my code, but I'm still confused by it. How can I get these values on every page?
# Imports Required
!pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import requests
from bs4 import BeautifulSoup

browser = webdriver.Chrome(executable_path='./chromedriver.exe')
wait = WebDriverWait(browser, 5)
output = list()

for i in range(1, 2):
    browser.get("https://www.rakuten.com.tw/shop/watsons/product/?l-id=tw_shop_inshop_cat&p={}".format(i))
    # Wait until the products appear
    wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='b-content b-fix-2lines']")))
    # Get the product links
    product_links = browser.find_elements(By.XPATH, "//div[@class='b-content b-fix-2lines']/b/a")
    # Iterate over 'product_links' to get all the 'href' values
    for link in product_links:
        print(link.get_attribute('href'))
        browser.get(link.get_attribute('href'))
        soup = BeautifulSoup(browser.page_source)
        products = []
        product = {}
        product['商品名稱'] = soup.find('div', class_="b-subarea b-layout-right shop-item ng-scope").h1.text.replace('\n', '')
        product['價錢'] = soup.find('strong', class_="b-text-xlarge qa-product-actualPrice").text.replace('\n', '')
        all_data = soup.find_all("div", class_="b-container-child")[2]
        main_data = all_data.find_all("span")[-1]
        product['購買次數'] = main_data.text
        products.append(product)
        print(products)
product_links = browser.find_elements(By.XPATH, "//div[@class='b-content b-fix-2lines']/b/a")
# Iterate over 'product_links' to get all the 'href' values
for link in product_links:
    print(link.get_attribute('href'))
    browser.get(link.get_attribute('href'))
The problem is that calling browser.get() invalidates the HTML elements that product_links refers to, because they no longer exist on the current page. You should collect all the 'href' attributes into a list first. One way to do that is a list comprehension:

links = [link.get_attribute('href') for link in product_links]

Now you can iterate over the strings in links to load each new page.
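To see why copying the strings out first matters, here is a minimal stdlib-only sketch of the pattern. FakeElement is a hypothetical stand-in for a Selenium WebElement, used only to simulate how navigation makes old elements stale (Selenium itself raises StaleElementReferenceException in that situation):

```python
class FakeElement:
    """Stand-in for a Selenium WebElement: it becomes unusable ('stale')
    once the browser navigates to another page."""
    def __init__(self, href):
        self._href = href
        self.stale = False

    def get_attribute(self, name):
        if self.stale:
            raise RuntimeError("stale element reference")
        return self._href

elements = [FakeElement("https://example.com/item/1"),
            FakeElement("https://example.com/item/2")]

# Copy the hrefs out FIRST, while the elements are still attached to the page.
links = [el.get_attribute("href") for el in elements]

# Simulate browser.get() invalidating every element from the old page.
for el in elements:
    el.stale = True

# The plain strings survive navigation; the WebElements would not.
print(links)  # ['https://example.com/item/1', 'https://example.com/item/2']
```

The plain strings in links are unaffected by navigation, so you can safely call browser.get() on each of them inside the inner loop.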
That said, you should take a look at the scrapy library, which can do a lot of the heavy lifting for you.