How to iterate a variable in XPATH, extract a link and store it into a list for further iteration

I'm following a Selenium tutorial for an Amazon price tracker (智能编程 on YouTube), but I can't get the links from Amazon using their technique.

Tutorial link: https://www.youtube.com/watch?v=WbJeL_Av2-Q&t=4315s

I realized the problem is that, after doing a product search, I only get one link out of the 17 available products. I need to get the links for all the products after the search; those links are then used to go into each product and get its title, seller and price.

The function get_products_links() should get all the links and store them in a list for the function get_products_info() to use:

    def get_products_links(self):
        self.driver.get(self.base_url)  # Go to amazon.com using BASE_URL
        element = self.driver.find_element_by_id('twotabsearchtextbox')
        element.send_keys(self.search_term)
        element.send_keys(Keys.ENTER)
        time.sleep(2)  # Wait to load page
        self.driver.get(f'{self.driver.current_url}{self.price_filter}')
        time.sleep(2)  # Wait to load page
        result_list = self.driver.find_elements_by_class_name('s-result-list')

        links = []
        try:
            ### Trying to get a list of link attributes via XPath ###
            ### Only numbers from 3 to 17 work after doing a product search, where 'i' is placed in the XPath ###
            i = 3
            results = result_list[0].find_elements_by_xpath(
                f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
            links = [link.get_attribute('href') for link in results]
            return links
        except Exception as e:
            print("Didn't get any products...")
            print(e)
            return links

At this point get_products_links() only returns one link, because I just set 'i' to a fixed value of 3 to make it work for now.

I would like to iterate 'i' somehow so that I can save each of the different paths, but I don't know how to implement it.

I tried doing a for loop and appending the results to a new list, but the application stopped working.

The full code is below:

from amazon_config import (
    get_web_driver_options,
    get_chrome_web_driver,
    set_browser_as_incognito,
    set_ignore_certificate_error,
    NAME,
    CURRENCY,
    FILTERS,
    BASE_URL,
    DIRECTORY
)
import time
from selenium.webdriver.common.keys import Keys


class GenerateReport:
    def __init__(self):
        pass


class AmazonAPI:
    def __init__(self, search_term, filters, base_url, currency):
        self.base_url = base_url
        self.search_term = search_term
        options = get_web_driver_options()
        set_ignore_certificate_error(options)
        set_browser_as_incognito(options)
        self.driver = get_chrome_web_driver(options)
        self.currency = currency
        self.price_filter = f"&rh=p_36%3A{filters['min']}00-{filters['max']}00"

    def run(self):
        print("Starting script...")
        print(f"Looking for {self.search_term} products...")
        links = self.get_products_links()
        time.sleep(1)
        if not links:
            print("Stopped script.")
            return
        print(f"Got {len(links)} links to products...")
        print("Getting info about products...")
        products = self.get_products_info(links)

        # self.driver.quit()

    def get_products_info(self, links):
        asins = self.get_asins(links)
        product = []
        for asin in asins:
            product = self.get_single_product_info(asin)

    def get_single_product_info(self, asin):
        print(f"Product ID: {asin} - getting data...")
        product_short_url = self.shorten_url(asin)
        self.driver.get(f'{product_short_url}?language=en_GB')
        time.sleep(2)
        title = self.get_title()
        seller = self.get_seller()
        price = self.get_price()

    def get_title(self):
        try:
            return self.driver.find_element_by_id('productTitle')
        except Exception as e:
            print(e)
            print(f"Can't get title of a product - {self.driver.current_url}")
            return None

    def get_seller(self):
        try:
            return self.driver.find_element_by_id('bylineInfo')
        except Exception as e:
            print(e)
            print(f"Can't get seller of a product - {self.driver.current_url}")
            return None

    def get_price(self):
        return ''

    def shorten_url(self, asin):
        return self.base_url + 'dp/' + asin

    def get_asins(self, links):
        return [self.get_asin(link) for link in links]

    def get_asin(self, product_link):
        return product_link[product_link.find('/dp/') + 4:product_link.find('/ref')]

    def get_products_links(self):
        self.driver.get(self.base_url)  # Go to amazon.com using BASE_URL
        element = self.driver.find_element_by_id('twotabsearchtextbox')
        element.send_keys(self.search_term)
        element.send_keys(Keys.ENTER)
        time.sleep(2)  # Wait to load page
        self.driver.get(f'{self.driver.current_url}{self.price_filter}')
        time.sleep(2)  # Wait to load page
        result_list = self.driver.find_elements_by_class_name('s-result-list')

        links = []
        try:
            ### Trying to get a list of link attributes via XPath ###
            ### Only numbers from 3 to 17 work after doing a product search, where 'i' is placed ###
            i = 3
            results = result_list[0].find_elements_by_xpath(
                f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
            links = [link.get_attribute('href') for link in results]
            return links
        except Exception as e:
            print("Didn't get any products...")
            print(e)
            return links


if __name__ == '__main__':
    print("HEY!!!")
    amazon = AmazonAPI(NAME, FILTERS, BASE_URL, CURRENCY)
    amazon.run()

Steps to run the script:

Step 1: Install Selenium==3.141.0 into your virtual environment.

Step 2: Search Google for the Chrome driver and download the driver that matches your Chrome version. After downloading, unzip the driver and put it in your working folder.

Step 3: Create a file named amazon_config.py and insert the following code:

from selenium import webdriver

DIRECTORY = 'reports'
NAME = 'PS4'
CURRENCY = '$'
MIN_PRICE = '275'
MAX_PRICE = '650'
FILTERS = {
  'min': MIN_PRICE,
  'max': MAX_PRICE
}
BASE_URL = "https://www.amazon.com/"

def get_chrome_web_driver(options):
  return webdriver.Chrome('./chromedriver', chrome_options=options)

def get_web_driver_options():
  return webdriver.ChromeOptions()

def set_ignore_certificate_error(options):
  options.add_argument('--ignore-certificate-errors')

def set_browser_as_incognito(options):
  options.add_argument('--incognito')

If you have followed these steps correctly, you should be able to run the script and it will do the following:

  1. Go to www.amazon.com
  2. Search for a product ("PS4" in this case)
  3. Get the first product link
  4. Visit that product link

The terminal should print:

HEY!!!
Starting script...
Looking for PS4 products...
Got 1 links to products...
Getting info about products...
Product ID: B012CZ41ZA - getting data...

What I haven't managed to do is get all the links and iterate over them, so that the script visits every link on the first page.

If you are able to get all the links, the terminal should print:

HEY!!!
Starting script...
Looking for PS4 products...
Got 1 links to products...
Getting info about products...
Product ID: B012CZ41ZA - getting data...
Product ID: XXXXXXXXXX - getting data...
Product ID: XXXXXXXXXX - getting data...
Product ID: XXXXXXXXXX - getting data...
 # and so on until all links are visited 

I can't run it, so I can only guess at how I would do it.

I would put the whole try/except inside a for-loop, use links.append() instead of links = [...], and return after exiting the loop.

    # --- before loop ---
    
    links = []
    
    # --- loop ---
    
    for i in range(3, 18):
        try:
            results = result_list[0].find_elements_by_xpath(
            f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
            
            for link in results:
                links.append(link.get_attribute('href'))
                
        except Exception as e:
            print(f"Didn't get any products... (i = {i})")
            print(e)
        
    # --- after loop ---
    
    return links

But I would also try using // in the XPath to skip most of the divs - maybe if I skip div[{i}] I can get all the products without a for-loop.
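
Something like this, as a minimal sketch of that idea - it assumes Amazon still tags every search result with data-component-type="s-search-result" and puts the product link under the result's h2 heading, so verify those attributes in your browser's dev tools before relying on them:

    # Relative XPath: grab every result's title link in one query,
    # with no positional div[{i}] indexing.
    # (data-component-type="s-search-result" and the h2/a structure are
    #  assumptions about Amazon's current markup - check them in dev tools.)
    results = self.driver.find_elements_by_xpath(
        '//div[@data-component-type="s-search-result"]//h2/a')
    links = [link.get_attribute('href') for link in results]
    return links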


By the way:

In get_products_info() I see a similar problem - you create the empty list product = [], but later, inside the loop, you assign a value with product = ..., so you throw away the previous value of product. It needs product.append() to keep all the values.

Something like

def get_products_info(self, links):

    # --- before loop ---
    asins = self.get_asins(links)
    product = []

    # --- loop ---
    for asin in asins:
        product.append( self.get_single_product_info(asin) )

    # --- after loop ---
    return product
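
One more thing: product.append(...) only keeps something useful if get_single_product_info() actually returns a value - at the moment it falls off the end and returns None. A minimal sketch of returning the values it already collects (packing them into a dict is my choice here, not something from your code):

def get_single_product_info(self, asin):
    print(f"Product ID: {asin} - getting data...")
    product_short_url = self.shorten_url(asin)
    self.driver.get(f'{product_short_url}?language=en_GB')
    time.sleep(2)
    # Return the collected values instead of discarding them.
    # (The dict layout is an assumption, not part of the original code.)
    return {
        'asin': asin,
        'title': self.get_title(),
        'seller': self.get_seller(),
        'price': self.get_price(),
    }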