How to iterate a variable in XPath, extract a link and store it in a list for further iteration
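I'm following a Selenium tutorial for an Amazon price tracker (Clever Programming on YouTube), but I can't get the links from Amazon using their technique.
Tutorial link: https://www.youtube.com/watch?v=WbJeL_Av2-Q&t=4315s
I realized the problem is that after doing a product search I only get one link out of the 17 available products. I need to get the links for every product after the search; the tutorial then uses them to open each product and get its title, seller and price.
The function get_products_links() should get all the links and store them in a list to be used by the function get_products_info():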
def get_products_links(self):
    self.driver.get(self.base_url)  # Go to amazon.com using BASE_URL
    element = self.driver.find_element_by_id('twotabsearchtextbox')
    element.send_keys(self.search_term)
    element.send_keys(Keys.ENTER)
    time.sleep(2)  # Wait to load page
    self.driver.get(f'{self.driver.current_url}{self.price_filter}')
    time.sleep(2)  # Wait to load page
    result_list = self.driver.find_elements_by_class_name('s-result-list')
    links = []
    try:
        ### Trying to get a list for Xpath links attributes ###
        ### Only numbers from 3 to 17 work after doing product search where 'i' is placed in the XPATH ###
        i = 3
        results = result_list[0].find_elements_by_xpath(
            f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
        links = [link.get_attribute('href') for link in results]
        return links
    except Exception as e:
        print("Didn't get any products...")
        print(e)
        return links
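At the moment get_products_links() only returns one link, because I set 'i' to a fixed value of 3 just to make it work for now.
I'd like to iterate 'i' somehow so I can save every distinct path, but I don't know how to implement this.
I tried a for loop that appends the results to a new list, but then the application stops working.
Here is the complete code: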
from amazon_config import (
    get_web_driver_options,
    get_chrome_web_driver,
    set_browser_as_incognito,
    set_ignore_certificate_error,
    NAME,
    CURRENCY,
    FILTERS,
    BASE_URL,
    DIRECTORY
)
import time
from selenium.webdriver.common.keys import Keys


class GenerateReport:
    def __init__(self):
        pass


class AmazonAPI:
    def __init__(self, search_term, filters, base_url, currency):
        self.base_url = base_url
        self.search_term = search_term
        options = get_web_driver_options()
        set_ignore_certificate_error(options)
        set_browser_as_incognito(options)
        self.driver = get_chrome_web_driver(options)
        self.currency = currency
        self.price_filter = f"&rh=p_36%3A{filters['min']}00-{filters['max']}00"

    def run(self):
        print("Starting script...")
        print(f"Looking for {self.search_term} products...")
        links = self.get_products_links()
        time.sleep(1)
        if not links:
            print("Stopped script.")
            return
        print(f"Got {len(links)} links to products...")
        print("Getting info about products...")
        products = self.get_products_info(links)
        # self.driver.quit()

    def get_products_info(self, links):
        asins = self.get_asins(links)
        product = []
        for asin in asins:
            product = self.get_single_product_info(asin)

    def get_single_product_info(self, asin):
        print(f"Product ID: {asin} - getting data...")
        product_short_url = self.shorten_url(asin)
        self.driver.get(f'{product_short_url}?language=en_GB')
        time.sleep(2)
        title = self.get_title()
        seller = self.get_seller()
        price = self.get_price()

    def get_title(self):
        try:
            return self.driver.find_element_by_id('productTitle')
        except Exception as e:
            print(e)
            print(f"Can't get title of a product - {self.driver.current_url}")
            return None

    def get_seller(self):
        try:
            return self.driver.find_element_by_id('bylineInfo')
        except Exception as e:
            print(e)
            print(f"Can't get title of a product - {self.driver.current_url}")
            return None

    def get_price(self):
        return ''

    def shorten_url(self, asin):
        return self.base_url + 'dp/' + asin

    def get_asins(self, links):
        return [self.get_asin(link) for link in links]

    def get_asin(self, product_link):
        return product_link[product_link.find('/dp/') + 4:product_link.find('/ref')]

    def get_products_links(self):
        self.driver.get(self.base_url)  # Go to amazon.com using BASE_URL
        element = self.driver.find_element_by_id('twotabsearchtextbox')
        element.send_keys(self.search_term)
        element.send_keys(Keys.ENTER)
        time.sleep(2)  # Wait to load page
        self.driver.get(f'{self.driver.current_url}{self.price_filter}')
        time.sleep(2)  # Wait to load page
        result_list = self.driver.find_elements_by_class_name('s-result-list')
        links = []
        try:
            ### Trying to get a list for Xpath links attributes ###
            ### Only numbers from 3 to 17 work after doing product search where 'i' is placed ###
            i = 3
            results = result_list[0].find_elements_by_xpath(
                f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
            links = [link.get_attribute('href') for link in results]
            return links
        except Exception as e:
            print("Didn't get any products...")
            print(e)
            return links


if __name__ == '__main__':
    print("HEY!!!")
    amazon = AmazonAPI(NAME, FILTERS, BASE_URL, CURRENCY)
    amazon.run()
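Steps to run the script:
Step 1:
Install Selenium==3.141.0 into your virtual environment.
Step 2:
Search for the Chrome driver on Google and download the driver that matches your Chrome version. After downloading, extract the driver and paste it into your working folder.
Step 3:
Create a file named amazon_config.py and insert the following code: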
from selenium import webdriver

DIRECTORY = 'reports'
NAME = 'PS4'
CURRENCY = '$'
MIN_PRICE = '275'
MAX_PRICE = '650'
FILTERS = {
    'min': MIN_PRICE,
    'max': MAX_PRICE
}
BASE_URL = "https://www.amazon.com/"


def get_chrome_web_driver(options):
    return webdriver.Chrome('./chromedriver', chrome_options=options)


def get_web_driver_options():
    return webdriver.ChromeOptions()


def set_ignore_certificate_error(options):
    options.add_argument('--ignore-certificate-errors')


def set_browser_as_incognito(options):
    options.add_argument('--incognito')
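If you followed the steps correctly, you should be able to run the script, and it will do the following:
- Go to www.amazon.com
- Search for a product (in this case "PS4")
- Get the link for the first product
- Visit that product link
The terminal should print: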
HEY!!!
Starting script...
Looking for PS4 products...
Got 1 links to products...
Getting info about products...
Product ID: B012CZ41ZA - getting data...
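What I can't do is get all the links and iterate over them so that the script visits every link on the first page.
If you are able to get all the links, the terminal should print: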
HEY!!!
Starting script...
Looking for PS4 products...
Got 1 links to products...
Getting info about products...
Product ID: B012CZ41ZA - getting data...
Product ID: XXXXXXXXXX - getting data...
Product ID: XXXXXXXXXX - getting data...
Product ID: XXXXXXXXXX - getting data...
# and so on until all links are visited
I can't run it, so I can only guess how I would do it.
I would put the whole try/except inside a for-loop, use links.append() instead of links = [...], and use return only after exiting the loop.
# --- before loop ---
links = []
# --- loop ---
for i in range(3, 18):
    try:
        results = result_list[0].find_elements_by_xpath(
            f'//*[@id="search"]/div[1]/div[1]/div/span[3]/div[2]/div[{i}]/div/div/div/div/div/div[1]/div/div[2]/div/span/a')
        for link in results:
            links.append(link.get_attribute('href'))
    except Exception as e:
        print(f"Didn't get any products... (i = {i})")
        print(e)
# --- after loop ---
return links
But I would also try using xpath with // to skip most of the divs - maybe if I skip div[{i}] I could get all the products without a for-loop.
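To illustrate the idea of dropping the div[{i}] index, here is a minimal sketch that runs without a browser. It uses the standard library's xml.etree.ElementTree on a made-up, heavily simplified page; the markup, the 'result' and 'banner' class names and the collect_links() helper are all assumptions for illustration, not Amazon's real structure. With Selenium the same idea would be a relative XPath passed to find_elements_by_xpath() on result_list[0]. Note that an XPath starting with // is evaluated against the whole document even when called on an element, while .// stays inside that element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a search results page.
# The real Amazon markup is different and changes often.
PAGE = """
<div id="search">
  <div class="s-result-list">
    <div class="result"><h2><a href="/PS4-Console/dp/B000000001/ref=sr_1_1">one</a></h2></div>
    <div class="result"><span><h2><a href="/PS4-Slim/dp/B000000002/ref=sr_1_2">two</a></h2></span></div>
    <div class="banner"><a href="/ad">ad</a></div>
    <div class="result"><h2><a href="/PS4-Pro/dp/B000000003/ref=sr_1_3">three</a></h2></div>
  </div>
</div>
"""


def collect_links(result_list):
    # './/' searches all descendants of result_list, so no positional
    # index such as div[3], div[4], ... is needed and no loop over i.
    anchors = result_list.findall(".//div[@class='result']//a")
    return [a.get("href") for a in anchors]


root = ET.fromstring(PAGE)
result_list = root.find(".//div[@class='s-result-list']")
links = collect_links(result_list)
print(links)
```

The nesting depth differs between the result items on purpose: because the path skips intermediate elements, all three product links are found while the banner link is ignored.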
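BTW:
In get_products_info() I see a similar problem - you create an empty list product = [], but later in the loop you assign with product = ..., so you remove the previous value from product. It needs product.append() to keep all the values.
Something like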
def get_products_info(self, links):
    # --- before loop ---
    asins = self.get_asins(links)
    product = []
    # --- loop ---
    for asin in asins:
        product.append( self.get_single_product_info(asin) )
    # --- after loop ---
    return product
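One more thing that follows from the code in the question: get_single_product_info() has no return statement, so even with append() the list would only contain None values. A minimal sketch of the combined fix, using a hypothetical FakeScraper class with made-up values in place of the real browser lookups (the dict layout is an assumption, not something the tutorial prescribes):

```python
class FakeScraper:
    """Hypothetical stand-in for AmazonAPI that needs no browser."""

    def get_single_product_info(self, asin):
        # The real method would read these values from the product page.
        title = f"title for {asin}"
        seller = "some seller"
        price = ""
        # Without this return, the caller would only ever append None.
        return {"asin": asin, "title": title, "seller": seller, "price": price}

    def get_products_info(self, asins):
        products = []
        for asin in asins:
            # append() keeps earlier results instead of overwriting them.
            products.append(self.get_single_product_info(asin))
        return products


products = FakeScraper().get_products_info(["B012CZ41ZA", "B000000002"])
print(len(products))
```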