How to extract values or data from a list of stored links using Selenium in Python?
I am trying to scrape prices from a real estate website (nepalhomes.com), so I made a list of scraped links and wrote a script to get the prices from all of those links. I tried googling and asking around but could not find a decent answer. I just want to get the price values from the list of links and store them in a way that can later be converted into a CSV file with house name, location and price as headers, along with the respective data. The output I am getting is not what I want: I only want the last list, the one containing all the prices. My code is below.
from selenium import webdriver
import pandas as pd
PATH = "C:/ProgramData/Anaconda3/scripts/chromedriver.exe"  # always keep chromedriver.exe inside scripts to save hours of debugging
driver = webdriver.Chrome(PATH)  # pretty important part
driver.get("https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a")
driver.implicitly_wait(10)
data_extract = pd.read_csv(r'F:\github projects\homie.csv')  # reading the csv file, which contains 8 links
de = data_extract['Links'].tolist()  # converting the column to a list so that it can be iterated
data = []  # empty list to store the prices scraped from the links in homie.csv
for url in de[0:]:  # de has all the links which I want to iterate over and scrape prices from
    driver.get(url)
    prices = driver.find_elements_by_xpath("//div[@id='app']/div[1]/div[2]/div[1]/div[2]/div/p[1]")
    for price in prices:  # after finding the XPath, get the prices
        data.append(price.text)
        print(data)  # printing to the console just to check what kind of data I obtained
Any help would be appreciated. The output I am expecting looks like [[price of the house in link 0], [price of the house in link 1], and so on]. The links in homie.csv are as follows:
Links
https://www.nepalhomes.com/detail/bungalow-house-for-sale-at-mandikhatar
https://www.nepalhomes.com/detail/brand-new-house-for-sale-in-baluwakhani
https://www.nepalhomes.com/detail/bungalow-house-for-sale-in-bhangal-budhanilkantha
https://www.nepalhomes.com/detail/commercial-house-for-sale-in-mandikhatar
https://www.nepalhomes.com/detail/attractive-house-on-sale-in-budhanilkantha
https://www.nepalhomes.com/detail/house-on-sale-at-bafal
https://www.nepalhomes.com/detail/house-on-sale-in-madigaun-sunakothi
https://www.nepalhomes.com/detail/house-on-sale-in-chhaling-bhaktapur
I see several problems here:
- I don't see any elements matching the class names text-3xl font-bold leading-none text-black on the https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a web page.
- Even if there were such elements, for multiple class names you should use a CSS selector or XPath. So instead of
find_elements_by_class_name('text-3xl font-bold leading-none text-black')
it should be
find_elements_by_css_selector('.text-3xl.font-bold.leading-none.text-black')
- The find_elements method returns a list of web elements, so to get the text from those elements you have to iterate over the list and read the text of each element, like this:
prices = driver.find_elements_by_css_selector('.text-3xl.font-bold.leading-none.text-black')
for price in prices:
    data.append(price.text)
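To illustrate the point about multiple class names, here is a minimal sketch of turning a space-separated class attribute into a compound CSS selector. The helper name classes_to_css_selector is made up for this example and is not part of Selenium. It assumes plain class names; names containing characters such as ":" or "/" would need CSS escaping, which this sketch does not do.

```python
def classes_to_css_selector(class_attr: str) -> str:
    """Turn a space-separated class attribute into a compound CSS selector.

    Hypothetical helper for illustration; assumes class names need no escaping.
    """
    # str.split() with no arguments also drops repeated or surrounding whitespace.
    return "".join("." + name for name in class_attr.split())


selector = classes_to_css_selector("text-3xl font-bold leading-none text-black")
print(selector)
```

The resulting string can then be passed to find_elements_by_css_selector, as in the answer above.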
Update:
With this locator it works correctly for me:
prices = driver.find_elements_by_xpath("//p[@class='text-xl leading-none text-black']/p[1]")
for price in prices:
    data.append(price.text)
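You don't need Selenium to get the data you need. The page loads its data from an API endpoint.
API endpoint:
https://www.nepalhomes.com/api/property/public/data?&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a
You can use the requests module to send a request directly to the API endpoint and get your data.
This code prints the title and price of every listing.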
import requests
url = 'https://www.nepalhomes.com/api/property/public/data?&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a'
r = requests.get(url)
info = r.json()
for i in info['data']:
    print([i['basic']['title'], i['price']['value']])
['House on sale at Kapan near Karuna Hospital ', 15500000]
['House on sale at Banasthali', 70000000]
['Bungalow house for sale at Mandikhatar', 38000000]
['Brand new house for sale in Baluwakhani', 38000000]
['Bungalow house for sale in Bhangal, Budhanilkantha', 29000000]
['Commercial house for sale in Mandikhatar', 27500000]
['Attractive house on sale in Budhanilkantha', 55000000]
['House on sale at Bafal', 45000000]
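Since the question asks for something that can be saved as CSV, here is a rough sketch of writing records of this shape with the standard csv module. It only uses the two keys shown in the answer (basic.title and price.value); the key holding the location is not shown there, so it is left out rather than guessed. The function name listings_to_csv and the sample records are made up for the example.

```python
import csv
import io


def listings_to_csv(records, out):
    """Write one CSV row per listing to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(["house name", "price"])
    for rec in records:
        # Titles from the API can carry trailing spaces, so strip them.
        writer.writerow([rec["basic"]["title"].strip(), rec["price"]["value"]])


# Sample records shaped like the entries of info['data'] in the answer above.
sample = [
    {"basic": {"title": "House on sale at Kapan near Karuna Hospital "},
     "price": {"value": 15500000}},
    {"basic": {"title": "House on sale at Banasthali"},
     "price": {"value": 70000000}},
]

buf = io.StringIO()
listings_to_csv(sample, buf)
print(buf.getvalue())
```

To write a real file instead of an in-memory buffer, open it with open("prices.csv", "w", newline="") and pass that as out; the file name here is only a placeholder.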
I tried the following XPath expressions, and they retrieved the prices.
price_list, nameprice_list = [], []
# Each listing on the page is an <a> element inside the 'table-list' container
houses = driver.find_elements_by_xpath("//div[contains(@class,'table-list')]/a")
for house in houses:
    name = house.find_element_by_tag_name("h2").text
    address = house.find_element_by_xpath(".//p[contains(@class,'opacity-75')]").text
    price = house.find_element_by_xpath(".//p[contains(@class,'text-xl')]/p").text.replace('Rs. ', '')
    price_list.append(price)
    nameprice_list.append((name, price))
    print("{}: {},{}".format(name, address, price))
print(nameprice_list)
print(price_list)
Output:
House on sale at Kapan near Karuna Hospital: Kapan, Budhanilkantha Municipality,1,55,00,000
House on sale at Banasthali: Banasthali, Kathmandu Metropolitan City,7,00,00,000
...
[('House on sale at Kapan near Karuna Hospital', '1,55,00,000'), ('House on sale at Banasthali', '7,00,00,000'), ('Bungalow house for sale at Mandikhatar', '3,80,00,000'), ('Brand new house for sale in Baluwakhani', '3,80,00,000'), ('Bungalow house for sale in Bhangal, Budhanilkantha', '2,90,00,000'), ('Commercial house for sale in Mandikhatar', '2,75,00,000'), ('Attractive house on sale in Budhanilkantha', '5,50,00,000'), ('House on sale at Bafal', '4,50,00,000')]
['1,55,00,000', '7,00,00,000', '3,80,00,000', '3,80,00,000', '2,90,00,000', '2,75,00,000', '5,50,00,000', '4,50,00,000']
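The scraped prices are strings with Indian-style digit grouping, which are awkward to sort or sum. As a small sketch, they could be converted to integers like this. The name parse_price is made up for the example, and it assumes prices are whole rupee amounts with no decimal part.

```python
def parse_price(text: str) -> int:
    """Convert a price such as 'Rs. 1,55,00,000' or '1,55,00,000' to an int.

    Assumes a whole-number amount: every non-digit character is dropped.
    """
    digits = "".join(ch for ch in text if ch.isdecimal())
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_price("1,55,00,000"))
print(parse_price("Rs. 7,00,00,000"))
```

Applied to price_list, this would be [parse_price(p) for p in price_list].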
At first glance only 8 prices are visible. If you just want to scrape them using Selenium:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
driver.maximize_window()
driver.implicitly_wait(30)
driver.get("https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a")
wait = WebDriverWait(driver, 20)
for price in driver.find_elements(By.XPATH, "//p[contains(@class,'leading')]/p[1]"):
    print(price.text.split('.')[1])
This prints all the prices without the "Rs." prefix.
The print statement should be outside the for loops to avoid staircase-style output.
from selenium import webdriver
import pandas as pd
PATH = "C:/ProgramData/Anaconda3/scripts/chromedriver.exe"  # always keep chromedriver.exe inside scripts to save hours of debugging
driver = webdriver.Chrome(PATH)  # pretty important part
driver.get("https://www.nepalhomes.com/list/&sort=1&find_property_purpose=5db2bdb42485621618ecdae6&find_property_category=5d660cb27682d03f547a6c4a")
driver.implicitly_wait(10)
data_extract = pd.read_csv(r'F:\github projects\homie.csv')
de = data_extract['Links'].tolist()
data = []
for url in de[0:]:
    driver.get(url)
    prices = driver.find_elements_by_xpath("//div[@id='app']/div[1]/div[2]/div[1]/div[2]/div/p[1]")
    for price in prices:  # after finding the XPath, get the prices
        data.append(price.text)
print(data)  # printed once, after both loops have finished
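The question asks for one list of prices per link, that is [[prices in link 0], [prices in link 1], ...], whereas appending to a single data list gives a flat list. A minimal sketch of the nested version follows. The scraping step is passed in as a function so the grouping logic can be run without a browser; collect_prices_per_link, fake_fetch and the example.com URLs are all invented for illustration.

```python
from typing import Callable, Iterable, List


def collect_prices_per_link(
    urls: Iterable[str],
    fetch_prices: Callable[[str], List[str]],
) -> List[List[str]]:
    """Return one inner list of prices for each URL, in the same order."""
    results = []
    for url in urls:
        # Append the whole list for this link instead of extending a flat list.
        results.append(fetch_prices(url))
    return results


# Stand-in for the Selenium step. In the real script this function would do
# driver.get(url) and return [el.text for el in <the elements found>].
def fake_fetch(url: str) -> List[str]:
    canned = {
        "https://example.com/house-a": ["Rs. 3,80,00,000"],
        "https://example.com/house-b": ["Rs. 4,50,00,000"],
    }
    return canned.get(url, [])


nested = collect_prices_per_link(
    ["https://example.com/house-a", "https://example.com/house-b"],
    fake_fetch,
)
print(nested)
```

Keeping the results aligned with the order of the links also makes it easy to pair each link with its prices later, for example with zip(de, nested).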