Checking Websites for Updates (Web Automation with Python + Selenium)
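I am trying to write a simple script that does the following:
1. Runs automatically every 6 hours
2. Checks a real estate website for new listings
3. If new listings are found, emails their details; otherwise terminates until the next run
I plan to use crontab for (1). Here is the script I have come up with so far for one particular website: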
    from selenium import webdriver
    import smtplib
    import sys

    driver = webdriver.Firefox()

    #Capital Pacific Website
    #Commercial Real Estate

    #open text file containing property titles we already know about
    properties = open("properties.txt", "r+")
    currentList = []
    for line in properties:
        currentList.append(line)

    #to search for new listings
    driver.get("http://cp.capitalpacific.com/Properties")
    assert "Capital" in driver.title

    #holds any new listings
    newProperties = []

    #find all listings on page by Property Name
    newList = driver.find_elements_by_class_name('overview')

    #find elements in pageList not in oldList & add to newList
    #add new elements to
    for x in currentList:
        for y in newList:
            if y != x:
                newProperties.append(y)
                properties.write(y)

    properties.close()
    driver.close()

    #if no new properties found, terminate script
    #else, email properties
    if not newProperties:
        sys.exit()
    else:
        fromaddr = 'someone@gmail.com'
        toaddrs = ['someoneelse@yahoo.com']
        server = smtplib.SMTP('smtp.gmail.com:587')
        server.starttls()
        for item in newProperties:
            msg = item
            server.sendmail(fromaddr, toaddrs, msg)
        server.quit()
My questions at the moment (please bear with me, I am new to Python):
- I am using a list to store the web elements returned by Selenium's "find by class" method. Is there a better way to read from and write to the text file to make sure I only pick up the newly added properties?
- If the script does find an element with that class on the website that is not in newList, is there a way to go through just that div to get the details of the listing?
Any suggestions or recommendations, please! Thanks.
What if you switched to the JSON format and stored the listings as a list of dictionaries:
    [
        {
            "location": "OREGON CITY, OR",
            "price": 33000000,
            "status": "active",
            "marketing_package_url": "http://www.capitalpacific.com/inquiry/TrailsEndMarketplaceExecSummary.pdf"
            ...
        },
        ...
    ]
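You would need something unique about each property in order to identify new listings. For example, you could use the marketing package URL, which looks unique to me.
Here is sample code to get the list of listings from the page: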
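To make the unique-key idea concrete, here is a minimal sketch of how the comparison might look. It assumes the known listings are kept in a JSON file and that each listing is a dictionary with a `marketing_package_url` key. The helper names `load_known`, `find_new` and `save_known` are made up for illustration and are not part of Selenium or any library.

```python
import json
import os


def load_known(path):
    """Return the listings saved by a previous run, or [] on the first run."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return json.load(f)


def find_new(scraped, known, key="marketing_package_url"):
    """Return the scraped listings whose key value has not been seen before."""
    known_keys = {item[key] for item in known}
    return [item for item in scraped if item[key] not in known_keys]


def save_known(path, listings):
    """Overwrite the JSON file with the given list of listing dictionaries."""
    with open(path, "w") as f:
        json.dump(listings, f, indent=2)
```

In the script, `scraped` would be the list of dictionaries built from the page; after handling the result of `find_new(scraped, known)`, something like `save_known(path, known + new)` would record the new listings for the next run. Building a set of known keys keeps the check to one pass over each list, instead of comparing every scraped item against every stored line.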
    properties = []
    for property in driver.find_elements_by_css_selector('table.property div.property'):
        title = property.find_element_by_css_selector('div.title h2')
        location = property.find_element_by_css_selector('div.title h4')
        marketing_package = property.find_element_by_partial_link_text('Marketing Package')

        properties.append({
            'title': title.text,
            'location': location.text,
            'marketing_package_url': marketing_package.get_attribute('href')
        })
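Because each listing is now a plain dictionary rather than a WebElement, it can be turned into text for the email step. The following is only a sketch under some assumptions: the dictionaries have the `title`, `location` and `marketing_package_url` keys used above, and the SMTP server requires a login after STARTTLS (Gmail typically does). `build_message` and `send_message` are hypothetical helper names, and the credentials are placeholders you would supply yourself.

```python
import smtplib
from email.message import EmailMessage


def build_message(new_listings, fromaddr, toaddrs):
    """Build a single plain-text email that summarizes all new listings."""
    msg = EmailMessage()
    msg["Subject"] = "%d new listing(s)" % len(new_listings)
    msg["From"] = fromaddr
    msg["To"] = ", ".join(toaddrs)
    blocks = []
    for item in new_listings:
        blocks.append("%s (%s)\n%s" % (
            item["title"], item["location"], item["marketing_package_url"]))
    msg.set_content("\n\n".join(blocks))
    return msg


def send_message(msg, user, password, host="smtp.gmail.com", port=587):
    """Send the message over SMTP with STARTTLS; assumes the server needs a login."""
    with smtplib.SMTP(host, port) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(msg)
```

Sending one message with every new listing in it, rather than one message per listing, keeps a busy day from flooding the inbox. If the list of new listings is empty, the script can simply exit without calling either function.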