Web scraping: back-in-stock notification
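I want to set up a Python script that tells me whether a product is in stock. At the moment it fetches the URL below and parses the relevant part of the page, but I can't work out how to take the output variable, which I call stock, store it in another variable called stock_history, and then run another line that asks whether stock is equal to stock_history.
I also get an "EOL while scanning string literal" error when I try to store the HTML in stock_history. Is there a better way to do this?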
import requests
from datetime import datetime
from bs4 import BeautifulSoup
import csv
now = datetime.now()
#enter website address
url = requests.get('https://shop.bitmain.com/antminer_s9_asic_bitcoin_miner.htm')
soup = BeautifulSoup(url.content,'html')
stock = (soup.find("div", "buy-now-bar-con"))
stock_history = '<div class="buy-now-bar-con">
<a class="current" href="antminer_s9_asic_bitcoin_miner.htm?flag=overview">Overview</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=specifications">Specification</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=gallery">Gallery</a>
<a class="btn-buy-now" href="javascript:;" style="background:#a7a4a4; cursor:not-allowed;" target="_self" title="sold out!">Coming soon</a>
</div>'
print(stock)
if stock == stock_history
print("still not in stock")
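First, EOL stands for "End of Line". Python raises this error when a string literal delimited by single or double quotes reaches the end of a line without its closing quote, which is exactly what happens when the HTML is spread across several lines. To avoid it, triple-quote the string in your original code, like this: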
stock_history = '''<div class="buy-now-bar-con">
<a class="current" href="antminer_s9_asic_bitcoin_miner.htm?flag=overview">Overview</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=specifications">Specification</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=gallery">Gallery</a>
<a class="btn-buy-now" href="javascript:;" style="background:#a7a4a4; cursor:not-allowed;" target="_self" title="sold out!">Coming soon</a>
</div>'''
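That large string is ugly, so I dropped it altogether because it isn't needed. The only information you need from the stock variable is whether the product is sold out. To get it, convert the bs4.element.Tag to a str and use a regular expression to check for the "sold out!" substring. Regular expressions come in handy whenever you scrape data, process text, or pick values out of XML or HTML, so I encourage you to read up on them.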
More information: https://www.regular-expressions.info/
You can easily test Python regular expression captures here: https://pythex.org/
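To make the substring check concrete, here is a minimal sketch that uses only the standard library. The SAMPLE_HTML string and the is_sold_out helper are made up for illustration (the markup is modelled on the button from the question), so treat it as one possible way to write the test rather than the only one.

```python
import re

# Hypothetical snippet modelled on the button markup from the question.
SAMPLE_HTML = (
    '<div class="buy-now-bar-con">'
    '<a class="btn-buy-now" title="sold out!">Coming soon</a>'
    '</div>'
)


def is_sold_out(html_text):
    """Return True if the literal text 'sold out!' appears in html_text."""
    # re.escape keeps any punctuation in the marker from acting as regex syntax.
    return re.search(re.escape("sold out!"), html_text) is not None


print(is_sold_out(SAMPLE_HTML))
print(is_sold_out('<a class="btn-buy-now" title="buy now">Buy</a>'))
```

For a fixed marker like this one, the plain test `"sold out!" in html_text` gives the same answer; a regular expression starts to pay off once the marker text varies.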
Here is the revised code, which does what you wanted.
import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup

def stock_check(url):
    """Return True if 'sold out!' appears in the buy-now bar of url.content."""
    soup = BeautifulSoup(url.content, "lxml")  # Requires the lxml package to be installed.
    stock = soup.find("div", "buy-now-bar-con")  # Tag holding the sold out/coming soon info.
    stock_status = re.findall(r"(sold out!)", str(stock))  # List of matches; empty if there are none.
    return bool(stock_status)  # Indexing [0] here would raise IndexError once the item is in stock.

now = datetime.now()
url = requests.get('https://shop.bitmain.com/antminer_s9_asic_bitcoin_miner.htm')
if stock_check(url):
    print(str(now) + ": Still not in stock...")
else:
    print(str(now) + ": Now in stock!")
Give it a try, and let me know if you have any questions!
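If you still want something like the stock_history comparison from the question, one option is to keep the last status in a small file and compare against it on each run. The sketch below is only an illustration: the status_changed helper, the JSON layout, and the stock_history.json file name are assumptions of mine, not part of the code above.

```python
import json
import tempfile
from pathlib import Path


def status_changed(current, history_file):
    """Compare current with the stored status, then store current.

    Returns True only if a previous status exists and differs from current.
    """
    path = Path(history_file)
    previous = None
    if path.exists():
        previous = json.loads(path.read_text(encoding="utf-8"))["status"]
    path.write_text(json.dumps({"status": current}), encoding="utf-8")
    return previous is not None and previous != current


# Demo in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp:
    history = Path(tmp) / "stock_history.json"
    print(status_changed("sold out!", history))  # first run: nothing to compare with
    print(status_changed("sold out!", history))  # same status as last time
    print(status_changed("in stock", history))   # status differs from last time
```

Storing a short status string rather than the whole HTML block keeps the comparison from breaking whenever unrelated markup on the page changes.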
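Edit: the OP asked how to check the page periodically and include an e-mail notification. This needed a few changes to the original solution, such as setting User-Agent information in the requests headers field. I also switched the BeautifulSoup object to html.parser instead of lxml so that the javascript in url.content is handled correctly.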
import re
import time
import smtplib
import requests
from datetime import datetime
from bs4 import BeautifulSoup

def stock_check(url):
    """Checks url for 'sold out!' substring in buy-now-bar-con"""
    soup = BeautifulSoup(url.content, "html.parser")  # html.parser replaces lxml here.
    stock = soup.find("div", "buy-now-bar-con")  # Check the html tags for sold out/coming soon info.
    stock_status = re.findall(r"sold out!", str(stock))  # List of matches; empty if there are none.
    return stock_status

def send_email(address, password, message):
    """Send an e-mail to yourself!"""
    server = smtplib.SMTP("smtp.gmail.com", 587)  # e-mail server
    server.ehlo()
    server.starttls()
    server.login(address, password)  # login
    message = str(message)  # message to email yourself
    server.sendmail(address, address, message)  # send the email through dedicated server
    server.quit()  # close the connection

def stock_check_listener(page, headers, address, password, run_hours):
    """Periodically checks stock information."""
    start = datetime.now()  # start time
    while True:
        url = requests.get(url=page, headers=headers)  # Fetch a fresh copy of the page on every pass.
        now = datetime.now()
        if "sold out!" in stock_check(url):  # check page
            print(str(now) + ": Not in stock.")
        else:
            message = str(now) + ": NOW IN STOCK!"
            print(message)
            send_email(address, password, message)
            break  # Stop at once instead of sleeping again.
        hours = int((now - start).total_seconds() / 3600)
        if hours >= run_hours:  # check run time
            print("Finished.")
            break
        time.sleep(30*60)  # Wait N minutes to check again (N = 30 here).

if __name__ == "__main__":
    # Set url and userAgent header for javascript issues.
    page = "https://shop.bitmain.com/antminer_s9_asic_bitcoin_miner.htm"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
               'Content-Type': 'text/html'}
    # Run listener to stream stock checks.
    address = "user@gmail.com"  # your email
    password = "user.password"  # your email password
    stock_check_listener(page=page,
                         headers=headers,
                         address=address,
                         password=password,
                         run_hours=1)
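The program now starts a while loop that periodically requests information from the web page. You can set the timeout (in hours) by changing the run_hours variable, and you can set the sleep/wait time (in minutes) by changing N inside stock_check_listener. I'm using Gmail in this case. If you get an error when e-mailing yourself, follow this link: https://myaccount.google.com/lesssecureapps, and allow less secure apps (your Python program) to access your Gmail account.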
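The wait-and-timeout logic can also be looked at on its own, separate from the scraping and e-mail parts. The following is a rough sketch under my own assumptions: poll_until, its parameters, and the fake_check stand-in are hypothetical names, and the sleep and clock functions are passed in only so the loop can be exercised without real waiting.

```python
import time


def poll_until(check, interval_seconds, timeout_seconds,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() every interval_seconds until it returns True or time runs out.

    Returns True if check() succeeded, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if check():
            return True
        if clock() + interval_seconds > deadline:
            return False  # The next attempt would land past the deadline.
        sleep(interval_seconds)


# Demo with a stand-in check that succeeds on its third call.
# A real check would download the page again and look for "sold out!".
attempts = []


def fake_check():
    attempts.append(1)
    return len(attempts) >= 3


found = poll_until(fake_check, interval_seconds=0.01, timeout_seconds=1)
print(found, len(attempts))
```

In the listener above, the equivalent call would use an interval of 30 * 60 seconds and a timeout of run_hours * 3600 seconds, with a check function that fetches the page each time it is called.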