Web scraping: back-in-stock notification

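I want to set up a Python script that tells me whether a product is in stock. Right now it fetches the URL below and parses the relevant part of the page, but I can't figure out how to take the output variable, which I call stock, store it in another variable called stock_history, and then run another line that checks whether stock equals stock_history.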

I also get an "EOL while scanning string literal" error when I try to store the HTML data in stock_history. Is there a better way to do this?

import requests
from datetime import datetime 
from bs4 import BeautifulSoup
import csv
now = datetime.now()
#enter website address
url = requests.get('https://shop.bitmain.com/antminer_s9_asic_bitcoin_miner.htm')

soup = BeautifulSoup(url.content,'html')

stock = (soup.find("div", "buy-now-bar-con"))

stock_history = '<div class="buy-now-bar-con">
<a class="current" href="antminer_s9_asic_bitcoin_miner.htm?flag=overview">Overview</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=specifications">Specification</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=gallery">Gallery</a>
<a class="btn-buy-now" href="javascript:;" style="background:#a7a4a4; cursor:not-allowed;" target="_self" title="sold out!">Coming soon</a>
</div>'


print(stock)

if stock == stock_history 
    print("still not in stock")

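First, EOL stands for "End of Line". You usually get this error when Python doesn't like the way you defined a string, for example when a single-quoted string runs past the end of its line, or when it contains some odd characters. To avoid it, you can triple-quote the string from your original code, like so: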

stock_history = '''<div class="buy-now-bar-con">
<a class="current" href="antminer_s9_asic_bitcoin_miner.htm?flag=overview">Overview</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=specifications">Specification</a>
<a href="antminer_s9_asic_bitcoin_miner.htm?flag=gallery">Gallery</a>
<a class="btn-buy-now" href="javascript:;" style="background:#a7a4a4; cursor:not-allowed;" target="_self" title="sold out!">Coming soon</a>
</div>'''
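
To see the difference for yourself, here is a small sketch that compiles two tiny snippets of source code: one with a single-quoted string broken across two lines, and one with the same string triple-quoted. The names compiles, broken_source and fixed_source are made up for this illustration.

```python
# Source text in which a single-quoted literal is split across two lines.
broken_source = "s = '<div>\n</div>'"
# The same literal, but triple-quoted, so it may span several lines.
fixed_source = "s = '''<div>\n</div>'''"


def compiles(source):
    """Return True if Python can compile the given source text."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        # An unterminated single-quoted string ends up here.
        return False


print(compiles(broken_source))
print(compiles(fixed_source))
```

The first snippet fails to compile because the string is still open when the line ends; the second compiles because triple quotes allow line breaks inside the literal.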

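That big string is ugly and unnecessary, so I dropped it. The only information you need from the stock variable is whether the product is sold out. To get that, you can convert the bs4.element.Tag to a str and use a regular expression to check for the "sold out!" substring. Regular expressions really come in handy when you are scraping data, processing text, or doing any kind of XML or HTML parsing, so I encourage you to read up on them.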

More information: https://www.regular-expressions.info/

You can easily test Python regex captures here: https://pythex.org/
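
As a rough sketch of the regex check on its own, the snippet below runs it against plain strings, so it needs neither the network nor bs4. The sold-out markup is copied from the question; the in-stock markup and the helper name is_sold_out are invented for illustration, since I don't know what the real page looks like once the product is available.

```python
import re

# Markup from the question; in the real script this would be str(stock).
sold_out_html = '''<div class="buy-now-bar-con">
<a class="btn-buy-now" href="javascript:;" style="background:#a7a4a4; cursor:not-allowed;" target="_self" title="sold out!">Coming soon</a>
</div>'''

# Hypothetical in-stock markup, made up for this example only.
in_stock_html = '<div class="buy-now-bar-con"><a class="btn-buy-now" href="#" title="buy now">Buy now</a></div>'


def is_sold_out(markup):
    """Return True if the 'sold out!' marker appears anywhere in the markup."""
    return re.search(r"sold out!", markup) is not None


print(is_sold_out(sold_out_html))
print(is_sold_out(in_stock_html))
```

For a fixed phrase like this, a plain `"sold out!" in markup` test would work just as well; the regex only starts to pay off once the pattern gets more flexible.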

Here is the modified code, which does what you want it to do.

import re
import csv
import requests
from datetime import datetime 
from bs4 import BeautifulSoup

def stock_check(url):
    """Function checks url for 'sold out!' substring in url.content"""
    soup = BeautifulSoup(url.content, "lxml") #Need to use lxml parser
    stock = soup.find("div", "buy-now-bar-con") #Check the html tags for sold out/coming soon info.
    stock_status = re.findall(r"(sold out!)", str(stock)) #Returns list of captured substring if exists.
    return stock_status[0] if stock_status else None # "sold out!" if found, otherwise None (avoids an IndexError once the item is in stock).

now = datetime.now()

url = requests.get('https://shop.bitmain.com/antminer_s9_asic_bitcoin_miner.htm')

if stock_check(url) == "sold out!":
    print(str(now) + ": Still not in stock...")
else:
    print(str(now) + ": Now in stock!")

Give it a try, and let me know if you have any questions!
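
If you do still want to compare against an earlier value, as you described with stock_history, one possible approach is to keep the last status in a small text file between runs. This is only a sketch: the helper name status_changed and the file name in the usage note are my own assumptions, not part of your script.

```python
from pathlib import Path


def status_changed(current_status, history_file):
    """Compare current_status with the status saved by the previous run.

    Saves current_status for the next run and returns True only when a
    previous status exists and differs from the current one.
    """
    path = Path(history_file)
    # On the very first run there is no history file yet.
    previous = path.read_text(encoding="utf-8") if path.exists() else None
    path.write_text(current_status, encoding="utf-8")
    return previous is not None and previous != current_status
```

You could call it as status_changed(str(stock), "stock_history.txt") after each scrape and only notify yourself when it returns True.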

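Edit: The OP asked how to check the page periodically and include an e-mail notification. That required a few changes to the original solution, such as setting the User-Agent in the requests headers field. I also switched the BeautifulSoup object to html.parser instead of lxml so that the JavaScript in url.content is handled properly.
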
import re
import time
import smtplib
import requests
from datetime import datetime 
from bs4 import BeautifulSoup

def stock_check(url):
    """Checks url for 'sold out!' substring in buy-now-bar-con"""
    soup = BeautifulSoup(url.content, "html.parser") #Use html.parser here instead of lxml
    stock = soup.find("div", "buy-now-bar-con") #Check the html tags for sold out/coming soon info.
    stock_status = re.findall(r"sold out!", str(stock)) #Returns list of matched substrings (empty if none).
    return stock_status # ["sold out!"] if sold out, otherwise [].

def send_email(address, password, message):
    """Send an e-mail to yourself!"""
    server = smtplib.SMTP("smtp.gmail.com", 587) #e-mail server
    server.ehlo()
    server.starttls()
    server.login(address,password) #login
    message = str(message) #message to email yourself
    server.sendmail(address,address,message) #send the email through dedicated server
    return

def stock_check_listener(page, headers, address, password, run_hours):
    """Periodically checks stock information."""
    start = datetime.now() # start time
    while True:
        now = datetime.now() # time of this check
        url = requests.get(url=page, headers=headers) # re-fetch the page on every check
        if "sold out!" in stock_check(url): #check page
            print(str(now) + ": Not in stock.")
        else:
            message = str(now) + ": NOW IN STOCK!"
            print(message)
            send_email(address, password, message)
            break # stop listening once the e-mail is sent

        duration = (now - start)
        seconds = duration.total_seconds()
        hours = int(seconds/3600)
        if hours >= run_hours: #check run time
            print("Finished.")
            break

        time.sleep(30*60) #Wait N minutes (here N = 30) to check again.
    return

if __name__=="__main__":

    #Set url and userAgent header for javascript issues.
    page = "https://shop.bitmain.com/antminer_s9_asic_bitcoin_miner.htm"
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html'}

    #Run listener to stream stock checks.
    address = "user@gmail.com" #your email
    password = "user.password" #your email password
    stock_check_listener(page=page,
                         headers=headers,
                         address=address,
                         password=password,
                         run_hours=1)

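Now the program starts a while loop that periodically requests information from the web page. You can set a timeout (in hours) by changing the run_hours variable. You can also set the sleep/wait time (in minutes) by changing N inside stock_check_listener. I'm using Gmail in this case; if you get an error message when e-mailing yourself, you will need to follow this link: https://myaccount.google.com/lesssecureapps, and allow less secure apps (your Python program) to access your Gmail account.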
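
If you would rather not edit N inside the function, a possible variation is to pass both the wait time and the timeout in as parameters. The sketch below is a generic polling helper, not the code above: poll_until, interval_minutes, and the injectable sleep and clock arguments are names I made up so the loop can be tried out without actually waiting.

```python
import time
from datetime import datetime, timedelta


def poll_until(check, interval_minutes=30, run_hours=1,
               sleep=time.sleep, clock=datetime.now):
    """Call check() every interval_minutes until it returns True.

    Returns True as soon as check() succeeds, or False once run_hours
    have elapsed without success.
    """
    deadline = clock() + timedelta(hours=run_hours)
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_minutes * 60)  # the sleep function expects seconds


# Example with a fake check that succeeds on the third attempt.
attempts = iter([False, False, True])
found = poll_until(lambda: next(attempts), interval_minutes=0)
print(found)
```

In the stock script, check would be a small function that fetches the page and returns True when "sold out!" is no longer present.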