How to recursively scrape a web page to check for new PDF files in Python?

Question

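A website publishes a PDF report every month. I want to monitor it every hour and, when a new PDF is uploaded, have that PDF sent to my email address. I want to use Python for this. I am familiar with Beautiful Soup and Scrapy, but I don't know how to check for new PDF files and fetch only the new ones.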

Answer 1

import requests
from time import strftime, sleep

# Get the current month (lower-case full name, e.g. "june") and the current year.

month = strftime("%B").lower()
year = strftime("%Y")

# The site names each report after the lower-case month name and the year.
# Poll the URL once an hour, download the file as soon as it is published, then exit.
# Schedule this script with a cron job so that it starts once every month.
# Note: the directory was hard-coded as 2019 in the original URL; {year} is used here
# on the assumption that the directory name follows the report year.

url = f"http://www.pbs.gov.pk/sites/default/files//price_statistics/monthly_price_indices/nb/{year}/cpi_review_nb_{month}_{year}.pdf"

while True:
    r = requests.get(url, timeout=30)
    if r.status_code == 200:
        # The report exists: save it and stop polling.
        with open(f"{month}.pdf", 'wb') as f:
            f.write(r.content)
        print(f"File saved as {month}.pdf")
        break
    # Any other status means the report has not been published yet.
    print("File not released yet, checking again in one hour.")
    sleep(3600)

# Alternatively, store the href links currently on the page in a file.
# Each hour, scan the page source: if an href is not in the file yet, download it and append it to the file.
# If every href is already in the file, keep checking each hour.
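
The alternative described in the comments above (keep a file of hrefs already seen and download only the ones not in it) could look roughly like the sketch below. It is only an illustration under some assumptions: it uses the standard library (html.parser and urllib) rather than requests or Beautiful Soup, it assumes the reports are linked from one listing page through <a> tags whose href ends in .pdf, and the names PdfLinkParser, find_new_pdfs, mark_seen, check_once and the file seen_pdfs.txt are made up for this example.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin
from urllib.request import urlopen


class PdfLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .pdf file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.lower().endswith(".pdf"):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_new_pdfs(html, base_url, seen_file):
    """Return the PDF links in `html` that are not yet listed in `seen_file`."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    seen_path = Path(seen_file)
    seen = set(seen_path.read_text().splitlines()) if seen_path.exists() else set()
    new_links = []
    for link in parser.links:
        # Skip links recorded earlier and duplicates on the same page.
        if link not in seen and link not in new_links:
            new_links.append(link)
    return new_links


def mark_seen(links, seen_file):
    """Append links to the seen file, one per line."""
    with open(seen_file, "a") as f:
        for link in links:
            f.write(link + "\n")


def check_once(page_url, seen_file="seen_pdfs.txt"):
    """Fetch the listing page, download each new PDF, and record it as seen."""
    with urlopen(page_url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    new_links = find_new_pdfs(html, page_url, seen_file)
    for link in new_links:
        with urlopen(link, timeout=30) as resp:
            # Save under the last path segment of the URL, e.g. "report.pdf".
            Path(link.rsplit("/", 1)[-1]).write_bytes(resp.read())
        mark_seen([link], seen_file)
    return new_links
```

Calling check_once(page_url) from an hourly cron job (or from a loop with sleep(3600), as in the answer) returns only the links that appeared since the previous run. Each downloaded file could then be attached to an email with the standard-library smtplib and email.message modules.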

python

screen-scraping