Beautiful Soup Web Scraper on Binance Announcement Page Lags Behind by 5 Minutes

Question

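I built a web scraper with bs4 so that I get notified when a new announcement is posted. For now I am testing it with the word 'list' instead of all of the announcement keywords. For some reason, when I compare the time at which the scraper determines that a new announcement has been posted with the time it was actually posted on the website, the two are 5 minutes apart.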

from bs4 import BeautifulSoup
from requests import get
import time
import sys

x = True
while x == True:
    time.sleep(30)
    # Data for the scraping
    url = "https://www.binance.com/en/support/announcement"
    response = get(url)
    html_page = response.content
    soup = BeautifulSoup(html_page, 'html.parser')
    news_list = soup.find_all(class_ = 'css-qinc3w')

    # Create a bag of key words for getting matches
    key_words = ['list', 'token sale', 'open trading', 'opens trading', 'perpetual', 'defi', 'uniswap', 'airdrop', 'adds', 'updates', 'enabled', 'trade', 'support']

    # Empty list
    updated_list = []

    for news in news_list:
        article_text = news.text

        if ("list" in article_text.lower()):
            updated_list.append([article_text])

        if len(updated_list) > 4:
            print(time.asctime( time.localtime(time.time()) ))
            print(article_text)
            sys.exit()

When the length of the list grows by 1, to 5, the script prints the following time along with the new announcement: Fri May 28 04:17:39 2021, Binance Will List Livepeer (LPT)

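I'm not sure why this happens. At first I thought I was being throttled, but after looking at robots.txt again I don't see any reason why I should be. I also included a 30-second sleep, which should be more than enough to scrape the page without problems. Any help or alternative solution would be much appreciated.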

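My question is:

Why is it 5 minutes late? Why doesn't it notify me as soon as the post goes live on the website? Compared with the time the post appears on the website, the program takes 5 minutes to recognize it.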
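For reference, the detection step in the loop above (waiting for the match count to exceed 4) could also be written as a comparison against the titles already seen, which does not depend on how many matches the page happens to show. This is only a minimal sketch: `find_new_matches` is a hypothetical helper, and it assumes the announcement titles have already been extracted as plain strings. It changes how a new post is recognized; it does not by itself explain or remove the 5-minute lag.

```python
def find_new_matches(titles, seen, key_words):
    """Return titles that were not seen before and contain any keyword.

    `seen` is a set that is updated in place, so the same set should be
    passed in on every poll. Matching is case-insensitive.
    """
    new_matches = []
    for title in titles:
        lowered = title.lower()
        if title not in seen and any(word in lowered for word in key_words):
            new_matches.append(title)
        seen.add(title)  # remember every title, matching or not
    return new_matches
```

The first call reports every matching title currently on the page, so it would typically be used once to prime `seen` before the polling loop starts notifying.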

Answer 1

from xrzz import http  # give it a try using my simple from-scratch module
import json

url = "https://www.binance.com/bapi/composite/v1/public/cms/article/list/query?type=1&pageNo=1&pageSize=30"

req = http("GET", url=url, tls=True).body().decode()

key_words = ['list', 'token sale', 'open trading', 'opens trading', 'perpetual', 'defi', 'uniswap', 'airdrop', 'adds', 'updates', 'enabled', 'trade', 'support']

for i in json.loads(req)['data']['catalogs']:
    for o in i['articles']:
        # compare in lower case so that 'list' also matches 'List'
        if key_words[0] in o['title'].lower():
            print(o['title'])

Output:

Answer 2

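I think the problem is either that the Cloudflare server is caching the document, or that Binance's programmers did this on purpose so that a small group of people can react to the news faster than everyone else.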

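This is a big problem if you want fresh data. If you look at the HTTP headers, you will notice that the Date: header is cached by the server, which means the entire content of the document is cached. If I add or remove the gzip header (accept-encoding: gzip, deflate), I manage to get 2 different Date: values.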
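The header check described above can be sketched roughly as follows. This is an illustration under assumptions rather than the answerer's actual procedure: it uses the standard library's `urllib` instead of `requests` to stay dependency-free, the helper names `fetch_date_header` and `age_seconds` are hypothetical, and the browser-like User-Agent is a guess at what the server will accept.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def fetch_date_header(url, accept_encoding=None):
    """Fetch `url` and return the raw Date response header (or None)."""
    headers = {"User-Agent": "Mozilla/5.0"}  # assumption: browser-like UA
    if accept_encoding is not None:
        headers["Accept-Encoding"] = accept_encoding
    with urlopen(Request(url, headers=headers), timeout=10) as response:
        return response.headers.get("Date")


def age_seconds(date_header, now=None):
    """Seconds elapsed between an HTTP Date header and `now` (UTC)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - parsedate_to_datetime(date_header)).total_seconds()
```

Calling `fetch_date_header(url)` and `fetch_date_header(url, "gzip, deflate")` back to back and passing each result to `age_seconds` would show whether the two variants return different Date: values. With an accurate local clock, an age well beyond a few seconds would be consistent with a cached copy being served.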

I am using this page:

https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15

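If you change the pageSize parameter, you can get a new cached response from the server. But this still does not solve the 5-minute delay. I still see the old page.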
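Varying pageSize as described above might look like the sketch below. The helper names and the particular page sizes are assumptions made for illustration, and the server may not accept every value. As noted, this yields a different cache entry but is not reported to remove the 5-minute delay.

```python
from itertools import cycle
from urllib.parse import urlencode

BASE_URL = (
    "https://www.binance.com/bapi/composite/v1/public/cms/"
    "article/catalog/list/query"
)


def build_query_url(page_size, catalog_id=48, page_no=1):
    """Build the announcement query URL for the given page size."""
    query = urlencode(
        {"catalogId": catalog_id, "pageNo": page_no, "pageSize": page_size}
    )
    return f"{BASE_URL}?{query}"


def rotating_urls(page_sizes=(15, 16, 17, 18)):
    """Yield URLs forever, cycling through the (assumed) page sizes."""
    for size in cycle(page_sizes):
        yield build_query_url(size)
```

Each poll would take `next()` from a single `rotating_urls()` generator, so that consecutive requests differ in their query string.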

Your link, https://www.binance.com/bapi/composite/v1/public/cms/article/list/query?type=1&pageNo=1&pageSize=30, like mine, https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15, is also cached for 5 seconds. I guess it will have the 5-minute delay as well. I haven't found a solution to this problem yet.

beautifulsoup

web-scraping

python-3.x

binance