Basic question about parsing HTML using bs4 in Python

Question

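I have a probably simple question about bs4 that I can't seem to figure out.

For context, I'm self-taught and am learning Python by solving problems with it.

One part of a larger project I'm working on requires me to scrape a website for the latest 1-month T-bill rate. I have 99% of it working, but I'm stuck on one aspect.

The data is only updated Monday through Friday, so running this code at, say, 8 a.m. before the site has updated, or on a weekend, returns an error. When I use a date that has already been published, I get exactly the data I need.

So I set the variables d1, d2, and d3 to today, yesterday, and two days ago. I want my soup.find to search for today, and if that returns None, search for yesterday, and then for two days ago.

In my code, if I use text=d3, for example, I do get a value back.

Here is what I have right now. Any help is much appreciated!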

from bs4 import BeautifulSoup
import requests
from datetime import date
import datetime

today = date.today()
d1 = today.strftime("%B %d, %Y")
ndays1 = datetime.timedelta(days = 1)
d2 = (today-ndays1).strftime("%B %d, %Y")
ndays2 = datetime.timedelta(days = 2)
d3 = (today-ndays2).strftime("%B %d, %Y")
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',  # Do Not Track Request Header
    'Connection': 'close'
}

url_rfr = "https://ycharts.com/indicators/1_month_treasury_rate"

response = requests.get(url_rfr, headers=headers, timeout=5).text
soup = BeautifulSoup(response, 'html.parser')

div = soup.find("td", text=d1 or d2 or d3).find_next_sibling("td").text.strip()

r = (float(div[:-1]))

print(r)
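
A note on the soup.find line above: Python evaluates `d1 or d2 or d3` before `find` is ever called, and `or` returns its first truthy operand. Since any non-empty string is truthy, the expression always collapses to `d1`, so yesterday and two days ago are never searched. The short sketch below illustrates this with a fixed example date (chosen here only so the output is repeatable; it is not part of the original code).

```python
from datetime import date, timedelta

# Fixed example date (a Sunday) instead of date.today(), for repeatable output.
today = date(2022, 2, 13)
d1, d2, d3 = [(today - timedelta(days=n)).strftime("%B %d, %Y") for n in range(3)]

# `or` returns the first truthy operand; non-empty strings are always truthy,
# so this is simply d1 and the other two dates are ignored.
chosen = d1 or d2 or d3

print(chosen)        # February 13, 2022
print(chosen is d1)  # True
```

When `find` cannot locate a cell with today's date it returns None, and the chained `.find_next_sibling(...)` call then raises an AttributeError, which is consistent with the error described for weekends and early mornings.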

Answer 1

I changed the text argument in find(...) to "Last Value", and also added a latest_period lookup for completeness.

import datetime
from datetime import date

import requests
from bs4 import BeautifulSoup

today = date.today()
d1 = today.strftime("%B %d, %Y")
ndays1 = datetime.timedelta(days=1)
d2 = (today - ndays1).strftime("%B %d, %Y")
ndays2 = datetime.timedelta(days=2)
d3 = (today - ndays2).strftime("%B %d, %Y")

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'DNT': '1',  # Do Not Track Request Header
    'Connection': 'close'
}

url_rfr = "https://ycharts.com/indicators/1_month_treasury_rate"

response = requests.get(url_rfr, headers=headers, timeout=5).text

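soup = BeautifulSoup(response, 'html.parser')
# Look up the cells by their label rather than by date, so the result does
# not depend on whether today's value has been published yet.
latest_period = soup.find("td", text="Latest Period").find_next_sibling("td").text.strip()
value = soup.find("td", text="Last Value").find_next_sibling("td").text.strip()

# Drop the trailing character (presumably the percent sign) before converting.
val = (float(value[:-1]))

print(latest_period, val)  # Feb 11 2022 0.03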
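
If you would still rather search by date, as in the question, one possible approach is to try each candidate date in turn and stop at the first match. The following is only a minimal sketch: `SAMPLE_HTML` is made-up markup standing in for the real page, `latest_rate` and `max_days_back` are hypothetical names, and it assumes the date cells use the same "%B %d, %Y" format as the question's code, which may not hold for every table on the site. It uses `string=`, the newer name for the `text=` argument in Beautiful Soup 4.4.0 and later.

```python
from datetime import date, timedelta

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page; the actual HTML may differ.
SAMPLE_HTML = """
<table>
  <tr><td>February 11, 2022</td><td>0.03%</td></tr>
  <tr><td>February 10, 2022</td><td>0.04%</td></tr>
</table>
"""


def latest_rate(soup, today, max_days_back=7):
    """Return (date_label, rate) for the most recent date found, else None."""
    for n in range(max_days_back + 1):
        label = (today - timedelta(days=n)).strftime("%B %d, %Y")
        cell = soup.find("td", string=label)
        if cell is None:
            continue  # nothing published for this date; try the day before
        value_cell = cell.find_next_sibling("td")
        if value_cell is None:
            continue
        # Strip whitespace and the trailing "%" before converting to float.
        return label, float(value_cell.get_text(strip=True).rstrip("%"))
    return None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
result = latest_rate(soup, date(2022, 2, 13))  # a Sunday
print(result)  # ('February 11, 2022', 0.03)
```

Looping over a range of days, rather than hard-coding d1, d2, and d3, also covers long weekends and holidays, and returning None lets the caller decide what to do when nothing is found instead of failing with an AttributeError.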
