Beautifulsoup4 - Trying to get the data using a while loop

Question

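I have written a very basic crawler using bs4 to check whether a link is dead. I want to check whether the anchor tag has an href (so that I can tell whether the link is active). There is only one anchor element on the page.

Here is the code: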

import requests
from bs4 import BeautifulSoup

def check():
    url = 'https://somewebsite.net/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text,'html.parser')
    for a in soup.findAll('a'):
        href = a.get('href')
        if href != '':
            print('a')
        else:
            print('b')
check()

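This works fine, but I want the crawler to check the website every few seconds. I tried to implement that with an infinite while loop, but I get no results.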

while True:
    check()

I would like to know why this does not work, and any possible solutions.
Thanks.

Answer 1

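It is hard to say without knowing which website you are checking the tag on, but at least from the code's point of view you would want to use has_attr if your objective is to see whether the anchor has an href attribute.

Also, you probably meant for the if statement to be part of the for loop.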

import requests
from bs4 import BeautifulSoup

def check():
    url = 'https://somewebsite.net/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text,'html.parser')
    for a in soup.find_all('a'):
        if a.has_attr('href'):
            print('b')  # the anchor has an href attribute
        else:
            print('a')  # the anchor has no href attribute
check()

You may also want to sleep for a few seconds between checks instead of calling it continuously.

import time
...
while True:
    time.sleep(5) # sleep for 5 seconds
    check()
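
The same idea can be wrapped in a small helper that makes the interval and the number of rounds explicit. This is only a sketch: poll, interval, and max_rounds are hypothetical names that do not appear in the question, and the lambda in the usage part is a stand-in for the real check() so that nothing here touches the network.

```python
import time


def poll(check, interval=5.0, max_rounds=None):
    """Yield the result of check() repeatedly, pausing between calls.

    Hypothetical helper: max_rounds=None loops forever, like `while True`.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        if rounds:
            # Pause between calls, but not before the first one.
            time.sleep(interval)
        yield check()
        rounds += 1


if __name__ == "__main__":
    # Stand-in for the real check(); it makes no request.
    for result in poll(lambda: "checked", interval=0.1, max_rounds=3):
        print(result)
```

Bounding the loop with max_rounds is mainly useful while testing; leaving it at None gives the same endless behaviour as the while True loop above.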

Answer 2

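a.get('href') returns None, not an empty string, when the anchor has no href attribute, so the comparison href != '' is True for such an anchor and your else branch is never reached for it.

If there is only one anchor, just return soup.find("a", href=True): you get the anchor if it has an href, and None if it does not: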

import requests
from bs4 import BeautifulSoup
from time import sleep

def check():
    url = 'https://somewebsite.net/'
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text,'html.parser')
    return soup.find("a", href=True)


while True:
    a = check()
    if a:
        # do whatever
        pass
    sleep(10)
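
To see the difference without hitting a real site, the behaviour can be tried on inline HTML. This is a minimal sketch: the two HTML snippets and the first_link helper are made up for illustration and are not part of the original code.

```python
from bs4 import BeautifulSoup

# Two made-up snippets: one anchor without an href, one with.
NO_HREF = '<p><a name="top">no link</a></p>'
WITH_HREF = '<p><a href="https://somewebsite.net/page">link</a></p>'


def first_link(html):
    """Return the first <a> that has an href attribute, or None."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("a", href=True)


anchor = BeautifulSoup(NO_HREF, "html.parser").find("a")
print(anchor.get("href"))             # None, not ''
print(anchor.get("href") != "")       # True, so the original `if` branch runs
print(first_link(NO_HREF))            # None
print(first_link(WITH_HREF)["href"])  # https://somewebsite.net/page
```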
