Search for class element underneath another element

Question

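I collect daily lineups and need to find out whether any team has not posted its lineup. When that happens, the page contains an element with the class lineup__no. I want to go through each team and check whether its lineup has been posted; if not, I want to add that team's index to a list. For example, if 4 teams are playing and the first and third teams have not posted lineups, I want to return the list [0, 2]. I'm guessing some kind of list comprehension might get me there, but I'm having trouble coming up with what I need. So far I have tried a for loop to get each item under the main header. I have also tried adding the text of each li item to a list and searching for "Unknown Lineup", without success.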

from selenium import webdriver

from selenium.common.exceptions import NoSuchElementException

from bs4 import BeautifulSoup
import requests
import pandas as pd

#Scraping lineups for updates
url = 'https://www.rotowire.com/baseball/daily-lineups.php'

##Requests rotowire HTML
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

games = soup.select('.lineup.is-mlb')
for game in games:
    initial_list = game.find_all('li')
    print(initial_list)

Answer 1

I'm more familiar with Selenium, so here is a Selenium solution.
See the comments in the code for my explanations.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

driver = webdriver.Chrome()
driver.maximize_window()
wait = WebDriverWait(driver, 20)
driver.get("https://www.rotowire.com/baseball/daily-lineups.php")
#wait for at least 1 game element to be visible
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".lineup.is-mlb")))
#add a short delay so that all the other games are loaded
time.sleep(0.5)
#get all the games blocks
games = driver.find_elements(By.CSS_SELECTOR,".lineup.is-mlb")
#iterate over the games with their indexes in a list comprehension;
#for each game that contains a lineup__no item, add the indexes of both of its teams
no_lineup = [j for idx, game in enumerate(games) for j in [idx*2, idx*2+1] if game.find_elements(By.XPATH, ".//li[@class='lineup__no']")]


#print the collected results
print(no_lineup)
#quit the driver
driver.quit()
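
The comprehension above works per game, so both team indexes are added whenever a game block contains a lineup__no item. If you need one result per team, as in the [0, 2] example from the question, one option is to record a posted/not-posted flag for each side of each game and flatten those into team indexes. The sketch below shows only that index arithmetic on hypothetical sample data; the function name teams_without_lineup and the (visiting, home) tuple layout are assumptions for illustration, not something taken from the page.

```python
def teams_without_lineup(games):
    """Return flat team indexes for teams whose lineup is not posted.

    `games` is a list with one (visiting_posted, home_posted) pair of
    booleans per game. This layout is an assumption for illustration.
    """
    return [
        game_idx * 2 + side              # team index: 2 teams per game
        for game_idx, posted_pair in enumerate(games)
        for side, posted in enumerate(posted_pair)
        if not posted                    # keep only teams with no lineup
    ]


# Hypothetical sample: 2 games (4 teams); the 1st and 3rd teams have not posted.
sample = [(False, True), (False, True)]
print(teams_without_lineup(sample))
```

With Selenium, the booleans would come from checking each team's own list element inside the game block rather than the whole block.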

Answer 2

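Just look at the <li> tags with class="lineup__status", then use enumerate to keep track of the index as you iterate. I don't have an example where some teams have posted lineups (I'll have to check back later), so I may change the logic of if lineupStatus.text.strip() == 'Unknown Lineup' to make it more robust. Until I can see exactly what the html looks like at that point, I have to assume that the "lineup__no" class is always present. As I said, once I see what some posted lineups look like on this page, I'll adjust it.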

By the way,

The Guardians lineup has not been posted yet.

threw me for a second there... I had completely forgotten!

from bs4 import BeautifulSoup
import requests
import re

#Scraping lineups for updates
url = 'https://www.rotowire.com/baseball/daily-lineups.php'

##Requests rotowire HTML
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

lineupStatuses = soup.find_all('li', {'class':re.compile('^lineup__status')})


noLineupIndex = []
for idx, lineupStatus in enumerate(lineupStatuses):
    if 'is-confirmed' not in lineupStatus['class']:
        noLineupIndex.append(idx)
        
# Or use list comprehension        
#noLineupIndex = [idx for idx, lineupStatus in enumerate(lineupStatuses) if 'is-confirmed' not in lineupStatus['class']]

Output:

print(noLineupIndex)
[0, 3, 6, 7, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]
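
The check 'is-confirmed' not in lineupStatus['class'] works because BeautifulSoup returns the class attribute as a list of separate class names, so the test matches whole names rather than substrings. Here is a minimal sketch of the same enumerate logic on plain lists, so it can be tried without fetching the page. The sample class lists and the function name unconfirmed_indexes are hypothetical.

```python
def unconfirmed_indexes(class_lists):
    """Return the indexes of entries whose class list lacks 'is-confirmed'.

    `class_lists` mimics what tag['class'] gives for each status <li>:
    a list of class-name strings per element.
    """
    return [
        idx
        for idx, classes in enumerate(class_lists)
        if "is-confirmed" not in classes
    ]


# Hypothetical sample data standing in for four status elements.
sample = [
    ["lineup__status"],
    ["lineup__status", "is-confirmed"],
    ["lineup__status", "is-confirmed"],
    ["lineup__status"],
]
print(unconfirmed_indexes(sample))
```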


python

selenium

xpath

beautifulsoup

web-scraping