How to scrape a website for multiple values that need to be ordered

Question

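I'm trying to scrape the results of NHL games with BeautifulSoup, but I can't figure out how to get the game dates and the results in order. The game dates are in h3 tags, and the results are in spans with class "field-content". Right now I can find both sets of values and put them in separate variables, but I'd like to keep them in the order they appear on the original website and put the data in a single variable.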

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen("https://www.jatkoaika.com/nhl/ottelut").read()

soup = bs.BeautifulSoup(sauce, features="html.parser")

dates = str(soup.find_all("h3"))
dates = dates.replace("<h3>", "").replace("</h3>", "")

games = str(soup.find_all("span", {"class": "field-content"}))
games = games.replace('<span class="field-content">', "").replace("</span>", "")

Answer 1

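The difficult part of parsing this site is that there is no hierarchy between the header elements and the games you want to parse. They are all content of the same element, so a game is not nested under its date.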

Use the CSS selector below to collect the h3 elements and the spans with class field-content into a single list, in document order:

games = soup.select("h3, span.field-content")

Output:

[<h3>Ma 28.10.2019 runkosarja</h3>,
 <span class="field-content">Chicago - Los Angeles</span>,
 <span class="field-content">NY Islanders - Philadelphia</span>,
 <span class="field-content">NY Rangers - Boston</span>,
 <span class="field-content">Ottawa - San Jose</span>,
 <span class="field-content">Vegas - Anaheim</span>,
 <h3>Ti 29.10.2019 runkosarja</h3>,
 ...
]
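To see why one combined selector keeps the order, here is a minimal sketch run against a small inline snippet instead of the live page. The markup in SAMPLE_HTML is an assumption made up for illustration (the real page's wrapper elements may differ); only the h3 tags and the field-content spans are taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: date headings and games are flat siblings, not nested.
SAMPLE_HTML = """
<div class="view-content">
  <h3>Ma 28.10.2019 runkosarja</h3>
  <div><span class="field-content">Chicago - Los Angeles</span></div>
  <div><span class="field-content">Vegas - Anaheim</span></div>
  <h3>Ti 29.10.2019 runkosarja</h3>
  <div><span class="field-content">Buffalo - Arizona</span></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A comma-separated selector matches either pattern; results come back
# in the order the elements appear in the document.
elements = soup.select("h3, span.field-content")

# Keep the tag name next to the text so headings and games stay distinguishable.
ordered = [(e.name, e.get_text(strip=True)) for e in elements]
print(ordered)
```

Two separate find_all calls would return all headings first and all games second, which loses the information about which game belongs to which date.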

Now you can group the games by date with the following code:

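from collections import defaultdict

# Map each date heading to the list of games that follow it on the page.
dates_with_games = defaultdict(list)
latest_h3 = None  # most recent date heading seen so far
for e in games:
    if e.name == "h3":
        latest_h3 = e.text
    elif latest_h3 is not None:  # skip any span that appears before the first heading
        dates_with_games[latest_h3].append(e.text)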

The resulting dictionary looks like this:

 {'Ma 28.10.2019 runkosarja': 
  ['Chicago - Los Angeles',
   'NY Islanders - Philadelphia',
   'NY Rangers - Boston',
   'Ottawa - San Jose',
   'Vegas - Anaheim'],
  'Ti 29.10.2019 runkosarja': 
    ['Buffalo - Arizona',
     'Vancouver - Florida'],...
 }
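If a single flat, ordered variable is preferred over a dictionary, the grouped result can be unrolled into (date, game) pairs. This is only a sketch: the sample dictionary below is a shortened copy of the output shown above, and the name rows is made up for illustration.

```python
# Shortened sample in the same shape as the dictionary shown above.
dates_with_games = {
    "Ma 28.10.2019 runkosarja": ["Chicago - Los Angeles", "Vegas - Anaheim"],
    "Ti 29.10.2019 runkosarja": ["Buffalo - Arizona", "Vancouver - Florida"],
}

# Dicts keep insertion order (Python 3.7+), so the pairs follow page order.
rows = [
    (date, game)
    for date, games_on_date in dates_with_games.items()
    for game in games_on_date
]

for date, game in rows:
    print(f"{date}: {game}")
```

A list of pairs like this is easy to write to CSV or load into a table later, while the dictionary form is handier for looking up all games on one date.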

python

beautifulsoup

scrape