Cannot find the table data within the soup, but I know it's there

I'm trying to build a scraper for college baseball team roster pages for a project. I wrote a function that scrapes the roster page and collects the list of links I want to follow. But when I try to scrape the individual player links, the request works, yet it can't find any of the data on their pages.

Here is the link to the page where I start scraping:

https://gvsulakers.com/sports/baseball/roster

These are just the functions I call inside the function I'm having trouble with:

def parse_row(rows):
    return [str(x.string) for x in rows.find_all('td')]

def scrape(url):
  page = requests.get(url, headers = headers)
  html = page.text
  soop = BeautifulSoup(html, 'lxml')
  return(soop)

def find_data(url):
  page = requests.get(url, headers = headers)
  html = page.text
  soop = BeautifulSoup(html, 'lxml')
  row = soop.find_all('tr')
  lopr = [parse_row(rows) for rows in row]
  return(lopr)

Here is where I run into the problem. When I assign type1_roster to a variable and print it, I only get an empty list. Ideally, it should contain data for one or more players from the player roster pages.

# Roster page crawler
def type1_roster(team_id):
  url = "https://" + team_id + ".com/sports/baseball/roster"
  soop = scrape(url)
  href_tags = soop.find_all(href = True)
  hrefs = [tag.get('href') for tag in href_tags]
  # get all player links
  player_hrefs = []
  for href in hrefs:
    if 'sports/baseball/roster' in href:
      if 'sports/baseball/roster/coaches' not in href:
        if 'https:' not in href:
          player_hrefs.append(href)
  # get rid of duplicates
  player_links = list(set(player_hrefs))
  # scrape the roster links
  for link in player_links:
    player_ = url + link[24:]
    return(find_data(player_))

A few things:

  1. I would make headers a global
  2. You are slicing the player_ link one character too late, I think
  3. You need to rework the logic of find_data(), as the data lives in a mix of element types rather than in table/tr/td elements, e.g. it is found in spans. The html attributes are nice and descriptive and will support targeting the content easily
  4. You can target the player links from the landing page more tightly with the css selector shown below. This removes the need for the multiple loops and the use of list(set())

import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}


def scrape(url):
    page = requests.get(url, headers=HEADERS)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return(soop)


def find_data(url):
    page = requests.get(url, headers=HEADERS)
    #print(page)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    # re-think logic here to return desired data e.g.
    # soop.select_one('.sidearm-roster-player-jersey-number').text
    first_name = soop.select_one('.sidearm-roster-player-first-name').text
    # soop.select_one('.sidearm-roster-player-last-name').text
    # need targeted string cleaning possibly
    bio = soop.select_one('#sidearm-roster-player-bio').get_text('')
    return (first_name, bio)


def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # scrape the roster links
    for link in player_links:
        player_ = url + link[23:]
        # print(player_)
        # note: returning inside the loop exits after the first player;
        # accumulate into a list instead if you want every player
        return(find_data(player_))


print(type1_roster('gvsulakers'))
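To make point 3 concrete without hitting the network, here is a minimal, self-contained sketch of the class-based extraction. The HTML fragment below is invented for illustration and only mimics the shape of the Sidearm roster markup (the real page may differ); the class names are the ones used in the code above:

```python
from bs4 import BeautifulSoup

# Invented fragment shaped like a Sidearm roster player card.
fragment = """
<div class="sidearm-roster-player">
  <span class="sidearm-roster-player-jersey-number">12</span>
  <span class="sidearm-roster-player-first-name">Sam</span>
  <span class="sidearm-roster-player-last-name">Smith</span>
</div>
"""

soup = BeautifulSoup(fragment, 'html.parser')
# The data sits in class-tagged spans, not tr/td cells, so
# soup.find_all('tr') returns nothing; CSS class selectors work.
number = soup.select_one('.sidearm-roster-player-jersey-number').text
first = soup.select_one('.sidearm-roster-player-first-name').text
last = soup.select_one('.sidearm-roster-player-last-name').text
print(number, first, last)  # -> 12 Sam Smith
```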