Cannot find the table data within the soup, but I know it's there
I'm trying to create a function that scrapes college baseball team roster pages for a project. I wrote a function that scrapes the roster page and collects the list of links I want to follow, but when I then try to scrape the individual player links, the requests work yet no data is found on their pages.
Here is a link to the page I start scraping from:
https://gvsulakers.com/sports/baseball/roster
These are just the helper functions I call inside the function I'm having trouble with:
def parse_row(rows):
    return [str(x.string) for x in rows.find_all('td')]

def scrape(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop
def find_data(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    row = soop.find_all('tr')
    lopr = [parse_row(rows) for rows in row]
    return lopr
Here is where I run into trouble. When I assign a variable to type1_roster and print it, I just get an empty list. Ideally it should contain data for one or more players from the roster page. (A quick diagnostic is shown after the code below.)
# Roster page crawler
def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    href_tags = soop.find_all(href=True)
    hrefs = [tag.get('href') for tag in href_tags]
    # get all player links
    player_hrefs = []
    for href in hrefs:
        if 'sports/baseball/roster' in href:
            if 'sports/baseball/roster/coaches' not in href:
                if 'https:' not in href:
                    player_hrefs.append(href)
    # get rid of duplicates
    player_links = list(set(player_hrefs))
    # scrape the roster links
    for link in player_links:
        player_ = url + link[24:]
        return find_data(player_)
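For reference, a quick way to localize this kind of failure is to print the intermediate values before parsing anything. A minimal diagnostic sketch follows; the href is a hypothetical example of the link shape, and headers is an assumed stand-in for the globals used in the question:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; defined elsewhere in the question

url = 'https://gvsulakers.com/sports/baseball/roster'
link = '/sports/baseball/roster/players/example-player'  # hypothetical href
player_ = url + link[24:]
print(player_)  # shows whether the built URL is what you expect

page = requests.get(player_, headers=headers)
print(page.status_code)           # confirms the request itself succeeded
soup = BeautifulSoup(page.text, 'lxml')
print(len(soup.find_all('tr')))   # 0 means there are no table rows to parse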
A few things:

- I would make headers a global.
- You slice the link one character too late in player_, I think (see the short check after this list).
- You need to rework the logic of find_data(), as the data lives in a mix of element types rather than in table/tr/td elements, e.g. it is found in spans. The html attributes are nice and descriptive and will support targeting content easily.
- You can target the player links from the landing page more tightly with the css selector list shown below. This removes the need for several loops, as well as the use of list(set()).
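To make the slicing point concrete: the path '/sports/baseball/roster' is 23 characters long, so slicing at index 24 also drops the leading slash of whatever follows it. A short check, where the href is a hypothetical example of the link shape:

url = 'https://gvsulakers.com/sports/baseball/roster'
link = '/sports/baseball/roster/players/example-player'  # hypothetical href

print(len('/sports/baseball/roster'))  # 23
print(url + link[24:])  # ...rosterplayers/example-player -> slash lost
print(url + link[23:])  # ...roster/players/example-player -> correct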
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}

def scrape(url):
    page = requests.get(url, headers=HEADERS)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return soop

def find_data(url):
    page = requests.get(url, headers=HEADERS)
    # print(page)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    # re-think logic here to return desired data e.g.
    # soop.select_one('.sidearm-roster-player-jersey-number').text
    first_name = soop.select_one('.sidearm-roster-player-first-name').text
    # soop.select_one('.sidearm-roster-player-last-name').text
    # need targeted string cleaning possibly
    bio = soop.select_one('#sidearm-roster-player-bio').get_text('')
    return (first_name, bio)

def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # scrape the roster links
    for link in player_links:
        player_ = url + link[23:]
        # print(player_)
        return find_data(player_)

print(type1_roster('gvsulakers'))
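Note that type1_roster() returns from inside the loop, so only the first player link is actually scraped. If you want every player, one option is to collect the results instead; a small sketch reusing the functions above:

def type1_roster_all(team_id):
    url = 'https://' + team_id + '.com/sports/baseball/roster'
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # build a list of (first_name, bio) tuples, one per player
    return [find_data(url + link[23:]) for link in player_links]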