Trouble finding links in very large string
I'm scraping Baseball Reference for a data science project and ran into a problem while trying to scrape player data from a particular league — one whose season has only just started. When I scrape older leagues whose seasons have already finished, I have no problems. But I want this link to stay live as the season progresses: https://www.baseball-reference.com/register/league.cgi?id=c346199a. However, the links are hidden behind what looks like plain text, so BeautifulSoup.find_all('a', href=True) doesn't work.
Here is my thought process so far.
html = BeautifulSoup(requests.get('https://www.baseball-reference.com/register/league.cgi?id=c346199a').text, features='html.parser').find_all('div')
ind = [str(div) for div in html][0]
orig_ind = ind[ind.find('/register/team.cgi?id='):]
count = orig_ind.count('/register/team.cgi?id=')
team_links = []
for i in range(count):
    # rn finds the same one over and over
    link = orig_ind[orig_ind.find('/register/team.cgi?id='):orig_ind.find('title')].strip().replace('"', '')
    # try to remove it from orig_ind and do the next link...
    # this is the part that is not working rn
    orig_ind = orig_ind.replace(link, '')
    team_links.append('https://baseball-reference.com' + link)
Output:
['https://baseball-reference.com/register/team.cgi?id=71fe19cd',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
'https://baseball-reference.com',
and so on. I'm trying to get links to all of the teams from this page: https://www.baseball-reference.com/register/league.cgi?id=c346199a
Then I want to grab the player links on each of those pages and collect some data. Like I said, it works for almost every league I've tried, except this one.
Any help is greatly appreciated.
The tables you see on this site are stored inside HTML comments (<!-- ... -->), so BeautifulSoup doesn't see them by default. To parse them, try the following example:
import requests
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    requests.get(
        "https://www.baseball-reference.com/register/league.cgi?id=c346199a"
    ).text,
    features="html.parser",
)

# pull out the commented-out markup that holds the tables and re-parse it
s = "".join(c for c in soup.find_all(text=Comment) if "table_container" in c)
soup = BeautifulSoup(s, "html.parser")

for a in soup.select('[href*="/register/team.cgi?id="]'):
    print("{:<30} {}".format(a.text, a["href"]))
Prints:
Battle Creek Bombers /register/team.cgi?id=f3c4b615
Kenosha Kingfish /register/team.cgi?id=71fe19cd
Kokomo Jackrabbits /register/team.cgi?id=8f1a41fc
Rockford Rivets /register/team.cgi?id=9f4fe2ef
Traverse City Pit Spitters /register/team.cgi?id=7bc8d111
Kalamazoo Growlers /register/team.cgi?id=9995d2a1
Fond du Lac Dock Spiders /register/team.cgi?id=02911efc
...and so on.
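From there, to visit each team page as you intended, you can turn the relative hrefs into absolute URLs. A minimal sketch using urllib.parse.urljoin with a couple of the hrefs printed above (this also avoids the manual string concatenation from your original attempt, which dropped the "www." subdomain):

```python
from urllib.parse import urljoin

BASE = "https://www.baseball-reference.com"

# relative team hrefs as extracted from the commented-out tables
hrefs = [
    "/register/team.cgi?id=f3c4b615",
    "/register/team.cgi?id=71fe19cd",
]

# urljoin resolves each href against the base URL, handling the
# leading slash correctly
team_links = [urljoin(BASE, h) for h in hrefs]
print(team_links)
# ['https://www.baseball-reference.com/register/team.cgi?id=f3c4b615',
#  'https://www.baseball-reference.com/register/team.cgi?id=71fe19cd']
```

Each of those team pages hides its tables in HTML comments the same way, so you'd apply the same Comment-extraction step before looking for the player links.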