BeautifulSoup Not Finding All 'th'
I am trying to scrape a stats site with BeautifulSoup in Python 3.7. I want to use all of the table's headers as my column headers, but for some reason BeautifulSoup is not picking up every header inside the 'th' tags.
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats'
html = urlopen(url)
scraper = BeautifulSoup(html, 'html.parser')
# Find the column headers: every <th> inside the first <tr> on the page.
column_headers = [th.getText() for th in scraper.findAll('tr', limit=1)[0].findAll('th')]
print(column_headers)
Here is the output I get:
['#', 'Player', 'GP', 'G', 'A', 'TP']
Here is the output I should be getting:
['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']
For reference, here is what the table's source HTML looks like:
<table class="table table-striped table-sortable skater-stats highlight-stats" data-sort-url="https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats" data-sort-ajax-container="#players" data-sort-ajax-url="https://www.eliteprospects.com/ajax/team.player-stats?teamId=552&season=2005-2006&position=">
<thead style="background-color: #fff">
<tr style="background-color: #fff">
<th class="position">#</th>
<th class="player sorted" data-sort="player">Player<i class="fa fa-caret-down"></i></th>
<th class="gp" data-sort="gp">GP</th>
<th class="g" data-sort="g">G</th>
<th class="a" data-sort="a">A</th>
<th class="tp" data-sort="tp">TP</th>
<th class="pim" data-sort="pim">PIM</th>
<th class="pm" data-sort="pm">+/-</th>
<th class="separator"> </th>
<th class="playoffs gp" data-sort="playoffs-gp">GP</th>
<th class="playoffs g" data-sort="playoffs-g">G</th>
<th class="playoffs a" data-sort="playoffs-a">A</th>
<th class="playoffs tp" data-sort="playoffs-tp">TP</th>
<th class="playoffs pim" data-sort="playoffs-pim">PIM</th>
<th class="playoffs pm" data-sort="playoffs-pm">+/-</th>
</tr>
</thead>
<tbody>
Any help would be appreciated!
Looking at the source of the page you are scraping, this is exactly what the data looks like:
<div class="table-wizard">
<table class="table table-striped">
<thead>
<tr>
<th class="position">#</th>
<th class="player">Player</th>
<th class="gp">GP</th>
<th class="g">G</th>
<th class="a">A</th>
<th class="sorted tp">TP</th>
</tr>
</thead>
<tbody>
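That is why this is the only data you get. It is not even a case of JavaScript changing the markup after the fact. If I run querySelector in the browser console, I get the same result: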
> document.querySelector('tr')
> <tr>
<th class="position">#</th>
<th class="player">Player</th>
<th class="gp">GP</th>
<th class="g">G</th>
<th class="a">A</th>
<th class="sorted tp">TP</th>
</tr>
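In short, Beautiful Soup is giving you all the th tags in the first tr tag on the page. That tr simply belongs to a different, smaller table than the one you expected.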
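To make the point concrete, here is a minimal sketch that runs without hitting the site. The SAMPLE_HTML string is a hypothetical, trimmed-down stand-in for the real page (a small summary table followed by the full stats table), and selecting by the skater-stats class is my assumption based on the markup quoted in the question, not something verified against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: a short summary table comes first,
# then the full stats table the question is actually after.
SAMPLE_HTML = """
<div class="table-wizard">
  <table class="table table-striped">
    <thead><tr><th>#</th><th>Player</th><th>GP</th></tr></thead>
  </table>
</div>
<table class="table table-striped table-sortable skater-stats highlight-stats">
  <thead><tr><th>#</th><th>Player</th><th>GP</th><th>PIM</th><th>+/-</th></tr></thead>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The first <tr> in document order belongs to the summary table.
first_row = soup.find("tr")
first_headers = [th.get_text() for th in first_row.find_all("th")]

# Selecting the table by one of its classes reaches the intended header row.
stats_table = soup.select_one("table.skater-stats")
stats_headers = [th.get_text() for th in stats_table.select("thead th")]

print(first_headers)  # ['#', 'Player', 'GP']
print(stats_headers)  # ['#', 'Player', 'GP', 'PIM', '+/-']
```

Targeting the table by class is usually less brittle than counting rows, since it keeps working if another table is added above the one you want.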
If you instead use the CSS selector tr:has(th) to grab the second tr tag that contains th tags, you will see that you get more th tags:
column_headers = [th.getText() for th in scraper.select('tr:has(th)', limit=2)[1].findAll('th')]
Output:
['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', '\xa0', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']
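Note the extra '\xa0' entry: it comes from the empty separator column, whose cell holds a non-breaking space. A small sketch of dropping it so the list matches the output the question asked for; raw_headers is just the list printed above, copied in by hand so the snippet runs on its own.

```python
# The header list produced by the selector above, copied in by hand.
raw_headers = ['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', '\xa0',
               'GP', 'G', 'A', 'TP', 'PIM', '+/-']

# str.strip() with no arguments removes Unicode whitespace, which includes
# the non-breaking space, so the separator cell becomes an empty string.
column_headers = [h.strip() for h in raw_headers if h.strip()]

print(column_headers)
# ['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']
```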
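Since the tag is a <table>, let pandas do the work for you (read_html parses with lxml, or with bs4 as a fallback, under the hood). Then you can manipulate the result as needed: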
import pandas as pd
url = 'https://www.eliteprospects.com/team/552/guelph-storm/2005-2006?tab=stats'
dfs = pd.read_html(url)
headers = list(dfs[1].columns)
print(headers)
Output:
['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', 'Unnamed: 8', 'GP.1', 'G.1', 'A.1', 'TP.1', 'PIM.1', '+/-.1']
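pandas has renamed things here: the blank separator column became 'Unnamed: 8' and the repeated playoff names got a '.1' suffix so that every column label is unique. If you want the original header text back, a rough sketch follows. clean_header is a hypothetical helper of mine, not a pandas function, and it assumes no genuine header ends in a dot followed by digits.

```python
import re

# The column labels printed above, copied in by hand.
pandas_headers = ['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-',
                  'Unnamed: 8', 'GP.1', 'G.1', 'A.1', 'TP.1', 'PIM.1', '+/-.1']


def clean_header(name):
    """Return the original header text, or None for a pandas placeholder."""
    if name.startswith('Unnamed:'):
        return None  # placeholder pandas generated for the blank column
    # Drop the '.1', '.2', ... suffix pandas appends to duplicate names.
    return re.sub(r'\.\d+$', '', name)


cleaned = [c for c in map(clean_header, pandas_headers) if c is not None]

print(cleaned)
# ['#', 'Player', 'GP', 'G', 'A', 'TP', 'PIM', '+/-', 'GP', 'G', 'A', 'TP', 'PIM', '+/-']
```

Keep in mind that the cleaned list contains duplicates again, so it is fine for display but awkward as DataFrame column labels; prefixing the playoff columns instead may suit you better.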