Creating a Pandas DataFrame from Unequal Arrays Parsed Using BeautifulSoup
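I am trying to build a DataFrame with two columns. The first column holds the names of football leagues; the second holds the names of the teams in those leagues.
I can scrape and parse the data, but because each league name has several team names, I keep getting ValueError: arrays must all be same length.
Here is my code: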
league_names = soup.find_all(class_='panel-title')
team_names = soup.find_all('a', class_="odds")
a = [data.text.strip() for data in league_names]
b = [data.text.strip() for data in team_names]
df = pd.DataFrame({'league_names':a, 'team_names':b}, columns=['league_names','team_names'])
This is the desired output:
league_names | team_names |
---|---|
Albania Championship | Dinamo Tirana - Skenderbeu Korce |
Albania Championship | KF Teuta - FK Egnatia |
Albania Championship | Vllaznia Shkoder - FK Kukesi |
Here is a screenshot of the HTML (the code itself is below, but I can't seem to paste it properly even after following these instructions).
HTML:
<div class="panel">
<div class="panel-heading">
<h4 class="panel-title">
<a class="" title="Click to expand Albania Championship" data-toggle="collapse" href="#_l10041047" aria-expanded="true">
Albania Championship </a>
</h4>
</div>
<div id="_l10041047" class="panel-collapse collapse in" aria-expanded="true" style="">
<ul class="nav list-group">
<li>
<a class="odds" onclick="loadEventData('119024520',this)" title="Dinamo Tirana - Skenderbeu Korce">Dinamo Tirana - Skenderbeu Korce</a>
</li>
<li>
<a class="odds" onclick="loadEventData('119024522',this)" title="KF Teuta - FK Egnatia">KF Teuta - FK Egnatia</a>
</li>
<li>
<a class="odds" onclick="loadEventData('119024524',this)" title="Vllaznia Shkoder - FK Kukesi">Vllaznia Shkoder - FK Kukesi</a>
</li>
</ul>
</div>
</div>
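To avoid ending up with several lists whose lengths may differ, change your scraping strategy. Based on your example, select every <a> with class odds inside the panel and pair each one with the <h4> that precedes it: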
data = []
for link in soup.select('div.panel a.odds'):
    # One dict per fixture link, so league and team always stay paired
    data.append({
        'league': link.find_previous('h4').text.strip(),
        'teams': link.text.strip(),
    })
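For context, the error in the question can be reproduced without any scraping. The sketch below hard-codes the values from the question's HTML (the variable names `leagues`, `teams`, and `rows` are illustrative, not from the original code) and shows why a dict of unequal lists fails while a list of per-row dicts does not. The exact wording of the exception message may differ between pandas versions.

```python
import pandas as pd

leagues = ["Albania Championship"]   # one league heading
teams = [                            # three entries listed under it
    "Dinamo Tirana - Skenderbeu Korce",
    "KF Teuta - FK Egnatia",
    "Vllaznia Shkoder - FK Kukesi",
]

# A dict of lists maps each key to a column, so every list must have
# the same length. One league versus three teams raises ValueError.
try:
    pd.DataFrame({"league_names": leagues, "team_names": teams})
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")

# A list of dicts maps each dict to a row, so the league name is
# simply repeated once per team and no length mismatch can occur.
rows = [{"league_names": leagues[0], "team_names": team} for team in teams]
df = pd.DataFrame(rows)
print(df.shape)
```

This is the same idea the loop above relies on: building one record per team link keeps each team attached to its league.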
Example
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<div class="panel">
<div class="panel-heading">
<h4 class="panel-title">
<a class="" title="Click to expand Albania Championship" data-toggle="collapse" href="#_l10041047" aria-expanded="true">
Albania Championship </a>
</h4>
</div>
<div id="_l10041047" class="panel-collapse collapse in" aria-expanded="true" style="">
<ul class="nav list-group">
<li>
<a class="odds" onclick="loadEventData('119024520',this)" title="Dinamo Tirana - Skenderbeu Korce">Dinamo Tirana - Skenderbeu Korce</a>
</li>
<li>
<a class="odds" onclick="loadEventData('119024522',this)" title="KF Teuta - FK Egnatia">KF Teuta - FK Egnatia</a>
</li>
<li>
<a class="odds" onclick="loadEventData('119024524',this)" title="Vllaznia Shkoder - FK Kukesi">Vllaznia Shkoder - FK Kukesi</a>
</li>
</ul>
</div>
</div>
'''
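# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
data = []
for link in soup.select('div.panel a.odds'):
    data.append({
        'league': link.find_previous('h4').text.strip(),
        'teams': link.text.strip(),
    })
df = pd.DataFrame(data)
print(df)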
Output
league | teams |
---|---|
Albania Championship | Dinamo Tirana - Skenderbeu Korce |
Albania Championship | KF Teuta - FK Egnatia |
Albania Championship | Vllaznia Shkoder - FK Kukesi |
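If the page contains several league panels, a possible variation is to walk the panels one at a time and read the title once per panel instead of calling find_previous for every link. This is only a sketch: it assumes each league sits in its own div.panel as in the question's HTML, and the second panel ("Placeholder League", "Team A - Team B") is invented purely to show the grouping.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Trimmed-down markup; the second panel is made up for illustration.
html = """
<div class="panel">
  <h4 class="panel-title"><a>Albania Championship</a></h4>
  <ul>
    <li><a class="odds">Dinamo Tirana - Skenderbeu Korce</a></li>
    <li><a class="odds">KF Teuta - FK Egnatia</a></li>
  </ul>
</div>
<div class="panel">
  <h4 class="panel-title"><a>Placeholder League</a></h4>
  <ul>
    <li><a class="odds">Team A - Team B</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for panel in soup.select("div.panel"):
    title = panel.select_one(".panel-title")
    if title is None:
        continue  # skip panels that have no league heading
    league = title.get_text(strip=True)
    # Every odds link inside this panel belongs to this league
    for link in panel.select("a.odds"):
        rows.append({"league": league, "teams": link.get_text(strip=True)})

df = pd.DataFrame(rows)
print(df)
```

Both approaches produce one row per team link; the per-panel loop just makes the league-to-team grouping explicit and skips panels without a heading.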