Creating Pandas Dataframe from Unequal Arrays Parsed using BeautifulSoup

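I'm trying to build a dataframe with two columns. The first column holds the names of football leagues. The second holds the names of the teams in those leagues.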

I can scrape and parse the data, but because each league name has multiple team names, I keep getting ValueError: arrays must all be same length

Here is my code:

league_names = soup.find_all(class_='panel-title')
team_names = soup.find_all('a', class_="odds")

a = [data.text.strip() for data in league_names]
b = [data.text.strip() for data in team_names]

df = pd.DataFrame({'league_names':a, 'team_names':b}, columns=['league_names','team_names'])
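The mismatch can be reproduced without any scraping. Below is a minimal sketch in which hard-coded lists stand in for a and b (one league title, three fixtures, taken from the HTML further down); the exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

# Hard-coded stand-ins for the scraped lists: one league, three fixtures.
a = ['Albania Championship']
b = [
    'Dinamo Tirana - Skenderbeu Korce',
    'KF Teuta - FK Egnatia',
    'Vllaznia Shkoder - FK Kukesi',
]

print(len(a), len(b))

try:
    pd.DataFrame({'league_names': a, 'team_names': b})
    error = None
except ValueError as exc:
    # pandas refuses to build a frame from columns of different lengths
    error = exc

print(error)
```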

This is the desired output:

league_names          team_names
Albania Championship  Dinamo Tirana - Skenderbeu Korce
Albania Championship  KF Teuta - FK Egnatia
Albania Championship  Vllaznia Shkoder - FK Kukesi

Here is a screenshot of the HTML (the code itself is below, but I can't seem to paste it properly even when following these instructions).

html:

<div class="panel">
                                        <div class="panel-heading">
                                            <h4 class="panel-title">
                                                <a class="" title="Click to expand Albania Championship" data-toggle="collapse" href="#_l10041047" aria-expanded="true">
                                                    Albania Championship                                                </a>
                                            </h4>
                                        </div>
                                        <div id="_l10041047" class="panel-collapse collapse in" aria-expanded="true" style="">

                                            <ul class="nav list-group">
                                                                                                    <li>
                                                        <a class="odds" onclick="loadEventData('119024520',this)" title="Dinamo Tirana - Skenderbeu Korce">Dinamo Tirana - Skenderbeu Korce</a>
                                                    </li>
                                                                                                    <li>
                                                        <a class="odds" onclick="loadEventData('119024522',this)" title="KF Teuta - FK Egnatia">KF Teuta - FK Egnatia</a>
                                                    </li>
                                                                                                    <li>
                                                        <a class="odds" onclick="loadEventData('119024524',this)" title="Vllaznia Shkoder - FK Kukesi">Vllaznia Shkoder - FK Kukesi</a>
                                                    </li>
                                                
                                            </ul>

                                        </div>
                                    </div>

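To avoid ending up with several lists of possibly different lengths, try changing your scraping strategy. Based on your example, select all <a> elements with class odds inside the panel and combine each one with its preceding <h4>:
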
data = []
for l in soup.select('div.panel a.odds'):
    data.append({
        'league':l.find_previous('h4').text.strip(),
        'teams':l.text
    })
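To see why this keeps leagues and fixtures aligned when the page has more than one league, here is a small sketch. The two-panel markup and the names League A, League B and Team 1 to Team 6 are made up for illustration; they only assume the same panel structure as in your HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two panels with the same structure as in the question.
html = '''
<div class="panel">
  <h4 class="panel-title"><a>League A</a></h4>
  <ul>
    <li><a class="odds">Team 1 - Team 2</a></li>
    <li><a class="odds">Team 3 - Team 4</a></li>
  </ul>
</div>
<div class="panel">
  <h4 class="panel-title"><a>League B</a></h4>
  <ul>
    <li><a class="odds">Team 5 - Team 6</a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_previous('h4') walks backwards through the document from each link,
# so every fixture is paired with the nearest league heading before it.
pairs = [
    (link.find_previous('h4').get_text(strip=True), link.get_text(strip=True))
    for link in soup.select('div.panel a.odds')
]

for league, teams in pairs:
    print(league, '|', teams)
```

Because each row is built from a single link, there is only ever one list, and its length cannot disagree with anything.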

Example

import pandas as pd
from bs4 import BeautifulSoup

html = '''
<div class="panel">
<div class="panel-heading">
        <h4 class="panel-title">
            <a class="" title="Click to expand Albania Championship" data-toggle="collapse" href="#_l10041047" aria-expanded="true">
                Albania Championship                                                </a>
        </h4>
    </div>
    <div id="_l10041047" class="panel-collapse collapse in" aria-expanded="true" style="">

        <ul class="nav list-group">
                                                                <li>
                    <a class="odds" onclick="loadEventData('119024520',this)" title="Dinamo Tirana - Skenderbeu Korce">Dinamo Tirana - Skenderbeu Korce</a>
                </li>
                                                                <li>
                    <a class="odds" onclick="loadEventData('119024522',this)" title="KF Teuta - FK Egnatia">KF Teuta - FK Egnatia</a>
                </li>
                                                                <li>
                    <a class="odds" onclick="loadEventData('119024524',this)" title="Vllaznia Shkoder - FK Kukesi">Vllaznia Shkoder - FK Kukesi</a>
                </li>
        </ul>

    </div>
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
data = []
for l in soup.select('div.panel a.odds'):
    data.append({
        'league':l.find_previous('h4').text.strip(),
        'teams':l.text
    })
    
pd.DataFrame(data)

Output

league                teams
Albania Championship  Dinamo Tirana - Skenderbeu Korce
Albania Championship  KF Teuta - FK Egnatia
Albania Championship  Vllaznia Shkoder - FK Kukesi
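
If you would rather not rely on document order, another option is to go panel by panel, collect the fixtures of each league into a list, and let pandas expand that list with DataFrame.explode. This is only a sketch under the assumption that every league sits in its own div.panel containing one .panel-title; the two-panel markup and the League/Team names are invented for the example.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: two panels with the same structure as in the question.
html = '''
<div class="panel">
  <h4 class="panel-title"><a>League A</a></h4>
  <ul>
    <li><a class="odds">Team 1 - Team 2</a></li>
    <li><a class="odds">Team 3 - Team 4</a></li>
  </ul>
</div>
<div class="panel">
  <h4 class="panel-title"><a>League B</a></h4>
  <ul>
    <li><a class="odds">Team 5 - Team 6</a></li>
  </ul>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

rows = []
for panel in soup.select('div.panel'):
    # One row per league; the fixtures are kept as a list for now.
    rows.append({
        'league': panel.select_one('.panel-title').get_text(strip=True),
        'teams': [a.get_text(strip=True) for a in panel.select('a.odds')],
    })

# explode() turns each list element into its own row and repeats the league.
df = pd.DataFrame(rows).explode('teams').reset_index(drop=True)
print(df)
```

The result has the same shape as the find_previous version, one row per fixture, but the league/fixture grouping comes from the panel each link lives in.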