Using a for loop to create a new column in pandas dataframe

Question

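I've been trying to build a web crawler to scrape data from a website called Baseball Reference. While defining my crawler I realized that each player has a unique ID at the end of their URL, made up of the first 6 letters of their last name, three zeros, and the first 3 letters of their first name.

I have a pandas dataframe that already contains 'first' and 'last' columns with each player's first and last name, along with a lot of other data that I downloaded from the same website.

So far, my crawler function is defined as follows: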

import requests
from bs4 import BeautifulSoup as soup

def bbref_crawler(ID):
    url = 'https://www.baseball-reference.com/register/player.fcgi?id=' + str(ID)
    source_code = requests.get(url)
    page_soup = soup(source_code.text, features='lxml')
    return page_soup

My attempt at getting the player IDs so far looks like this:

for x in nwl_offense:
    while len(nwl_offense['last']) > 6:
        id_last = len(nwl_offense['last']) - 1
    while len(nwl_offense['first']) > 3:
        id_first = len(nwl_offense['first']) - 1
    nwl_offense['player_id'] = (str(id_first) + '000' + str(id_last))

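When I run the for / while loop it never stops running, and I'm not sure how else to accomplish my goal of automating the player ID into another column of that dataframe, so that I can easily use the crawler to get more information about the players that I need for a project.

Here are the first 5 rows of the dataframe, nwl_offense: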

print(nwl_offense.head())
    Rk            Name   Age     G  ...         WRC+        WRC       WSB     OWins
0  1.0     Brian Baker  20.0  14.0  ...   733.107636   2.007068  0.099775  0.189913
1  2.0    Drew Beazley  21.0  46.0  ...   112.669541  29.920766 -0.456988  2.655892
2  3.0  Jarrett Bickel  21.0  33.0  ...    85.017293  15.245547  1.419822  1.502232
3  4.0      Nate Boyle  23.0  21.0  ...  1127.591556   1.543534  0.000000  0.139136
4  5.0    Seth Brewer*  22.0  12.0  ...   243.655365   1.667671  0.099775  0.159319

Answer 1

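As mentioned in the comments, I wouldn't try to write a function that builds the IDs, as there are likely to be some "quirky" IDs in there that don't follow that logic.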
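To illustrate why, here is a minimal sketch of the rule from the question, written with vectorized pandas string methods rather than a loop. (The loop in the question never finishes because `len(nwl_offense['last'])` is the number of rows in the column, which nothing inside the `while` body changes.) The `'-'` padding for short last names is an assumption inferred from IDs such as `aaker-000zac` in the output further down, and the sample rows are made up for illustration.

```python
import pandas as pd

# Illustrative sample only; the real dataframe has many more columns
nwl_offense = pd.DataFrame({'first': ['Evan', 'Kelby', 'Zach'],
                            'last': ['Albrecht', 'Golladay', 'Aaker']})

# First 6 letters of the last name, lowercased and right-padded with '-'
# (the padding is an assumption based on IDs like 'aaker-000zac')
last_part = (nwl_offense['last'].str.lower()
             .str.slice(0, 6)
             .str.pad(6, side='right', fillchar='-'))

# First 3 letters of the first name, lowercased
first_part = nwl_offense['first'].str.lower().str.slice(0, 3)

# Naive guess at the ID: last-name part + '000' + first-name part
nwl_offense['player_id'] = last_part + '000' + first_part
print(nwl_offense)
```

This only ever produces the `000` variant, so it cannot tell you that Evan Albrecht also appears as `albrec001eva`, or that Al Aaberg is registered as `aaberg001alf`. That is the reason for scraping the real IDs instead.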

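If you look at the player register, they have it divided up by letter, so you can go through each letter's page and get the player ID directly from each player's URL.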

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.baseball-reference.com/register/player.fcgi'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

player_register_search = {}
searchLinks = soup.find('div', {'id':'div_players'}).find_all('li')
for each in searchLinks:
    links = each.find_all('a', href=True)
    for link in links:
        print(link)
        player_register_search[link.text] = 'https://www.baseball-reference.com/' + link['href']
        

tot = len(player_register_search)
playerIds = {}
for count, (k, link)in enumerate(player_register_search.items(), start=1):
    print(f'{count} of {tot} - {link}')
    
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    kLower = k.lower()
    playerSection = soup.find('div', {'id':f'all_players_{kLower}'})
    
    h2 = playerSection.find('h2').text
    #print('\t',h2)
    
    player_links = playerSection.find_all('a', href=True)
    for player in player_links:
        playerName = player.text.strip()
        playerId = player['href'].split('id=')[-1].strip()
        
        if playerName not in playerIds.keys():
            playerIds[playerName] = []
            
        #print(f'\t{playerName}: {playerId}')
        playerIds[playerName].append(playerId)



df = pd.DataFrame({'Player' : list(playerIds.keys()),
                   'id': list(playerIds.values())})

Output:

print(df)
                 Player              id
0          Scott A'Hara  [ahara-000sco]
1               A'Heasy  [ahease001---]
2             Al Aaberg  [aaberg001alf]
3          Kirk Aadland  [aadlan001kir]
4            Zach Aaker  [aaker-000zac]
                ...             ...
323628      Mike Zywica  [zywica001mic]
323629  Joseph Zywiciel  [zywici000jos]
323630    Bobby Zywicki  [zywick000bob]
323631  Brandon Zywicki  [zywick000bra]
323632       Nate Zyzda  [zyzda-000nat]

[323633 rows x 2 columns]
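The `split('id=')` step above works because every player link ends in `?id=<player id>`. If you would rather not rely on string splitting, a possible alternative is to parse the query string with the standard library. This is only a sketch: `extract_player_id` is a hypothetical helper name, and the example href assumes the same URL shape as the one used in the question.

```python
from urllib.parse import urlparse, parse_qs

def extract_player_id(href):
    """Return the value of the 'id' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(href).query)
    ids = query.get('id')
    return ids[0] if ids else None

# Example href in the shape used by the question's crawler URL
print(extract_player_id('/register/player.fcgi?id=albrec000eva'))
```

Links without an `id` parameter return None here, which makes them easy to skip.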

Get the players from your dataframe:

This is just a sample of your dataframe. Do not include this in your code.

# Sample of the dataframe
nwl_offense = pd.DataFrame({'Name':['Evan Albrecht', 'Kelby Golladay'],
                            'first':['Evan', 'Kelby'],
                            'last':['Albrecht', 'Golladay']})

Use this:

# YOUR DATAFRAME - GET LIST OF NAMES
player_interest_list = list(nwl_offense['Name'])


nwl_players = df.loc[df['Player'].isin(player_interest_list)]

Output:

print(nwl_players)
                Player                            id
3095     Evan Albrecht  [albrec001eva, albrec000eva]
108083  Kelby Golladay                [gollad000kel]
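To get back to the original goal of a new column, one option is to merge the IDs onto your dataframe instead of filtering with `isin`. The sketch below is an illustration under assumptions, not a tested recipe for the full data: the two small frames are stand-ins for `df` and `nwl_offense`, `clean_name` and `player_id` are column names chosen for illustration, and stripping a trailing `*` is based only on the `Seth Brewer*` row shown in the question.

```python
import pandas as pd

# Stand-in for the scraped lookup table (values copied from the output above)
df = pd.DataFrame({'Player': ['Evan Albrecht', 'Kelby Golladay'],
                   'id': [['albrec001eva', 'albrec000eva'], ['gollad000kel']]})

# Stand-in for the real dataframe; the '*' mimics the marker seen in the 'Name' column
nwl_offense = pd.DataFrame({'Name': ['Evan Albrecht', 'Kelby Golladay*', 'Seth Brewer*']})

# Remove trailing '*' markers so the names can match the lookup table
nwl_offense['clean_name'] = (nwl_offense['Name']
                             .str.replace(r'\*+$', '', regex=True)
                             .str.strip())

# Left merge keeps every row of nwl_offense and adds the list of candidate IDs
merged = nwl_offense.merge(df, how='left', left_on='clean_name', right_on='Player')
merged = merged.rename(columns={'id': 'player_id'}).drop(columns=['Player'])
print(merged)
```

Players with more than one candidate ID (like Evan Albrecht) still need to be disambiguated, for example by checking each page with the crawler. Names that are not found in the lookup table come back as NaN.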

