Using a for loop to create a new column in pandas dataframe
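I have been trying to create a web crawler to scrape data from a website called Baseball Reference. While defining my crawler, I realized that each player has a unique ID at the end of their URL, made up of the first 6 letters of their last name, three zeros, and the first 3 letters of their first name.
I have a pandas dataframe that already contains 'first' and 'last' columns with each player's first and last name, along with a lot of other data that I downloaded from the same website.
So far, my crawler function is defined as follows: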
def bbref_crawler(ID):
    url = 'https://www.baseball-reference.com/register/player.fcgi?id=' + str(ID)
    source_code = requests.get(url)
    page_soup = soup(source_code.text, features='lxml')
So far, the code I have tried for getting the player IDs is as follows:
for x in nwl_offense:
    while len(nwl_offense['last']) > 6:
        id_last = len(nwl_offense['last']) - 1
    while len(nwl_offense['first']) > 3:
        id_first = len(nwl_offense['first']) - 1
    nwl_offense['player_id'] = (str(id_first) + '000' + str(id_last))
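When I run the for/while loop, it never stops running. I am not sure how else to accomplish what I set out to do: automatically add each player's ID as another column of that dataframe, so that I can easily use the crawler to fetch the additional player information I need for a project.
This is what the first 5 rows of the dataframe, nwl_offense, look like: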
print(nwl_offense.head())
    Rk            Name   Age     G  ...         WRC+        WRC       WSB     OWins
0  1.0     Brian Baker  20.0  14.0  ...   733.107636   2.007068  0.099775  0.189913
1  2.0    Drew Beazley  21.0  46.0  ...   112.669541  29.920766 -0.456988  2.655892
2  3.0  Jarrett Bickel  21.0  33.0  ...    85.017293  15.245547  1.419822  1.502232
3  4.0      Nate Boyle  23.0  21.0  ...  1127.591556   1.543534  0.000000  0.139136
4  5.0    Seth Brewer*  22.0  12.0  ...   243.655365   1.667671  0.099775  0.159319
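As stated in the comments, I wouldn't try to write a function that builds the IDs, because there are likely to be some "quirky" IDs that don't follow that logic.
If you simply go through the site's player search one letter group at a time, the register is already divided up that way, and you can read each player's ID straight from the player's URL.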
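To illustrate the caution above, here is a minimal sketch of what building the ID directly from the names would look like. It is not part of the original answer. It also shows why the loop in the question never ends: `len(nwl_offense['last'])` is the number of rows in the column, not the length of a name, and nothing inside the `while` changes it. The helper name `naive_player_id`, the sample rows, and the rule of padding short surnames with `-` (inferred from IDs such as `aaker-000zac` in the listing further down) are assumptions.

```python
import pandas as pd

def naive_player_id(first: pd.Series, last: pd.Series) -> pd.Series:
    """Hypothetical helper: <6 chars of last name> + '000' + <3 chars of first name>."""
    # Lowercase, keep letters only, truncate, then pad short names with '-'
    last_part = (last.str.lower()
                     .str.replace(r'[^a-z]', '', regex=True)
                     .str[:6]
                     .str.pad(6, side='right', fillchar='-'))
    first_part = (first.str.lower()
                       .str.replace(r'[^a-z]', '', regex=True)
                       .str[:3]
                       .str.pad(3, side='right', fillchar='-'))
    # Operates on whole columns at once, so no for/while loop is needed
    return last_part + '000' + first_part

# Illustrative rows only; names taken from the outputs shown in this thread
sample = pd.DataFrame({'first': ['Evan', 'Kelby', 'Zach'],
                       'last': ['Albrecht', 'Golladay', 'Aaker']})
sample['player_id'] = naive_player_id(sample['first'], sample['last'])
print(sample)
```

For these three rows the sketch yields albrec000eva, gollad000kel and aaker-000zac. It cannot produce an ID such as aaberg001alf (Al Aaberg), and it cannot tell apart the two Evan Albrecht entries, which is why looking the IDs up is the safer route.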
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.baseball-reference.com/register/player.fcgi'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Collect the link to each index page of the player register
player_register_search = {}
searchLinks = soup.find('div', {'id':'div_players'}).find_all('li')
for each in searchLinks:
    links = each.find_all('a', href=True)
    for link in links:
        print(link)
        player_register_search[link.text] = 'https://www.baseball-reference.com/' + link['href']

tot = len(player_register_search)
playerIds = {}
# Visit each index page and record every player's name and ID(s)
for count, (k, link) in enumerate(player_register_search.items(), start=1):
    print(f'{count} of {tot} - {link}')
    response = requests.get(link)
    soup = BeautifulSoup(response.text, 'html.parser')

    kLower = k.lower()
    playerSection = soup.find('div', {'id':f'all_players_{kLower}'})
    h2 = playerSection.find('h2').text
    #print('\t',h2)

    player_links = playerSection.find_all('a', href=True)
    for player in player_links:
        playerName = player.text.strip()
        # The ID is whatever follows 'id=' in the player's link
        playerId = player['href'].split('id=')[-1].strip()
        # Several players can share a name, so keep a list of IDs per name
        if playerName not in playerIds.keys():
            playerIds[playerName] = []
        #print(f'\t{playerName}: {playerId}')
        playerIds[playerName].append(playerId)

df = pd.DataFrame({'Player' : list(playerIds.keys()),
                   'id': list(playerIds.values())})
Output:
print(df)
                  Player              id
0           Scott A'Hara  [ahara-000sco]
1                A'Heasy  [ahease001---]
2              Al Aaberg  [aaberg001alf]
3           Kirk Aadland  [aadlan001kir]
4             Zach Aaker  [aaker-000zac]
...                  ...             ...
323628       Mike Zywica  [zywica001mic]
323629   Joseph Zywiciel  [zywici000jos]
323630     Bobby Zywicki  [zywick000bob]
323631   Brandon Zywicki  [zywick000bra]
323632        Nate Zyzda  [zyzda-000nat]

[323633 rows x 2 columns]
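One detail of the scraping loop may be worth a closer look: `href.split('id=')[-1]` returns the whole href unchanged when a link has no `id=` part, so a non-player link inside the section would be stored as if it were an ID. The sketch below is one possible alternative using the standard library's URL parser; the function name `extract_player_id` and the sample links are hypothetical, and the real page markup may differ.

```python
from urllib.parse import urlparse, parse_qs

def extract_player_id(href):
    """Return the value of the 'id' query parameter, or None if the link has none."""
    query = parse_qs(urlparse(href).query)
    values = query.get('id')
    return values[0] if values else None

# Hypothetical (name, href) pairs, shaped like what find_all('a', href=True) might give
links = [('Al Aaberg', '/register/player.fcgi?id=aaberg001alf'),
         ('Evan Albrecht', '/register/player.fcgi?id=albrec001eva'),
         ('Evan Albrecht', '/register/player.fcgi?id=albrec000eva'),
         ('Register home', '/register/')]

player_ids = {}
for name, href in links:
    pid = extract_player_id(href)
    if pid is None:
        # Not a player link, so skip it instead of storing the raw href
        continue
    # setdefault creates the list the first time a name is seen
    player_ids.setdefault(name, []).append(pid)

print(player_ids)
```

Here the "Register home" link is skipped, and both IDs for Evan Albrecht end up in one list, matching the list-per-name layout of the dataframe above.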
To get the players from your dataframe:
This is just a sample of your dataframe. Do not include this in your code.
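# Sample of the dataframe
nwl_offense = pd.DataFrame({'first':['Evan', 'Kelby'],
                            'last':['Albrecht', 'Golladay']})
# The lookup below reads a 'Name' column; build it from 'first' and 'last' if it is missing
nwl_offense['Name'] = nwl_offense['first'] + ' ' + nwl_offense['last']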
Use this:
# YOUR DATAFRAME - GET LIST OF NAMES
player_interest_list = list(nwl_offense['Name'])
nwl_players = df.loc[df['Player'].isin(player_interest_list)]
Output:
print(nwl_players)
                Player                            id
3095     Evan Albrecht  [albrec001eva, albrec000eva]
108083  Kelby Golladay                [gollad000kel]
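Since the goal in the question is a new column on nwl_offense, the filtered lookup can also be merged back onto it. The following is only a sketch under a few assumptions: the two small dataframes stand in for the real `df` and `nwl_offense`, the column names `clean_name`, `player_ids` and `ambiguous` are made up for illustration, and the trailing `*` is treated as a marker to strip because names like "Seth Brewer*" would otherwise never match.

```python
import pandas as pd

# Stand-in for the scraped lookup table `df` (values copied from the outputs above)
df = pd.DataFrame({'Player': ['Evan Albrecht', 'Kelby Golladay', 'Zach Aaker'],
                   'id': [['albrec001eva', 'albrec000eva'], ['gollad000kel'], ['aaker-000zac']]})

# Stand-in for nwl_offense; the '*' mimics the "Seth Brewer*" style of name
nwl_offense = pd.DataFrame({'Name': ['Evan Albrecht', 'Kelby Golladay*', 'Nobody Listed']})

# Normalise names: drop a trailing '*' and any surrounding whitespace
nwl_offense['clean_name'] = nwl_offense['Name'].str.rstrip('*').str.strip()

# Left merge keeps every row of nwl_offense; unmatched players get NaN in 'player_ids'
lookup = df.rename(columns={'Player': 'clean_name', 'id': 'player_ids'})
merged = nwl_offense.merge(lookup, on='clean_name', how='left')

# Flag names that map to more than one ID and need to be resolved by hand
merged['ambiguous'] = merged['player_ids'].apply(
    lambda ids: isinstance(ids, list) and len(ids) > 1)

print(merged)
```

Rows flagged as ambiguous (Evan Albrecht here) still need another field, such as age or team, to pick the right ID before calling the crawler.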