Reassign pandas series values using nested defaultdict
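I'm working with an NFL dataset and want to do the following mapping for each play in the df:
- I'm trying to populate a column (DistToRusher) with each player's distance to the rusher on that play. The DistToRusher column is currently populated with player IDs.
- I'm trying to match those player IDs against the inner dictionary keys and replace them with the inner dictionary values.
- I have a defaultdict-of-dictionaries, dist_dict, that looks like this: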
dist_dict = {play_id1: {player_id1: distance, player_id2: distance ...},
play_id2: {player_id1: distance, player_id2: distance ...}...}
Here is my code:
def populate_DistToRusher_column(df):
    for play_id, players_dict in dist_dict.items():
        df[df.PlayId == play_id].replace({'DistToRusher': players_dict}, inplace=True)
    return df
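This code runs slowly (20-30 seconds) and does not change the DistToRusher column; when I inspect df, DistToRusher still contains player ID numbers rather than distances.
Here is a toy version of the actual data: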
from collections import defaultdict
import pandas as pd

df = pd.DataFrame.from_dict({
    'PlayId': {0: 20170907000118, 1: 20170907000118, 2: 20170907000118,
               22: 20170907000139, 23: 20170907000139, 24: 20170907000139},
    'NflId': {0: 496723, 1: 2495116, 2: 2495493,
              22: 496723, 23: 2495116, 24: 2495493},
    'NflIdRusher': {0: 2543773, 1: 2543773, 2: 2543773,
                    22: 2543773, 23: 2543773, 24: 2543773},
    'DistToRusher': {0: 496723, 1: 2495116, 2: 2495493,
                     22: 496723, 23: 2495116, 24: 2495493}})

dist_dict = {20170907000118: defaultdict(float,
                                         {496723: 6.480871854928166,
                                          2495116: 4.593310353111358,
                                          2495493: 5.44898155621764}),
             20170907000139: defaultdict(float,
                                         {496723: 8.583355987025117,
                                          2495116: 5.821151088917024,
                                          2495493: 6.658686056573021})}
I think this is right, IIUC:
temp = pd.DataFrame(dist_dict)
df['DistToRusher2'] = df.apply(lambda x: temp[x.PlayId][x.NflId], axis=1)
or
df['DistToRusher2'] = df.apply(lambda x: dist_dict[x.PlayId][x.NflId], axis=1)
Output:
            PlayId    NflId  NflIdRusher  DistToRusher  DistToRusher2
0   20170907000118   496723      2543773        496723       6.480872
1   20170907000118  2495116      2543773       2495116       4.593310
2   20170907000118  2495493      2543773       2495493       5.448982
22  20170907000139   496723      2543773        496723       8.583356
23  20170907000139  2495116      2543773       2495116       5.821151
24  20170907000139  2495493      2543773       2495493       6.658686
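A note on why the loop in the question leaves df unchanged: df[df.PlayId == play_id] uses a boolean mask, which returns a new DataFrame, so replace(..., inplace=True) modifies that temporary object rather than df. Below is a minimal sketch of a loop that writes back through .loc instead. The function name populate_dist_to_rusher and the trimmed toy frame are illustrative assumptions, not the asker's code.

```python
from collections import defaultdict

import pandas as pd

# Trimmed toy data mirroring the question (NflIdRusher omitted for brevity).
df = pd.DataFrame(
    {
        "PlayId": [20170907000118] * 3 + [20170907000139] * 3,
        "NflId": [496723, 2495116, 2495493] * 2,
        "DistToRusher": [496723, 2495116, 2495493] * 2,
    },
    index=[0, 1, 2, 22, 23, 24],
)
dist_dict = {
    20170907000118: defaultdict(float, {496723: 6.480871854928166,
                                        2495116: 4.593310353111358,
                                        2495493: 5.44898155621764}),
    20170907000139: defaultdict(float, {496723: 8.583355987025117,
                                        2495116: 5.821151088917024,
                                        2495493: 6.658686056573021}),
}


def populate_dist_to_rusher(df, dist_dict):
    """Return a copy of df whose DistToRusher holds distances, not player IDs."""
    out = df.copy()
    ids = out["DistToRusher"]
    # Start from NaN so rows without a matching play stay visibly unfilled.
    distances = pd.Series(float("nan"), index=out.index)
    for play_id, players in dist_dict.items():
        mask = out["PlayId"] == play_id
        # dict(players) makes unknown player IDs map to NaN instead of 0.0.
        distances.loc[mask] = ids.loc[mask].map(dict(players)).to_numpy()
    out["DistToRusher"] = distances
    return out


result = populate_dist_to_rusher(df, dist_dict)
print(result)
```

Assigning through .loc on the frame itself is what makes the change stick. The loop still makes one pass over the frame per play, so it is unlikely to be fast on the full dataset.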
Thanks @oppressionslayer! This works like a charm:
df['DistToRusher2'] = df.apply(lambda x: dist_dict[x.PlayId][x.NflId], axis=1)
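Since df.apply with axis=1 calls a Python function once per row, it may become slow on the full dataset. One possible alternative, sketched here under the assumption that each (PlayId, NflId) pair appears at most once in dist_dict, is to flatten the nested dict into a lookup table and left-merge it onto df. The names lookup and merged are illustrative, and the toy frame is trimmed to the two key columns.

```python
import pandas as pd

# Trimmed toy data mirroring the question; plain dicts stand in for defaultdicts.
df = pd.DataFrame(
    {
        "PlayId": [20170907000118] * 3 + [20170907000139] * 3,
        "NflId": [496723, 2495116, 2495493] * 2,
    },
    index=[0, 1, 2, 22, 23, 24],
)
dist_dict = {
    20170907000118: {496723: 6.480871854928166,
                     2495116: 4.593310353111358,
                     2495493: 5.44898155621764},
    20170907000139: {496723: 8.583355987025117,
                     2495116: 5.821151088917024,
                     2495493: 6.658686056573021},
}

# Flatten the nested dict into one row per (play, player) pair.
lookup = pd.DataFrame(
    [
        (play_id, player_id, dist)
        for play_id, players in dist_dict.items()
        for player_id, dist in players.items()
    ],
    columns=["PlayId", "NflId", "DistToRusher2"],
)

# A left merge keeps every row of df in its original order;
# pairs missing from the lookup come back as NaN.
merged = df.merge(lookup, on=["PlayId", "NflId"], how="left")

# merge returns a fresh RangeIndex, so restore the original labels.
# This is only safe because the lookup has no duplicate key pairs.
merged.index = df.index
print(merged)
```

Unlike indexing into dist_dict directly, this does not raise on a play or player that is absent from the dictionary; those rows simply end up as NaN.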