How to sum values of one column based on other columns in pandas?
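I have a dataframe like the one shown below (text version further down):
I'm supposed to work out which country has scored the most goals in tournaments since 2010. So far I've managed to filter out the friendlies like this: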
no_friendlies = df[df.tournament != "Friendly"]
Then I set the date column as the index so that I could filter out all matches before 2010:
no_friendlies_indexed = no_friendlies.set_index('date')
since_2010 = no_friendlies_indexed.loc['2010-01-01':]
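From this point on I'm pretty lost, because I can't figure out how to count the goals each country scored both home and away.
Any help/advice is appreciated!
Edit:
Text version of the sample data: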
date home_team away_team home_score away_score tournament city country neutral
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False
1 1873-03-08 England Scotland 4 2 Friendly London England False
2 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland False
3 1875-03-06 England Scotland 2 2 Friendly London England False
4 1876-03-04 Scotland England 3 0 Friendly Glasgow Scotland False
5 1876-03-25 Scotland Wales 4 0 Friendly Glasgow Scotland False
6 1877-03-03 England Scotland 1 3 Friendly London England False
7 1877-03-05 Wales Scotland 0 2 Friendly Wrexham Wales False
8 1878-03-02 Scotland England 7 2 Friendly Glasgow Scotland False
9 1878-03-23 Scotland Wales 9 0 Friendly Glasgow Scotland False
10 1879-01-18 England Wales 2 1 Friendly London England False
Edit 2:
I just tried this:
since_2010.groupby(['home_team', 'home_score']).sum()
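But this doesn't return the sum of the home team's scores (if it had worked, I would have repeated it for the away team to get the total).
Use .groupby and .sum() for the home team, then do the same for the away team, and add the two together: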
df_new = df.groupby('home_team')['home_score'].sum() + df.groupby('away_team')['away_score'].sum()
Output:
England 12
Scotland 34
Wales 1
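One caveat with the plain + above: pandas aligns the two results on their team labels, so a team that appears on only one side (home only or away only) ends up as NaN. A minimal sketch that sidesteps this with Series.add(..., fill_value=0) follows. The helper name total_goals and its since / exclude_tournament parameters are made up for illustration, and the data is just the sample from the question.

```python
import pandas as pd

# Sample rows from the question (only the columns used here).
rows = [
    ("1872-11-30", "Scotland", "England", 0, 0, "Friendly"),
    ("1873-03-08", "England", "Scotland", 4, 2, "Friendly"),
    ("1874-03-07", "Scotland", "England", 2, 1, "Friendly"),
    ("1875-03-06", "England", "Scotland", 2, 2, "Friendly"),
    ("1876-03-04", "Scotland", "England", 3, 0, "Friendly"),
    ("1876-03-25", "Scotland", "Wales", 4, 0, "Friendly"),
    ("1877-03-03", "England", "Scotland", 1, 3, "Friendly"),
    ("1877-03-05", "Wales", "Scotland", 0, 2, "Friendly"),
    ("1878-03-02", "Scotland", "England", 7, 2, "Friendly"),
    ("1878-03-23", "Scotland", "Wales", 9, 0, "Friendly"),
    ("1879-01-18", "England", "Wales", 2, 1, "Friendly"),
]
cols = ["date", "home_team", "away_team", "home_score", "away_score", "tournament"]
df = pd.DataFrame(rows, columns=cols)
df["date"] = pd.to_datetime(df["date"])


def total_goals(frame, since=None, exclude_tournament=None):
    """Hypothetical helper: goals per team, home and away combined."""
    if since is not None:
        frame = frame[frame["date"] >= pd.Timestamp(since)]
    if exclude_tournament is not None:
        frame = frame[frame["tournament"] != exclude_tournament]
    home = frame.groupby("home_team")["home_score"].sum()
    away = frame.groupby("away_team")["away_score"].sum()
    # fill_value=0 keeps teams that appear on only one side (plain + gives NaN).
    return home.add(away, fill_value=0).astype(int).sort_values(ascending=False)


all_time = total_goals(df)
# From 1878 onward Wales only plays away, so it is missing from the home result.
since_1878 = total_goals(df, since="1878-01-01")
print(all_time)
print(since_1878)
```

On the full dataset, something like total_goals(df, since="2010-01-01", exclude_tournament="Friendly").idxmax() should name the top-scoring country; on the sample above that filter would drop every row, since all matches are 19th-century friendlies.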
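A more detailed explanation (per the comments):
- You only need to .groupby one column, home_team. In your attempt, you grouped by ['home_team', 'home_score']. Your goal (no pun intended) is to get the .sum() of home_score, so you should NOT .groupby() it. As you can see, ['home_score'] comes after the .groupby part, which is what lets me take its .sum(). That takes care of the home team.
- Then you do the same for away_team.
- At this point Python / pandas is smart enough that, because the home_team and away_team results are indexed by the same country names, you can simply add them together...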
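To make the first point concrete, here is a small sketch comparing the two groupings on the home-side columns of the sample data. Grouping by both columns creates one group per distinct (team, score) pair, whereas grouping by the team alone and then selecting home_score gives one total per team. The variable names by_pair and by_team are only for illustration.

```python
import pandas as pd

# Home-side columns of the sample data from the question.
df = pd.DataFrame({
    "home_team": ["Scotland", "England", "Scotland", "England", "Scotland",
                  "Scotland", "England", "Wales", "Scotland", "Scotland", "England"],
    "home_score": [0, 4, 2, 2, 3, 4, 1, 0, 7, 9, 2],
})

# Grouping by BOTH columns: one group per distinct (team, score) pair,
# so there is nothing left to add up per team.
by_pair = df.groupby(["home_team", "home_score"]).size()

# Grouping by the team only, then selecting the score column to sum.
by_team = df.groupby("home_team")["home_score"].sum()

print(by_pair)  # how many matches ended with each (team, score) pair
print(by_team)  # home goals per team
```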
Reshape with pd.wide_to_long. The benefit is that it automatically creates a 'home_or_away' indicator, but first we rename the columns so that they become 'score_home' (rather than 'home_score').
# Swap column stubs around `'_'`
df.columns = ['_'.join(x[::-1]) for x in df.columns.str.split('_')]
# Your filter for non-friendly matches since 2010 would go here;
# it is commented out because it drops every row of the provided example
# df['date'] = pd.to_datetime(df['date'])
# df = df[df['date'].dt.year.ge(2010) & df['tournament'].ne('Friendly')]
df = pd.wide_to_long(df, i='date', j='home_or_away',
                     stubnames=['team', 'score'], sep='_', suffix='.*')
# country neutral tournament city team score
#date home_or_away
#1872-11-30 home Scotland False Friendly Glasgow Scotland 0
#1873-03-08 home England False Friendly London England 4
#1874-03-07 home Scotland False Friendly Glasgow Scotland 2
#...
#1878-03-02 away Scotland False Friendly Glasgow England 2
#1878-03-23 away Scotland False Friendly Glasgow Wales 0
#1879-01-18 away England False Friendly London Wales 1
So now you can get the goals scored regardless of home or away:
df.groupby('team')['score'].sum()
#team
#England 12
#Scotland 34
#Wales 1
#Name: score, dtype: int64
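One thing to watch with this approach: pd.wide_to_long requires the i column(s) to uniquely identify each row. That holds for the sample, where every date occurs once, but a full results dataset will presumably have several matches on the same date, and then i='date' raises an error. A possible workaround, sketched below on the sample data, is to add a per-row identifier and use it as i. The column name match_id is an assumption made up for this example.

```python
import pandas as pd

# Sample rows from the question (only the columns used here).
rows = [
    ("1872-11-30", "Scotland", "England", 0, 0),
    ("1873-03-08", "England", "Scotland", 4, 2),
    ("1874-03-07", "Scotland", "England", 2, 1),
    ("1875-03-06", "England", "Scotland", 2, 2),
    ("1876-03-04", "Scotland", "England", 3, 0),
    ("1876-03-25", "Scotland", "Wales", 4, 0),
    ("1877-03-03", "England", "Scotland", 1, 3),
    ("1877-03-05", "Wales", "Scotland", 0, 2),
    ("1878-03-02", "Scotland", "England", 7, 2),
    ("1878-03-23", "Scotland", "Wales", 9, 0),
    ("1879-01-18", "England", "Wales", 2, 1),
]
cols = ["date", "home_team", "away_team", "home_score", "away_score"]
df = pd.DataFrame(rows, columns=cols)

# Swap the stubs: home_team -> team_home, away_score -> score_away, ...
df.columns = ["_".join(c.split("_")[::-1]) for c in df.columns]

# Hypothetical unique row id, added after the swap so its name is left alone.
df["match_id"] = range(len(df))

long = pd.wide_to_long(df, stubnames=["team", "score"], i="match_id",
                       j="home_or_away", sep="_", suffix=r"\w+")

totals = long.groupby("team")["score"].sum()
print(totals)
```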