Generate pairings within World Cup tournament groups
I've put together some data from the 2015 FIFA Women's World Cup:
import pandas as pd
df = pd.DataFrame({
'team':['Germany','USA','France','Japan','Sweden','England','Brazil','Canada','Australia','Norway','Netherlands','Spain',
'China','New Zealand','South Korea','Switzerland','Mexico','Colombia','Thailand','Nigeria','Ecuador','Ivory Coast','Cameroon','Costa Rica'],
'group':['B','D','F','C','D','F','E','A','D','B','A','E','A','A','E','C','F','F','B','D','C','B','C','E'],
'fifascore':[2168,2158,2103,2066,2008,2001,1984,1969,1968,1933,1919,1867,1847,1832,1830,1813,1748,1692,1651,1633,1485,1373,1455,1589],
'ftescore':[95.6,95.4,92.4,92.7,91.6,89.6,92.2,90.1,88.7,88.7,86.2,84.7,85.2,82.5,84.3,83.7,81.1,78.0,68.0,85.7,63.3,75.6,79.3,72.8]
})
df.groupby(['group', 'team']).mean()
Now I would like to generate a new data frame that contains the 6 possible pairings or matches within each group from df, in a format like this:
group team1 team2
A Canada China
A Canada Netherlands
A Canada New Zealand
A China Netherlands
A China New Zealand
A Netherlands New Zealand
B Germany Ivory Coast
B Germany Norway
...
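What is a concise and clean way to do this? I could write a bunch of loops over each group and team, but I feel there should be a cleaner, vectorized way to do it with pandas and the split-apply-combine paradigm.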
Edit: I also welcome any R answers; I think it would be interesting to compare the R and pandas approaches here. Added the r tag.
Here is the data in R form, as requested in the comments:
team <- c('Germany','USA','France','Japan','Sweden','England','Brazil','Canada','Australia','Norway','Netherlands','Spain',
'China','New Zealand','South Korea','Switzerland','Mexico','Colombia','Thailand','Nigeria','Ecuador','Ivory Coast','Cameroon','Costa Rica')
group <- c('B','D','F','C','D','F','E','A','D','B','A','E','A','A','E','C','F','F','B','D','C','B','C','E')
fifascore <- c(2168,2158,2103,2066,2008,2001,1984,1969,1968,1933,1919,1867,1847,1832,1830,1813,1748,1692,1651,1633,1485,1373,1455,1589)
ftescore <- c(95.6,95.4,92.4,92.7,91.6,89.6,92.2,90.1,88.7,88.7,86.2,84.7,85.2,82.5,84.3,83.7,81.1,78.0,68.0,85.7,63.3,75.6,79.3,72.8)
df <- data.frame(team, group, fifascore, ftescore)
Here is a two-line solution:
import itertools
for grpname, grpteams in df.groupby('group')['team']:
    # No need to use grpteams.tolist() to convert from pandas Series to Python list
    print(list(itertools.combinations(grpteams, 2)))
[('Canada', 'Netherlands'), ('Canada', 'China'), ('Canada', 'New Zealand'), ('Netherlands', 'China'), ('Netherlands', 'New Zealand'), ('China', 'New Zealand')]
[('Germany', 'Norway'), ('Germany', 'Thailand'), ('Germany', 'Ivory Coast'), ('Norway', 'Thailand'), ('Norway', 'Ivory Coast'), ('Thailand', 'Ivory Coast')]
[('Japan', 'Switzerland'), ('Japan', 'Ecuador'), ('Japan', 'Cameroon'), ('Switzerland', 'Ecuador'), ('Switzerland', 'Cameroon'), ('Ecuador', 'Cameroon')]
[('USA', 'Sweden'), ('USA', 'Australia'), ('USA', 'Nigeria'), ('Sweden', 'Australia'), ('Sweden', 'Nigeria'), ('Australia', 'Nigeria')]
[('Brazil', 'Spain'), ('Brazil', 'South Korea'), ('Brazil', 'Costa Rica'), ('Spain', 'South Korea'), ('Spain', 'Costa Rica'), ('South Korea', 'Costa Rica')]
[('France', 'England'), ('France', 'Mexico'), ('France', 'Colombia'), ('England', 'Mexico'), ('England', 'Colombia'), ('Mexico', 'Colombia')]
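The question asks for a data frame rather than printed lists. A minimal sketch of one way to collect the same combinations into a frame follows. The shortened two-group `df` and the names `rows` and `pairings` are stand-ins introduced here for illustration, not part of the original answer.

```python
import itertools
import pandas as pd

# Shortened stand-in for the question's df (groups A and B only)
df = pd.DataFrame({
    'team': ['Germany', 'Canada', 'Norway', 'Netherlands',
             'China', 'New Zealand', 'Thailand', 'Ivory Coast'],
    'group': ['B', 'A', 'B', 'A', 'A', 'A', 'B', 'B'],
})

# One (group, team1, team2) row per pairing; groupby iterates groups in sorted order
rows = [
    (grpname, team1, team2)
    for grpname, grpteams in df.groupby('group')['team']
    for team1, team2 in itertools.combinations(grpteams, 2)
]
pairings = pd.DataFrame(rows, columns=['group', 'team1', 'team2'])
print(pairings)
```

Each group of 4 teams contributes 6 rows, and the teams within a pairing keep the order in which they appear in `df`.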
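Explanation:
First, we use df.groupby('group') to split the data by group, iterate over the groups and access the 'team' Series of each one to get the list of 4 teams in each group:
for grpname, grpteams in df.groupby('group')['team']:
    teamlist = grpteams.tolist()
    ...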
['Canada', 'Netherlands', 'China', 'New Zealand']
['Germany', 'Norway', 'Thailand', 'Ivory Coast']
['Japan', 'Switzerland', 'Ecuador', 'Cameroon']
['USA', 'Sweden', 'Australia', 'Nigeria']
['Brazil', 'Spain', 'South Korea', 'Costa Rica']
['France', 'England', 'Mexico', 'Colombia']
Then we generate the all-play-all list of tuples of teams.
David Arenburg's post reminded me to use itertools.combinations(..., 2). But we could also use a generator or a nested for-loop:
def all_play_all(teams):
    for team1 in teams:
        for team2 in teams:
            if team1 < team2:  # [Note] We don't need to generate indices then index into teamlist, just use direct string comparison
                yield (team1, team2)
>>> [match for match in all_play_all(grpteams)]
[('France', 'Mexico'), ('England', 'France'), ('England', 'Mexico'), ('Colombia', 'France'), ('Colombia', 'England'), ('Colombia', 'Mexico')]
Note that this is a shortcut. The longer way would be to first generate all possible tuples of indices, then use those to index into teamlist:
>>> T = len(teamlist)
>>> [(i,j) for i in range(T) for j in range(T) if i<j]
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
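As a small sketch of that index-based route, the index tuples can be mapped back to team names. Here `T` is taken to be the number of teams in the group, and `index_pairs` and `matches` are illustrative names.

```python
import itertools

teamlist = ['Canada', 'Netherlands', 'China', 'New Zealand']

# All index pairs (i, j) with i < j, where T is the number of teams
T = len(teamlist)
index_pairs = [(i, j) for i in range(T) for j in range(T) if i < j]

# Turn each index pair into a pair of team names
matches = [(teamlist[i], teamlist[j]) for i, j in index_pairs]
print(matches)

# Same pairs, in the same order, as itertools.combinations
print(matches == list(itertools.combinations(teamlist, 2)))
```

Unlike the string-comparison generator, this keeps the teams in their original (seeding) order within each pair.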
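(Note: directly comparing team names has the minor side effect of reordering the teams within each pairing alphabetically (they were originally ordered by seeding, not alphabetically). For example 'China' < 'Netherlands', so their pairing shows up as ('China', 'Netherlands') rather than ('Netherlands', 'China').)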
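If the alphabetical ordering shown in the question's expected output is what you want, one option is to sort the team list before taking combinations. This is a sketch of that idea rather than part of the original answer; `seeded`, `by_seed` and `alphabetical` are illustrative names.

```python
import itertools

# Group A in the order it appears in df (by seeding)
seeded = ['Canada', 'Netherlands', 'China', 'New Zealand']

# Combinations in seeding order
by_seed = list(itertools.combinations(seeded, 2))

# Combinations after sorting the names alphabetically first
alphabetical = list(itertools.combinations(sorted(seeded), 2))

print(by_seed)
print(alphabetical)
```

Both lists contain the same 6 matches; only the order of the pairs, and of the teams within each pair, differs.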
Using R, here is a possible data.table solution, using its development version on GitHub:
#### To install development version
## library(devtools)
## install_github("Rdatatable/data.table", build_vignettes = FALSE)
library(data.table) ## v >= 1.9.5
setDT(df)[, transpose(combn(team, 2L, simplify = FALSE)), keyby = group]
# group V1 V2
# 1: A Canada Netherlands
# 2: A Canada China
# 3: A Canada New Zealand
# 4: A Netherlands China
# 5: A Netherlands New Zealand
# 6: A China New Zealand
# 7: B Germany Norway
# 8: B Germany Thailand
...