Get combinations of multiple columns (Cartesian product) of a pandas dataframe?
So I have a dataframe representing various model estimates for the likelihood of each of a group of candidates winning an election.
           Steve      John
Model1  0.327586  0.289474
Model2  0.322581  0.285714
Model3  0.303030  0.294118
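I want a dataframe that represents all combinations of model values across the columns, i.e. the Cartesian product of all columns. So the above would become the following.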
   model Steve  value Steve  model John  value John
0       Model1     0.327586      Model1    0.289474
1       Model1     0.327586      Model2    0.285714
2       Model1     0.327586      Model3    0.294118
3       Model2     0.322581      Model1    0.289474
4       Model2     0.322581      Model2    0.285714
5       Model2     0.322581      Model3    0.294118
6       Model3     0.303030      Model1    0.289474
7       Model3     0.303030      Model2    0.285714
8       Model3     0.303030      Model3    0.294118
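The above is the simple case, but in theory I'd like to be able to do this for N models and M candidates, giving a dataframe with N^M rows and 2M columns (in practice N < 20 and M < 6).
While looking for an answer I saw many suggestions to use the itertools module, but I couldn't figure out how to get all combinations across multiple lists (itertools.combinations seems to only find combinations within a single list).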
Use:
import pandas as pd
from itertools import product
# pair each value with its index label, column by column, then take the product across columns
a = product(*[zip(df.index, x) for x in df.T.values])
# create the new column names: one model/value pair per original column
cols = [c for x in df.columns for c in ('model_' + x, 'value_' + x)]
# flatten each combination of (model, value) pairs into one row for the DataFrame constructor
df1 = pd.DataFrame([[y for x in z for y in x] for z in a], columns=cols)
print(df1)
model_Steve value_Steve model_John value_John
0 Model1 0.327586 Model1 0.289474
1 Model1 0.327586 Model2 0.285714
2 Model1 0.327586 Model3 0.294118
3 Model2 0.322581 Model1 0.289474
4 Model2 0.322581 Model2 0.285714
5 Model2 0.322581 Model3 0.294118
6 Model3 0.303030 Model1 0.289474
7 Model3 0.303030 Model2 0.285714
8 Model3 0.303030 Model3 0.294118
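To check that this approach scales to the N^M rows and 2M columns the question asks for, here is a minimal sketch that wraps the same idea in a function and runs it on three candidates. The function name `cartesian_columns` and the third candidate `Mary` with its values are made up for illustration; they are not part of the original data.

```python
from itertools import product

import pandas as pd


def cartesian_columns(df):
    """Return every combination of (index label, value) pairs across the columns of df."""
    # One list of (model, value) pairs per column (candidate).
    per_column = [list(zip(df.index, df[col])) for col in df.columns]
    # Two output columns per candidate: the model label and its value.
    cols = [name for col in df.columns for name in (f"model_{col}", f"value_{col}")]
    # product() yields one tuple of pairs per combination; flatten each into a row.
    rows = [[item for pair in combo for item in pair] for combo in product(*per_column)]
    return pd.DataFrame(rows, columns=cols)


# Hypothetical data: the question's two candidates plus a made-up third one.
df = pd.DataFrame(
    {
        "Steve": [0.327586, 0.322581, 0.303030],
        "John": [0.289474, 0.285714, 0.294118],
        "Mary": [0.10, 0.20, 0.30],
    },
    index=["Model1", "Model2", "Model3"],
)

result = cartesian_columns(df)
print(result.shape)  # (27, 6): N**M = 3**3 rows and 2*M = 6 columns
```

Since the row count grows as N^M, the upper bounds mentioned in the question (N just under 20, M just under 6) would already give a few million rows, so it may be worth checking memory use before running it at that size.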
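It would be better to provide code so that we can quickly create the dataframe, not just a table. You can create a common key in any way you like and then get the final result with a database-style cross join. It can be done in one line, but I did it step by step.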
import pandas as pd
df = pd.DataFrame({'model': ['model1', 'model2'],
                   'steve': ['a', 'b'],
                   'jhon': ['c', 'd']
                   })
# create a common key shared by every row
df['key'] = 'xyz'
# create two separate dataframes for the self join
# (the selections on the right-hand side can also be passed
# directly to the merge function)
df_steve = df[['model', 'steve', 'key']]
df_jhon = df[['model', 'jhon', 'key']]
# self join on the common key, then drop the key
pd.merge(df_steve, df_jhon, on='key', suffixes=('_steve', '_jhon')).drop('key', axis=1)
Output:
model_steve steve model_jhon jhon
0 model1 a model1 c
1 model1 a model2 d
2 model2 b model1 c
3 model2 b model2 d
One-liner:
cross_df = pd.merge(df[['model', 'steve', 'key']],
                    df[['model', 'jhon', 'key']],
                    on='key',
                    suffixes=('_steve', '_jhon')
                    ).drop('key', axis=1)
Just change the column names as needed.
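On pandas 1.2 or later, the same result should be obtainable without the helper key by passing how='cross' to merge. A minimal sketch, reusing the sample frame from this answer; the `candidates` list and the reduce-based extension to more columns are my own illustration, not part of the original answer.

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({'model': ['model1', 'model2'],
                   'steve': ['a', 'b'],
                   'jhon': ['c', 'd']})

# how='cross' (pandas >= 1.2) builds the Cartesian product directly,
# so no dummy key column has to be created and dropped.
cross_df = pd.merge(df[['model', 'steve']],
                    df[['model', 'jhon']],
                    how='cross',
                    suffixes=('_steve', '_jhon'))
print(cross_df)

# Illustrative extension to any number of candidate columns:
# rename 'model' per candidate first so the names never collide,
# then fold the pieces together with successive cross joins.
candidates = ['steve', 'jhon']
pieces = [df[['model', c]].rename(columns={'model': f'model_{c}'})
          for c in candidates]
all_combos = reduce(lambda left, right: left.merge(right, how='cross'), pieces)
print(all_combos)
```

Renaming before the join avoids relying on suffixes, which only distinguish two frames at a time.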