Get the variables that produce the highest / lowest Pearson correlation while looping over the variables
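I am trying to achieve the following:
I have a DataFrame with many columns containing metrics and a few dimensions, such as country, device and name. Each of these 3 dimensions has a few unique values, which I use to filter the data before calling DataFrame.corr().
To demonstrate, I will use the titanic dataset.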
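import seaborn as sns
df_test = sns.load_dataset('titanic')
# print one correlation matrix per (who, embark_town) combination
for who in df_test['who'].unique():
    for where in df_test['embark_town'].unique():
        subset = df_test[(df_test['who'] == who) & (df_test['embark_town'] == where)]
        # numeric_only=True is required on pandas >= 2.0, where non-numeric columns raise
        print(subset.corr(numeric_only=True))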
This produces df_test['who'].nunique() * df_test['embark_town'].nunique() = 9 different correlation matrices.
An example is below:
            survived    pclass       age     sibsp     parch      fare
survived    1.000000 -0.198092  0.062199 -0.046691 -0.071417  0.108706
pclass     -0.198092  1.000000 -0.438377  0.008843 -0.015523 -0.485546
age         0.062199 -0.438377  1.000000 -0.049317  0.077529  0.199062
sibsp      -0.046691  0.008843 -0.049317  1.000000  0.464033  0.358680
parch      -0.071417 -0.015523  0.077529  0.464033  1.000000  0.415207
fare        0.108706 -0.485546  0.199062  0.358680  0.415207  1.000000
adult_male       NaN       NaN       NaN       NaN       NaN       NaN
alone       0.030464  0.133638 -0.022396 -0.629845 -0.506964 -0.411392
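I am trying to get data that answers this question:
In which setting is the correlation between each pair of variables the highest / lowest? The output could be a list, dict or df, like this: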
output = {'highest_corr_survived_p_class':['who = man', 'embark_town = Southampton', 0.65],
'lowest_corr_survived_p_class':['who = man', 'embark_town = Cherbourg',-0.32],
'highest_corr_survived_age':['who = female', 'embark_town = Cherbourg',0.75],
'lowest_corr_survived_age':['who = man', 'embark_town = Cherbourg',-0.3]
...
'lowest_corr_alone_fare':['who = man', 'embark_town = Cherbourg',-0.7]}
What I am struggling with is finding a good way to create this data, and how to put it into a df.
What I have tried:
output = {}
for who in df_test['who'].dropna().unique():
    for where in df_test['embark_town'].dropna().unique():
        subset = df_test[(df_test['who'] == who) & (df_test['embark_town'] == where)]
        # numeric_only=True is required on pandas >= 2.0
        output[f'{who}_{where}_corr'] = subset.corr(numeric_only=True).loc['survived', 'pclass']
which produces output:
{'man_Southampton_corr': -0.19809207465001574,
'man_Cherbourg_corr': -0.2102998217386208,
'man_Queenstown_corr': 0.06717166132798494,
'woman_Southampton_corr': -0.5525868192717193,
'woman_Cherbourg_corr': -0.5549942419871897,
'woman_Queenstown_corr': -0.16896381511084563,
'child_Southampton_corr': -0.5086941796202842,
'child_Cherbourg_corr': -0.2390457218668788,
'child_Queenstown_corr': nan}
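This approach also does not care which correlation is the max or the min, which is my goal.
I am not sure how to use loc[] to add all the possible variations between columns, or whether there is a better / simpler way to put everything into a df?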
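You can use DataFrameGroupBy.corr with DataFrame.stack, remove the rows equal to 1 and -1, and get the indices of the maximum and minimum value per variable pair with SeriesGroupBy.idxmax and SeriesGroupBy.idxmin. Select those rows with Series.loc, join them together with concat, and finally use a dictionary comprehension to build the final dict: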
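import pandas as pd
import seaborn as sns
df_test = sns.load_dataset('titanic')
# print (df_test)
# one correlation matrix per (who, embark_town) group, stacked into a long Series
# (numeric_only=True is required on pandas >= 2.0; dropna() removes undefined correlations)
s = df_test.groupby(['who','embark_town']).corr(numeric_only=True).stack().dropna()
# drop perfect correlations, which includes the diagonal
s = s[~s.isin([1, -1])]
# index levels 2 and 3 hold the variable pair; pick the max / min row per pair
s = (pd.concat([s.loc[s.groupby(level=[2,3]).idxmax()],
                s.loc[s.groupby(level=[2,3]).idxmin()]], keys=('highest','lowest'))
       .sort_index(level=[3,4], sort_remaining=False))
print (s)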
         who    embark_town
highest  child  Queenstown   age       alone     0.877346
lowest   woman  Queenstown   age       alone    -0.767493
highest  woman  Queenstown   age       fare      0.520461
lowest   child  Queenstown   age       fare     -0.877346
highest  woman  Queenstown   age       parch     0.633627
lowest   woman  Queenstown   survived  parch    -0.433029
highest  man    Queenstown   survived  pclass    0.067172
lowest   woman  Cherbourg    survived  pclass   -0.554994
highest  man    Queenstown   survived  sibsp     0.232685
lowest   child  Southampton  survived  sibsp    -0.692578
Length: 84, dtype: float64
output = {f'{k[0]}_corr_{k[3]}_{k[4]}':
          [f'who = {k[1]}', f'embark_town = {k[2]}', v] for k, v in s.items()}
print(output)
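Since the question also asks how to get everything into a df, here is a minimal sketch of the same idea in long form. It runs on synthetic data, so the column names (`town`, `a`, `b`, `c`) and all variable names are made up for illustration and the numbers have nothing to do with the titanic results. It uses melt instead of stack, and keeps each variable pair only once:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): two dimensions, three metrics.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "who": rng.choice(["man", "woman", "child"], size=n),
    "town": rng.choice(["S", "C", "Q"], size=n),
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
})
metrics = ["a", "b", "c"]

# One correlation matrix per (who, town) group.
corr = df.groupby(["who", "town"])[metrics].corr()
corr.index = corr.index.set_names(["who", "town", "var1"])

# Long form: one row per (who, town, var1, var2) combination.
long = (
    corr.reset_index()
        .melt(id_vars=["who", "town", "var1"], var_name="var2", value_name="corr")
        .dropna(subset=["corr"])
)
# Keep each unordered pair once; this also drops the diagonal.
long = long[long["var1"] < long["var2"]]

# Row labels of the largest / smallest correlation for every variable pair.
grouped = long.groupby(["var1", "var2"])["corr"]
highest = long.loc[grouped.idxmax()].assign(kind="highest")
lowest = long.loc[grouped.idxmin()].assign(kind="lowest")

result = (
    pd.concat([highest, lowest])
      .sort_values(["var1", "var2", "kind"])
      .reset_index(drop=True)
)
print(result)

# Optional: the dict layout from the question, built from the DataFrame.
output = {
    f"{r.kind}_corr_{r.var1}_{r.var2}": [f"who = {r.who}", f"town = {r.town}", r.corr]
    for r in result.itertuples()
}
```

Here `result` is an ordinary DataFrame with the columns who, town, var1, var2, corr and kind, so it can be filtered or exported directly.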
EDIT: For the top 3 and bottom 3, you can sort the values and use GroupBy.head and GroupBy.tail:
import pandas as pd
import seaborn as sns
df_test = sns.load_dataset('titanic')
# print (df_test)
# numeric_only=True is required on pandas >= 2.0; dropna() removes undefined correlations
s = df_test.groupby(['who','embark_town']).corr(numeric_only=True).stack().dropna()
# sort ascending, so head(3) holds the 3 lowest and tail(3) the 3 highest values per pair
s = s[~s.isin([1, -1])].sort_values()
s = (pd.concat([s.groupby(level=[2,3]).head(3),
                s.groupby(level=[2,3]).tail(3)], keys=('lowest','highest'))
       .sort_index(level=[3,4], sort_remaining=False)
    )
print (s)
         who    embark_town
lowest   woman  Queenstown   age       alone    -0.767493
                Cherbourg    age       alone    -0.073881
         man    Queenstown   age       alone    -0.069001
highest  child  Southampton  age       alone     0.169244
                Cherbourg    age       alone     0.361780
lowest   woman  Southampton  survived  sibsp    -0.252524
         man    Southampton  survived  sibsp    -0.046691
highest  man    Cherbourg    survived  sibsp     0.125276
         woman  Queenstown   survived  sibsp     0.143025
         man    Queenstown   survived  sibsp     0.232685
Length: 252, dtype: float64
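One thing to keep in mind with this pattern: sort_values() sorts ascending by default, so after sorting, head(n) per group returns the n smallest values and tail(n) the n largest. A minimal sketch with hand-written numbers (the `group` and `pair` columns and all values are hypothetical, not titanic results):

```python
import pandas as pd

# Made-up correlations for two variable pairs across four groups (illustration only).
long = pd.DataFrame({
    "group": ["g1", "g2", "g3", "g4", "g1", "g2", "g3", "g4"],
    "pair": ["x~y"] * 4 + ["x~z"] * 4,
    "corr": [0.1, -0.5, 0.7, 0.3, -0.2, 0.9, 0.0, -0.8],
})

ordered = long.sort_values("corr")  # ascending: smallest values first

# After an ascending sort, head() is the bottom-n and tail() is the top-n per pair.
lowest = ordered.groupby("pair").head(2).assign(kind="lowest")
highest = ordered.groupby("pair").tail(2).assign(kind="highest")

top_bottom = pd.concat([lowest, highest]).sort_values(["pair", "corr"])
print(top_bottom)
```

Sorting with ascending=False instead would flip the roles of head and tail, so the labels need to follow whichever sort order is used.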