Optimize changing variables to get max Pearson's correlation coefficient for multiple columns
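Edit:
I have a pandas DataFrame with 5 columns (Col1, Col2, Col3, Col4 and Col5), and I need to get the maximum Pearson's correlation coefficient between (Col2, Col3), (Col2, Col4) and (Col2, Col5), taking the values in Col1 into account. The modified values of Col2 are obtained from the following formulas:
df['Col1']=np.power((df['Col1']),B)
df['Col2']=df['Col2']*df['Col1']
where B is the changing variable (a single value) that should maximize the Pearson's correlation coefficient between (new values of Col2, Col3), (new values of Col2, Col4) and (new values of Col2, Col5).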
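Update:
The table above contains the 5 columns I mentioned, and the correlation coefficients between (Col2, Col3), (Col2, Col4) and (Col2, Col5) are shown below the table.
I need to change the values of Col2 according to the two equations above, where the changing value is B.
So the question is: how do I get the optimal value of B such that each new correlation coefficient is greater than or equal to its old counterpart?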
Update 2:
Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572
Not very elegant, but it works; feel free to make it more generic:
import pandas as pd
from scipy.optimize import minimize


def minimize_me(b, df):
    # we want to maximize, so we have to multiply by -1
    return -1 * df['Col3'].corr(df['Col2'] * df['Col1'] ** b)


# read your dataframe from somewhere, e.g. csv
df = pd.read_clipboard(sep=',')

# B is restricted to values >= 0 for now
bnds = [(0, None)]

# start the search from an initial guess of B = 1
res = minimize(minimize_me, (1), args=(df,), bounds=bnds)

if res.success:
    # that's the optimal B
    print(res.x[0])
    # that's the highest correlation you can get
    print(-1 * res.fun)
else:
    print("Sorry, the optimization was not successful. Try with another initial"
          " guess or optimization method")
This prints:
0.9020784246026575 # your B
0.7614993786787415 # highest correlation for corr(col2, col3)
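For now I read the data from the clipboard; replace that with your .csv file. You should also avoid hardcoding the column names; the code above is for demonstration purposes only, to give you an idea of how to set up the optimization problem itself.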
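One way to follow that advice is sketched below. This is not the original answer's code: the names `CSV_TEXT`, `negative_total_corr` and `find_best_exponent` are hypothetical, and the rows from the question are embedded as a string only so the example runs without a file. With a real file, `pd.read_csv` on the file path would presumably take the place of the `io.StringIO` call.

```python
import io

import numpy as np
import pandas as pd
from scipy.optimize import minimize

# Rows copied from the question; embedded here only to keep the sketch self-contained.
CSV_TEXT = """Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572
"""


def negative_total_corr(b, df, base_col, weight_col, target_cols):
    """Negative sum of Pearson correlations between the modified base column
    and each target column (negated because scipy minimizes)."""
    exponent = float(np.ravel(b)[0])  # scipy passes a length-1 array
    modified = df[base_col] * df[weight_col] ** exponent
    return -sum(df[col].corr(modified) for col in target_cols)


def find_best_exponent(df, base_col, weight_col, target_cols, x0=1.0):
    """Return (best exponent, best summed correlation, optimizer success flag)."""
    res = minimize(
        negative_total_corr,
        x0,
        args=(df, base_col, weight_col, list(target_cols)),
        bounds=[(0, None)],  # same B >= 0 restriction as above
    )
    return float(res.x[0]), float(-res.fun), bool(res.success)


df = pd.read_csv(io.StringIO(CSV_TEXT))
best_b, best_corr, ok = find_best_exponent(df, "Col2", "Col1", ["Col3"])
print(best_b, best_corr, ok)
```

Passing a single target column corresponds to the one-pair setup above, while passing several targets optimizes the sum of their correlations. The success flag is returned rather than raised so the caller can decide whether to retry with another initial guess.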
If you are interested in the sum of the three correlations, you can use the following (the rest of the code is unchanged):
def minimize_me(b, df):
    col_mod = df['Col2'] * df['Col1'] ** b

    # we want to maximize, so we have to multiply by -1
    return -1 * (df['Col3'].corr(col_mod) +
                 df['Col4'].corr(col_mod) +
                 df['Col5'].corr(col_mod))
This prints:
1.0452394748131613
2.3428368479642137
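Maximizing the sum does not by itself guarantee that every individual pair ends up at least as correlated as before, which is what the question asks for. A small check along the following lines could compare old and new coefficients for a given B. It is only a sketch: `compare_correlations` and `CSV_TEXT` are hypothetical names, and the B value passed in is simply the one printed above. Note that B = 0 leaves Col2 unchanged (Col1 ** 0 is 1), so the old coefficients are always attainable within the B >= 0 bound.

```python
import io

import pandas as pd

# Rows copied from the question; embedded here only to keep the sketch self-contained.
CSV_TEXT = """Col1,Col2,Col3,Col4,Col5
2,0.051361397,2618,1453,1099
4,0.053507779,306,153,150
2,0.041236151,39,54,34
6,0.094526419,2755,2209,1947
4,0.079773397,2313,1261,1022
4,0.083891415,3528,2502,2029
6,0.090737243,3594,2781,2508
2,0.069552772,370,234,246
2,0.052401789,690,402,280
2,0.039930675,1218,846,631
4,0.065952096,1706,523,453
2,0.053064126,314,197,123
6,0.076847486,4019,1675,1452
2,0.044881545,604,402,356
2,0.073102611,2214,1263,1050
0,0.046998526,938,648,572
"""


def compare_correlations(df, b, base_col="Col2", weight_col="Col1",
                         target_cols=("Col3", "Col4", "Col5")):
    """Tabulate old vs. new Pearson correlations for one value of B."""
    modified = df[base_col] * df[weight_col] ** b
    rows = []
    for col in target_cols:
        old = df[base_col].corr(df[col])   # correlation with the original Col2
        new = modified.corr(df[col])       # correlation with Col2 * Col1 ** B
        rows.append({"target": col, "old_corr": old, "new_corr": new,
                     "improved": bool(new >= old)})
    return pd.DataFrame(rows)


df = pd.read_csv(io.StringIO(CSV_TEXT))
# B taken from the output of the summed objective above
report = compare_correlations(df, b=1.0452394748131613)
print(report)
```

If any row shows `improved` as False, one option is to optimize that pair separately, or to add the per-pair requirement as a constraint, depending on whether a single shared B is mandatory.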