Correlation between binary variables in pandas

Question

I am trying to compute the correlation between binary variables using Cramér's V statistic:

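import numpy as np
import scipy.stats as ss

def cramers_corrected_stat(confusion_matrix):
    # Bias-corrected Cramér's V; expects a contingency table (DataFrame or 2-D array)
    confusion_matrix = np.asarray(confusion_matrix)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()  # total number of observations
    phi2 = chi2/n
    r,k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min( (kcorr-1), (rcorr-1)))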

However, I don't know how to apply the code above to my dataset:

      CL   UP   NS    P  CL_S
480    1    0    1    0     1
1232   1    0    1    0     1
2308   1    1    1    0     1
1590   1    0    1    0     1
497    1    1    0    0     1
...  ...  ...  ...  ...   ...
1066   1    1    1    0     1
1817   1    0    1    0     1
2411   1    1    1    0     1
2149   1    0    1    0     1
1780   1    0    1    0     1

Thank you very much for any guidance.

Answer 1

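The function you wrote takes a contingency table (for example, the output of pd.crosstab) rather than two raw columns, so it cannot be called directly on the columns of your dataset.

Instead, use the function cramers_V(var1, var2) given below, which builds the contingency table itself.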

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_V(var1, var2):
    crosstab = np.array(pd.crosstab(var1, var2, rownames=None, colnames=None))  # build the contingency table
    mini = min(crosstab.shape) - 1  # smaller of the number of rows and columns, minus 1
    if mini == 0:  # one variable is constant, so the statistic is undefined
        return np.nan
    stat = chi2_contingency(crosstab)[0]  # chi-squared test statistic
    obs = np.sum(crosstab)  # number of observations
    return np.sqrt(stat / (obs * mini))  # Cramér's V is the square root of chi2 / (n * mini)

Here is an example call:

cramers_V(df["CL"], df["NS"])
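One detail worth knowing for binary columns: scipy's chi2_contingency applies Yates' continuity correction by default on 2x2 tables, which lowers the statistic. With correction=False, Cramér's V for two 0/1 variables equals the absolute value of their Pearson (phi) correlation. The sketch below checks this on made-up toy data; the helper name cramers_v_uncorrected and the toy DataFrame are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_uncorrected(var1, var2):
    """Cramér's V without Yates' continuity correction (illustrative helper)."""
    table = pd.crosstab(var1, var2).to_numpy()
    mini = min(table.shape) - 1
    if mini == 0:  # one of the variables is constant
        return np.nan
    chi2 = chi2_contingency(table, correction=False)[0]
    return np.sqrt(chi2 / (table.sum() * mini))

# Made-up toy data, not the asker's dataset
toy = pd.DataFrame({
    "a": [1, 1, 1, 0, 0, 0, 1, 0],
    "b": [1, 1, 0, 0, 0, 1, 1, 0],
})

v = cramers_v_uncorrected(toy["a"], toy["b"])
phi = toy["a"].corr(toy["b"])  # Pearson correlation of the two 0/1 columns
print(v, abs(phi))  # the two values agree up to floating-point error
```

If the sign of the association matters, df.corr() on the binary columns may be the simpler choice, since Cramér's V is always non-negative.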

If you want to compute it for every possible pair of columns in the dataset, use this code:

import itertools
for col1, col2 in itertools.combinations(df.columns, 2):
    print(col1, col2, cramers_V(df[col1], df[col2]))
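The pairwise loop can also be collected into a symmetric table, which is often easier to read or plot than printed lines. This is a minimal sketch under a few assumptions: the helper names cramers_v and cramers_v_matrix and the toy data are made up for illustration, constant columns (such as a column that is all 1s) are reported as NaN, and the diagonal is set to 1.0 by convention for non-constant columns.

```python
import itertools
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(var1, var2):
    """Cramér's V for two categorical series; NaN if either is constant."""
    table = pd.crosstab(var1, var2).to_numpy()
    mini = min(table.shape) - 1
    if mini == 0:
        return np.nan
    chi2 = chi2_contingency(table)[0]
    return np.sqrt(chi2 / (table.sum() * mini))

def cramers_v_matrix(df):
    """Symmetric DataFrame of pairwise Cramér's V values (illustrative helper)."""
    cols = list(df.columns)
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for c1, c2 in itertools.combinations(cols, 2):
        value = cramers_v(df[c1], df[c2])
        out.loc[c1, c2] = value
        out.loc[c2, c1] = value
    for c in cols:
        if df[c].nunique() > 1:  # convention: a non-constant column vs. itself is 1
            out.loc[c, c] = 1.0
    return out

# Made-up toy data, not the asker's dataset
toy = pd.DataFrame({
    "CL": [1, 1, 1, 1, 1, 1],  # constant column, so its row and column are NaN
    "UP": [0, 0, 1, 0, 1, 1],
    "NS": [1, 1, 1, 1, 0, 1],
})
matrix = cramers_v_matrix(toy)
print(matrix)
```

Returning NaN for constant columns avoids a division by zero, because the contingency table then has a single row or column.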

python

correlation

pandas

confusion-matrix