How to compute correlation ratio or Eta in Python?

Question

According to the answer to this post,

The most classic "correlation" measure between a nominal and an interval ("numeric") variable is Eta, also called correlation ratio, and equal to the root R-square of the one-way ANOVA (with p-value = that of the ANOVA). Eta can be seen as a symmetric association measure, like correlation, because Eta of ANOVA (with the nominal as independent, numeric as dependent) is equal to Pillai's trace of multivariate regression (with the numeric as independent, set of dummy variables corresponding to the nominal as dependent).

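I would appreciate it if you could show me how to compute Eta in Python.

In fact, I have a dataframe with some numeric variables and some nominal variables.

Also, how can I plot a heatmap for it?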

Answer 1

An answer is provided here:

import numpy as np
import pandas as pd

def correlation_ratio(categories, measurements):
    # Work on a plain array so the positional masking below also works for a pandas Series.
    measurements = np.asarray(measurements)
    # Encode each category as an integer code 0..k-1.
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat) + 1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(cat_num):
        cat_measures = measurements[fcat == i]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(y_avg_array * n_array) / np.sum(n_array)
    # Between-group sum of squares divided by the total sum of squares.
    numerator = np.sum(n_array * (y_avg_array - y_total_avg) ** 2)
    denominator = np.sum((measurements - y_total_avg) ** 2)
    if numerator == 0:
        eta = 0.0
    else:
        eta = numerator / denominator
    return eta
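
The quote in the question says that Eta is the root of the one-way ANOVA R-square. As a rough sanity check of that statement, the ratio of the between-group sum of squares to the total sum of squares can be compared with the R-square recovered from the ANOVA F statistic. The sketch below assumes SciPy is installed and uses made-up toy groups; the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical toy data: three groups of numeric measurements.
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0, 10.0]),
    np.array([1.0, 2.0, 3.0]),
]

values = np.concatenate(groups)
grand_mean = values.mean()

# Between-group and total sums of squares, computed by hand.
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
sst = ((values - grand_mean) ** 2).sum()
eta_squared = ssb / sst

# One-way ANOVA on the same groups.
f_stat, p_value = f_oneway(*groups)
k, n = len(groups), len(values)

# Since F = (SSB / (k - 1)) / (SSW / (n - k)), R-square can be recovered from F.
r_squared_from_f = f_stat * (k - 1) / (f_stat * (k - 1) + (n - k))

print(f"eta^2 = {eta_squared:.4f}, R^2 from F = {r_squared_from_f:.4f}, p = {p_value:.4g}")
```

The two values should agree up to floating-point error, and `p_value` is the ANOVA p-value that the quote associates with Eta.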

Answer 2

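The answer above is missing the square root, so you will get eta squared. However, in the main article (the one used by User777), that issue has been fixed.
There is an article on Wikipedia about what the correlation ratio is and how to calculate it. I've created a simpler version of the calculations and will use the example from that article: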

import pandas as pd
import numpy as np

data = {'subjects': ['algebra'] * 5 + ['geometry'] * 4 + ['statistics'] * 6,
        'scores': [45, 70, 29, 15, 21, 40, 20, 30, 42, 65, 95, 80, 70, 85, 73]}
df = pd.DataFrame(data=data)

print(df.head(10))

>>> subjects    scores
  0 algebra     45
  1 algebra     70
  2 algebra     29
  3 algebra     15
  4 algebra     21
  5 geometry    40
  6 geometry    20
  7 geometry    30
  8 geometry    42
  9 statistics  65

def correlation_ratio(categories, values):
    categories = np.array(categories)
    values = np.array(values)

    ssw = 0  # within-group sum of squares
    ssb = 0  # between-group sum of squares
    for category in set(categories):
        subgroup = values[np.where(categories == category)[0]]
        ssw += sum((subgroup - np.mean(subgroup))**2)
        ssb += len(subgroup) * (np.mean(subgroup) - np.mean(values))**2

    # Eta is the square root of the between-group share of the total variation.
    return (ssb / (ssb + ssw))**.5

coef = correlation_ratio(df['subjects'], df['scores'])

print('Eta_squared: {:.4f}\nEta: {:.4f}'.format(coef**2, coef))

>>> Eta_squared: 0.7033
    Eta: 0.8386
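
For the dataframe and heatmap part of the question, one possible approach is to compute Eta for every (nominal, numeric) pair of columns and draw the resulting matrix. The following is only a sketch under assumptions: `eta_matrix` and `plot_eta_heatmap` are hypothetical helper names, the `level` and `hours` columns are made up for illustration, and matplotlib is assumed to be available for the plot (it is imported lazily, so the matrix itself only needs pandas and NumPy).

```python
import numpy as np
import pandas as pd


def correlation_ratio(categories, values):
    """Eta for one nominal and one numeric column (groupby-based sketch)."""
    frame = pd.DataFrame({
        "cat": np.asarray(categories),
        "val": np.asarray(values, dtype=float),
    }).dropna()
    grand_mean = frame["val"].mean()
    stats = frame.groupby("cat")["val"].agg(["mean", "count"])
    ssb = (stats["count"] * (stats["mean"] - grand_mean) ** 2).sum()
    sst = ((frame["val"] - grand_mean) ** 2).sum()
    # A constant numeric column has no variation to explain.
    return float(np.sqrt(ssb / sst)) if sst > 0 else 0.0


def eta_matrix(df, nominal_cols, numeric_cols):
    """Rows are nominal columns, columns are numeric columns, cells are Eta."""
    data = [[correlation_ratio(df[c], df[v]) for v in numeric_cols]
            for c in nominal_cols]
    return pd.DataFrame(data, index=nominal_cols, columns=numeric_cols)


def plot_eta_heatmap(matrix):
    """Draw the Eta matrix as a heatmap with matplotlib."""
    import matplotlib.pyplot as plt  # lazy import: only needed for plotting

    fig, ax = plt.subplots()
    image = ax.imshow(matrix.to_numpy(), vmin=0.0, vmax=1.0, cmap="viridis")
    ax.set_xticks(range(matrix.shape[1]))
    ax.set_xticklabels(matrix.columns, rotation=45, ha="right")
    ax.set_yticks(range(matrix.shape[0]))
    ax.set_yticklabels(matrix.index)
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            value = matrix.iat[i, j]
            # Dark text on bright cells, light text on dark cells.
            ax.text(j, i, f"{value:.2f}", ha="center", va="center",
                    color="black" if value > 0.5 else "white")
    fig.colorbar(image, ax=ax, label="Eta")
    fig.tight_layout()
    return fig


# 'subjects' and 'scores' reuse the example data from this answer;
# 'level' and 'hours' are hypothetical columns added for illustration.
df = pd.DataFrame({
    "subjects": ["algebra"] * 5 + ["geometry"] * 4 + ["statistics"] * 6,
    "level": ["a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a", "b", "a"],
    "scores": [45, 70, 29, 15, 21, 40, 20, 30, 42, 65, 95, 80, 70, 85, 73],
    "hours": [2, 5, 1, 1, 2, 3, 1, 2, 3, 4, 8, 6, 5, 7, 5],
})

etas = eta_matrix(df, ["subjects", "level"], ["scores", "hours"])
print(etas.round(4))
```

Calling `plot_eta_heatmap(etas)` followed by `matplotlib.pyplot.show()` should display the figure. If seaborn is installed, `seaborn.heatmap(etas, annot=True, vmin=0, vmax=1)` is a shorter alternative for the plotting step.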

statistics

correlation

python-3.x

pandas

categorical-data