Applying a Correlation Function to Multiple Subsets of a Dataframe and Concatenating the Results in one Frame

Question

I have a Pandas dataframe named "df" with the following columns:

   Income  Income_Quantile  Score_1  Score_2  Score_3
0  100000                5       75       75      100
1   70000                4       55       77       80
2   50000                3       66       50       60
3   12000                1       22       60       30
4   35000                2       61       50       53
5   30000                2       66       35       77

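I also have a for loop that uses the "Income_Quantile" variable to select subsets of the dataframe. The loop then drops the "Income_Quantile" variable, which was used to slice the main dataframe, "df".

The code is as follows: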

for level in df.Income_Quantile.unique():
    df_s = df.loc[df.Income_Quantile == level].drop(columns='Income_Quantile')
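One thing to watch in the loop above: `df_s` is rebound on every pass, so only the last subset is still available once the loop finishes. A minimal sketch of one way to keep every subset, assuming a dictionary keyed by quantile is acceptable (the name `subsets` is hypothetical and not from the original post):

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Income": [100000, 70000, 50000, 12000, 35000, 30000],
    "Income_Quantile": [5, 4, 3, 1, 2, 2],
    "Score_1": [75, 55, 66, 22, 61, 66],
    "Score_2": [75, 77, 50, 60, 50, 35],
    "Score_3": [100, 80, 60, 30, 53, 77],
})

# groupby yields one (level, sub-frame) pair per distinct Income_Quantile value;
# the grouping column is dropped from each piece, as in the original loop
subsets = {
    level: group.drop(columns="Income_Quantile")
    for level, group in df.groupby("Income_Quantile")
}

for level, df_s in subsets.items():
    print(level, df_s.shape)
```

With the subsets stored this way, a later step can iterate over `subsets.items()` and apply the same function to each piece.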

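Now I would like to compute the Spearman rank correlation between the "Income" variable and the "Score_1", "Score_2", and "Score_3" variables in "df_s".

I would also like to concatenate the results into a single frame with the following structure: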

            Income Quantile  Score_1    Score_2     Score_3
correlation         ….         ….          ….          ….
p-value             ….         ….          ….          ….
t-statistic         ….         ….          ….          ….

I think the following approach, taken from another question I asked, might help:

# "correlations" will be a helper function that calculates the Spearman's rank correlation of each subset against the "Income" variable and outputs the p-value and t-statistic of the test for each variable.
result = {key: correlations(val) for key, val in df_s.items()}
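For illustration, here is a rough sketch of what such a `correlations` helper could look like. The parameter names `target` and `score_cols` are assumptions, and the t-statistic uses the usual approximation t = rho * sqrt((n - 2) / (1 - rho^2)) with n - 2 degrees of freedom, which may or may not be the statistic the asker had in mind:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Sample data copied from the question
df = pd.DataFrame({
    "Income": [100000, 70000, 50000, 12000, 35000, 30000],
    "Income_Quantile": [5, 4, 3, 1, 2, 2],
    "Score_1": [75, 55, 66, 22, 61, 66],
    "Score_2": [75, 77, 50, 60, 50, 35],
    "Score_3": [100, 80, 60, 30, 53, 77],
})

def correlations(subset, target="Income", score_cols=("Score_1", "Score_2", "Score_3")):
    """Spearman correlation, p-value and t-statistic of `target` against each score column."""
    n = len(subset)
    out = {}
    for col in score_cols:
        rho, p_value = spearmanr(subset[target], subset[col])
        rho = np.float64(rho)
        # t approximation for testing rho == 0; becomes inf when |rho| == 1
        with np.errstate(divide="ignore", invalid="ignore"):
            t_stat = rho * np.sqrt((n - 2) / (1.0 - rho**2))
        out[col] = {"correlation": rho, "p-value": p_value, "t-statistic": t_stat}
    # Columns are the score variables, rows are the three statistics
    return pd.DataFrame(out)

print(correlations(df))
```

The sketch is run on the whole frame here because the sample quantile groups hold only one or two rows each, which is too few for a meaningful rank correlation.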

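However, I currently have no idea how to carry out the subsequent steps.

Does anyone have any pointers on how I can get from where I am now to where I want to be? This happens to be my weakest area in Python, and I am stuck.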

Answer 1

Is this what you are looking for?

import pandas as pd
from scipy.stats import spearmanr, ttest_ind

cols = ['Score_1','Score_2','Score_3']
df_result = pd.DataFrame(columns=cols)
# ttest_ind returns (statistic, p-value) for an independent two-sample t-test
df_result.loc['t-statistic'] = [ttest_ind(df['Income'], df[x])[0] for x in cols]
df_result.loc['p-value'] = [ttest_ind(df['Income'], df[x])[1] for x in cols]
# spearmanr returns (correlation, p-value); index 0 is the correlation coefficient
df_result.loc['correlation'] = [spearmanr(df['Income'], df[x])[0] for x in cols]
print(df_result)

Output:

              Score_1   Score_2   Score_3
t-statistic  3.842307  3.842281  3.841594
p-value      0.003253  0.003253  0.003257
correlation  0.550782  0.579771  0.828571

Here df_result['Score_1'] holds the t-statistic, p-value, and Spearman correlation computed for df['Income'] and df['Score_1']. Let me know if this helps.
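The snippet above works on the whole frame, whereas the question asks for one set of statistics per income quantile, joined into a single frame. A possible sketch of that last step follows, using made-up data (the six-row sample has too few rows per quantile to be useful). The helper name `spearman_rows` and the generated values are purely illustrative assumptions:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

score_cols = ["Score_1", "Score_2", "Score_3"]

# Made-up data for illustration only: 50 rows with distinct incomes
rng = np.random.default_rng(0)
df = pd.DataFrame({"Income": rng.permutation(50) * 1000 + 10_000})
for col in score_cols:
    df[col] = rng.integers(20, 100, size=50)
# Five equal-sized income groups labelled 1..5
df["Income_Quantile"] = pd.qcut(df["Income"], 5, labels=False) + 1

def spearman_rows(subset):
    """Spearman correlation and p-value of Income against each score column."""
    out = {}
    for col in score_cols:
        rho, p_value = spearmanr(subset["Income"], subset[col])
        out[col] = {"correlation": rho, "p-value": p_value}
    return pd.DataFrame(out)

# One small frame per quantile, stacked with the quantile as the outer index level
pieces = {level: spearman_rows(group) for level, group in df.groupby("Income_Quantile")}
result = pd.concat(pieces, names=["Income_Quantile", "statistic"])
print(result)
```

Passing a dictionary to `pd.concat` turns its keys into the outer index level, so `result.loc[3]` returns the rows for the third quantile, and `result.reset_index()` gives a flat frame with an "Income_Quantile" column closer to the layout sketched in the question.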
