np.corrcoef behavior with pandas DataFrames

Question

I'm seeing the following behavior with a script I built that computes a series of sums for each group:

In [291]: sums_per_group2
Out[291]: 
        test_group  control_group
one    4551.658544         4449.3
three  3770.712771         3430.5
two    9328.171538         8673.9

In [292]: sums_per_group2.shape
Out[292]: (3, 2)

In [293]: np.corrcoef(sums_per_group2)
Out[293]: 
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]]) 

In [294]: np.corrcoef(sums_per_group2.values)
Out[294]: 
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [295]: sums_per_group2.values.shape  
Out[295]: (3, 2)

In [296]: np.corrcoef(sums_per_group2.iloc[:, 0], sums_per_group2.iloc[:, 1])
Out[296]: 
array([[ 1.        ,  0.99853641],
       [ 0.99853641,  1.        ]])

In [296]: sums_per_group2.iloc[:, 0].shape
Out[296]: (3,)

In [297]: sums_per_group2.iloc[:, 1].shape
Out[297]: (3,)

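As you can see, the shapes of the inputs to np.corrcoef() are what I expect in every case, yet the results differ: passing the whole DataFrame (or its .values) returns a 3x3 matrix of ones, while passing the two columns separately returns the 2x2 matrix I was expecting.

Can anyone help me understand this?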

Answer 1

If you want to pass it as a 2-D array, you need to transpose sums_per_group2:

np.corrcoef(sums_per_group2.T)
# array([[ 1.        ,  0.99853641],
#        [ 0.99853641,  1.        ]])

Here is the documentation for the x parameter:

x : array_like A 1-D or 2-D array containing multiple variables and observations. Each row of x represents a variable, and each column a single observation of all those variables. Also see rowvar below.

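When passing a 2-D array, make sure the columns are observations and the rows are variables (features); the correlation is computed between rows. Alternatively, set rowvar=0 (equivalently rowvar=False) so that the columns are treated as the variables:

np.corrcoef(sums_per_group2, rowvar=0)
# array([[ 1.        ,  0.99853641],
#        [ 0.99853641,  1.        ]])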
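To see the three call styles side by side, here is a minimal sketch that rebuilds the DataFrame from the values printed in the question (the construction itself is an assumption, since the original script is not shown) and compares the shapes of the results:

```python
import numpy as np
import pandas as pd

# Values copied from the question's output; how the real frame was built is unknown.
sums_per_group2 = pd.DataFrame(
    {
        "test_group": [4551.658544, 3770.712771, 9328.171538],
        "control_group": [4449.3, 3430.5, 8673.9],
    },
    index=["one", "three", "two"],
)

# Default rowvar=True: each of the 3 rows is a variable, so the result is 3x3.
by_rows = np.corrcoef(sums_per_group2)

# Transposed, or rowvar=False: each of the 2 columns is a variable, so the result is 2x2.
by_cols_t = np.corrcoef(sums_per_group2.T)
by_cols_rowvar = np.corrcoef(sums_per_group2, rowvar=False)

# Passing the two columns as separate 1-D inputs gives the same 2x2 matrix.
by_pair = np.corrcoef(sums_per_group2.iloc[:, 0], sums_per_group2.iloc[:, 1])

print(by_rows.shape)                            # (3, 3)
print(by_cols_t.shape)                          # (2, 2)
print(np.allclose(by_cols_t, by_cols_rowvar))   # True
print(np.allclose(by_cols_t, by_pair))          # True
```

The transpose and rowvar=False are interchangeable here; which one reads better is mostly a matter of taste.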

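If you don't transpose the 2-D array, the function treats each row as a variable, so in your first two cases it computes the correlation coefficient for every pair of rows. Each row is a vector of length 2, which is why every coefficient comes out as 1: it is like fitting a line through two points, which is always a perfect fit.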
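The two-point effect is easy to check on made-up data. The sketch below uses hypothetical random rows (not the question's data) with only two observations each; every pairwise coefficient ends up at +1 or -1, with the sign showing whether the two rows move in the same direction:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

# Five hypothetical "variables", each observed only twice (rows of length 2).
rows = rng.normal(size=(5, 2))

r = np.corrcoef(rows)

# With two observations per variable, the magnitude of every coefficient is 1.
print(r.shape)                       # (5, 5)
print(np.allclose(np.abs(r), 1.0))   # True
```

In the question all three rows have test_group larger than control_group, so every pair moves in the same direction and the matrix is all ones.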
