Converting large dataframe to nd.array, doing spearman corr

Question

I have a large dataset with samples as the index and names as the header (500 x 30000). For example:

          Name1    Name2    Name3
Sample1   232.12   0.239    -0.324
Sample2   0.928    23.213   -0.056
Sample3   -0.231   7.7776   -0.984

What I want to get:

          Name1    Name2    Name3
Name1      1        0.001    corr val
Name2      corr val   1      corr val
Name3      corr val  corr val   1

etc.

I considered:

np.corrcoef(data)

But that only computes Pearson's correlation, and I get an error stating that the data is too large.

I tried splitting it up:

lst = []
data = For_spearman.to_numpy()
#data = np.delete(data, (0), axis=0)
data_size = len(data)-1
for key1 in range(1, data_size): #Ignoring first column which is index
    if key1 != data_size-1: # Cant compare after the last row, so -1 and -1.
        for key2 in range(key1+1 ,data_size): # Comparing name1 vs name2
            test = scipy.stats.spearmanr(data[key1][1:], data[key2][1:])
            lst .append([data[key1][0], data[key2][0], test])
            pd.DataFrame(lst ).to_csv('ForSpearman.csv')

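But I keep ending up with a mess, because I always get tangled up in the nd.array somehow. How can I do what np.corrcoef does, but with Spearman's method, split up so that one array is compared against another each time?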

Answer 1

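Here is your problem: you are trying to create a 30000 x 30000 matrix, and that matrix alone takes 7.2GB. 16GB is probably not enough once the intermediate arrays are counted. One approach, though, is to loop. It will be slow, but it may be feasible on your system: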

import numpy as np
import pandas as pd

# stand-in data with the same shape as in the question
df = pd.DataFrame(np.random.rand(500, 30000))

out = pd.DataFrame(index=df.columns, columns = df.columns)

# you can also loop in chunks of columns
for col in df:
    out[col] = df.corrwith(df[col], method='spearman')
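
The comment about looping in chunks of columns could be sketched roughly as follows. This is not part of the loop above but a swapped-in technique: Spearman correlation is Pearson correlation computed on ranks, so the columns can be ranked once and each block of the result obtained with a matrix product. The helper name `spearman_in_chunks`, the `chunk_size` default and the float32 output are assumptions made for illustration, and the sketch assumes there are no missing values and no constant columns.

```python
import numpy as np
import pandas as pd


def spearman_in_chunks(df, chunk_size=1000):
    """Hypothetical helper: Spearman matrix built one column block at a time."""
    # Spearman = Pearson on ranks, so rank each column once (ties get average ranks).
    ranks = df.rank(axis=0).to_numpy(dtype=np.float64)
    # Standardize each column; then (z.T @ z) / n_rows is the Pearson correlation.
    z = (ranks - ranks.mean(axis=0)) / ranks.std(axis=0)
    n_rows, n_cols = z.shape
    # Assumption: float32 precision is acceptable, which halves the output size.
    out = np.empty((n_cols, n_cols), dtype=np.float32)
    for start in range(0, n_cols, chunk_size):
        stop = min(start + chunk_size, n_cols)
        # Only an (n_cols x chunk) intermediate block exists at any one time.
        out[:, start:stop] = z.T @ z[:, start:stop] / n_rows
    return pd.DataFrame(out, index=df.columns, columns=df.columns)
```

It would be called as `out = spearman_in_chunks(df, chunk_size=1000)`. The full 30000 x 30000 result still has to fit in memory, so this mainly limits the size of the intermediate arrays rather than the size of the output.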

Update: the following may need less memory:

out = pd.concat([df.corrwith(df[col], method='spearman')
                   .to_frame(name=col) for col in df.columns],
                 axis=1)

Even so, I think 12~16GB is very limiting in this case. Also, the loop will take forever.
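
For reference, the 7.2GB figure in this answer is just the number of matrix entries times 8 bytes per float64 value. A small check is sketched below; the helper name `matrix_gigabytes` is made up for illustration, and the float32 line is only there to show what a smaller dtype would cost.

```python
import numpy as np


def matrix_gigabytes(n_names, dtype):
    """Hypothetical helper: size in GB of a dense n_names x n_names matrix."""
    return n_names * n_names * np.dtype(dtype).itemsize / 1e9


n_names = 30_000  # number of columns (names) in the question
for dtype in (np.float64, np.float32):
    print(f"{np.dtype(dtype).name}: {matrix_gigabytes(n_names, dtype):.1f} GB")
# prints:
# float64: 7.2 GB
# float32: 3.6 GB
```

This counts only the result matrix; ranking and the intermediate arrays come on top of it.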

correlation

pandas

numpy-ndarray