Correlation coefficient and p-value for each row within a DataFrame
I have a matrix that looks like this:
import pandas as pd

foo = pd.DataFrame(
[['ASP1',12.45,12.65,1.54,1.56],
['ASP2',4.5,1.4,0.03,1.987],
['ASP3',0.12,0.34,0.45,0.9],
['ASP4',0.65,0.789,0.01,0.876]],
columns = ('Sam','C1','C2','B1','B2'))
foo
Sam C1 C2 B1 B2
0 ASP1 12.45 12.650 1.54 1.560
1 ASP2 4.50 1.400 0.03 1.987
2 ASP3 0.12 0.340 0.45 0.900
3 ASP4 0.65 0.789 0.01 0.876
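I want to run a correlation test for each row in Sam, between the C1..C2 columns and the B1..B2 columns. In the end, my target matrix would look like this:
foo_result = pd.DataFrame(
[['C',0.76,0.06],
['B',0.34,0.10]],
columns = ('Gene','Correlation_coefficent','P-value'))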
foo_result
Gene Correlation_coefficent P-value
0 C 0.76 0.060
1 B 0.34 0.100
Any suggestions or solutions would be great.
Thanks.
This should do it:
from scipy.stats import pearsonr
# Select the C and B columns by name prefix
c_values = [column for column in foo.columns if column.startswith('C')]
b_values = [column for column in foo.columns if column.startswith('B')]
# Run pearsonr once per row (axis=1); each call returns (coefficient, p-value)
results = foo.apply(lambda x: pearsonr(x[c_values].astype(float), x[b_values].astype(float)), axis=1)
foo['Correlation_coefficent'], foo['P-value'] = zip(*results)
foo_result = foo[['Sam', 'Correlation_coefficent','P-value']]
Output:
Sam Correlation_coefficent P-value
0 ASP1 1.0 0.0
1 ASP2 -1.0 0.0
2 ASP3 1.0 0.0
3 ASP4 1.0 0.0
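The reason you get these results is the number of variables: with only two values per group, the coefficient can only be +1 or -1. Hopefully your original data has at least 3 values per group.
from scipy.stats import pearsonr
# Position 0 is Sam, positions 1-2 are C1..C2, positions 3-4 are B1..B2
foo[['corr_coef', 'p_value']] = foo.apply(lambda x: pearsonr(x=x.iloc[1:3].astype(float), y=x.iloc[3:5].astype(float)), axis=1).apply(pd.Series)
The output is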
Sam C1 C2 B1 B2 corr_coef p_value
0 ASP1 12.45 12.650 1.54 1.560 1.0 0.0
1 ASP2 4.50 1.400 0.03 1.987 -1.0 0.0
2 ASP3 0.12 0.340 0.45 0.900 1.0 0.0
3 ASP4 0.65 0.789 0.01 0.876 1.0 0.0
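The coefficients of 1.0 and -1.0 above follow from having only two points per group: a straight line always passes exactly through two points. A minimal sketch that recomputes the coefficient by hand for the four rows of foo; the helper pearson_r is a hypothetical name written for this illustration, not a scipy function:

```python
import math

def pearson_r(xs, ys):
    """Plain-Python Pearson correlation coefficient (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Rows taken from foo: (C1, C2) against (B1, B2)
rows = {
    'ASP1': ([12.45, 12.65], [1.54, 1.56]),
    'ASP2': ([4.5, 1.4], [0.03, 1.987]),
    'ASP3': ([0.12, 0.34], [0.45, 0.9]),
    'ASP4': ([0.65, 0.789], [0.01, 0.876]),
}
two_point_r = {name: pearson_r(c, b) for name, (c, b) in rows.items()}

for name, r in two_point_r.items():
    # The sign only says whether both pairs move in the same direction
    print(name, round(r, 6))
```

Only the sign carries information here: ASP2 is negative because C falls from C1 to C2 while B rises from B1 to B2.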
If C and B each have 112 columns, widen both slices to cover 112 positions each: pearsonr(x=x.iloc[1:113], y=x.iloc[113:225])
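With many columns, hard-coded positions are easy to get wrong, so it may be simpler to pick the two groups by name. A rough sketch, assuming the columns are named C1..Cn and B1..Bn and that Ci pairs with Bi; the data below is randomly generated for illustration only, and the names wide, c_block, b_block and wide_result are made up for this example:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Made-up data for illustration only: 4 samples, 5 C columns and 5 B columns
rng = np.random.default_rng(0)
n_cols = 5
wide = pd.DataFrame(
    rng.normal(size=(4, 2 * n_cols)),
    columns=[f'C{i}' for i in range(1, n_cols + 1)]
            + [f'B{i}' for i in range(1, n_cols + 1)],
)
wide.insert(0, 'Sam', ['ASP1', 'ASP2', 'ASP3', 'ASP4'])

# Pick the two groups by column-name pattern instead of by position
c_block = wide.filter(regex=r'^C\d+$').to_numpy(dtype=float)
b_block = wide.filter(regex=r'^B\d+$').to_numpy(dtype=float)

# One pearsonr call per row; each call returns (coefficient, p-value)
coefs, p_values = [], []
for c_row, b_row in zip(c_block, b_block):
    r, p = pearsonr(c_row, b_row)
    coefs.append(float(r))
    p_values.append(float(p))

wide_result = pd.DataFrame({
    'Sam': wide['Sam'],
    'corr_coef': coefs,
    'p_value': p_values,
})
print(wide_result)
```

Selecting by pattern keeps the code unchanged whether there are 5 or 112 columns per group, as long as both groups have the same number of columns.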