How to convert a pandas dataframe to a defaultdict (class, list) where one of the column values is used as keys?

Question

Given the following pandas dataframe:

df = 

     contig       pos  PI_index  hapX_My_Sum  hapY_My_Sum  hapX_Sp_Sum       
 0  2  16229767           726          0.0         12.0          3.5   
 1  2  16229783           726          0.0         12.0          3.5   
 3  2  16229880           726          0.0         12.0          2.0   
 4  2  16230491           255         12.0          0.0          0.0   
 5  2  16230503           255         12.0          0.0          0.0   
 6  2  16232072           255         11.0          1.0          0.0   
 7  2  16232072           255         11.0          1.0          0.0   
 8  2  16232282          3353         11.0          1.0          0.0   
 9  2  16232444          3353         11.0          1.0          0.0   
 10 2  16232444          3353         11.0          1.0          0.0

I want to convert this dataframe to a dictionary of dictionaries, i.e. a defaultdict(dict).

So I did:

from collections import defaultdict
df_dict = df.to_dict('index')

print(df_dict)  # gives me
{0: {'hapY_My_Sum': 12.0, 'hapX_Sp_Sum': 3.5 .....}

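That all works, but instead of the main pandas index I want to use PI_index as the index, producing a defaultdict(<class 'dict'>, ...) whose keys are the PI_index values, for downstream analysis.

The printed defaultdict should look like this: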

defaultdict(<class 'dict'>, {'726': {'contig': '2', 'hapX_My_Sum': ['0.0', '0.0', '0.0'], 'hapY_My_Sum': ['12.0', '12.0', '12.0'], ....}, '255':{'contig': '2', 'hapX_My_Sum': [....]....}})

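Post edit:

I forgot to add: is there a way to deselect certain unwanted columns without deleting them from the pandas dataframe?
Also, what if I only want a single value for contig, since they are all the same?

So that downstream I can do something like: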

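from functools import reduce
from operator import mul

for k in df_dict:
    contig = df_dict[k]['contig']  # the column is named 'contig' in df, not 'chr'
    # product of all hapX_My_Sum values stored under this key
    hapX_My_product = reduce(mul, (float(x) for x in df_dict[k]['hapX_My_Sum']))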

Answer 1

Is this what you want?

In [11]: cols = ['contig','PI_index','hapX_My_Sum']

In [12]: df[cols].groupby('PI_index') \
                 .apply(lambda x: x.set_index('PI_index').to_dict('list')) \
                 .to_dict()
Out[12]:
{255: {'contig': [2, 2, 2, 2], 'hapX_My_Sum': [12.0, 12.0, 11.0, 11.0]},
 726: {'contig': [2, 2, 2], 'hapX_My_Sum': [0.0, 0.0, 0.0]},
 3353: {'contig': [2, 2, 2], 'hapX_My_Sum': [11.0, 11.0, 11.0]}}

Some explanation:

First we generate a dictionary of lists for each group:

In [87]: df[cols].groupby('PI_index') \
    ...:         .apply(lambda x: x.set_index('PI_index').to_dict('list'))
Out[87]:
PI_index
255     {'contig': [2, 2, 2, 2], 'hapX_My_Sum': [12.0,...
726     {'contig': [2, 2, 2], 'hapX_My_Sum': [0.0, 0.0...
3353    {'contig': [2, 2, 2], 'hapX_My_Sum': [11.0, 11...
dtype: object
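
To see why the lambda calls set_index('PI_index') before to_dict('list'), here is a minimal sketch on a made-up two-row group (the `group` frame is an assumption for illustration, not the asker's data). to_dict('list') exports every column but not the index, so moving PI_index into the index keeps it out of the per-group dictionary.

```python
import pandas as pd

# Hypothetical stand-in for one group produced by groupby('PI_index').
group = pd.DataFrame({
    'PI_index': [726, 726],
    'contig': [2, 2],
    'hapX_My_Sum': [0.0, 0.0],
})

# Without set_index, the key column is exported along with the data columns.
with_key = group.to_dict('list')

# With PI_index moved into the index, only the data columns are exported.
without_key = group.set_index('PI_index').to_dict('list')

print(sorted(with_key))
print(sorted(without_key))
```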

Now we can export the result as a dictionary with .to_dict(); the PI_index values in its index become the keys:

In [88]: df[cols].groupby('PI_index') \
    ...:         .apply(lambda x: x.set_index('PI_index').to_dict('list')) \
    ...:         .to_dict()
Out[88]:
{255: {'contig': [2, 2, 2, 2], 'hapX_My_Sum': [12.0, 12.0, 11.0, 11.0]},
 726: {'contig': [2, 2, 2], 'hapX_My_Sum': [0.0, 0.0, 0.0]},
 3353: {'contig': [2, 2, 2], 'hapX_My_Sum': [11.0, 11.0, 11.0]}}
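
For the follow-up in the question's edit (a defaultdict, only some columns, a single contig value), something along these lines might work. It is only a sketch: the small df below is a hand-typed subset of the question's data, value_cols is a name made up here, and it iterates over the groups rather than using groupby.apply, whose handling of the grouping column has changed in recent pandas releases.

```python
from collections import defaultdict

import pandas as pd

# Hand-typed subset of the question's dataframe (assumption: same column names).
df = pd.DataFrame({
    'contig': [2, 2, 2, 2, 2],
    'pos': [16229767, 16229783, 16229880, 16230491, 16230503],
    'PI_index': [726, 726, 726, 255, 255],
    'hapX_My_Sum': [0.0, 0.0, 0.0, 12.0, 12.0],
    'hapY_My_Sum': [12.0, 12.0, 12.0, 0.0, 0.0],
})

# Columns to export as lists; unwanted columns such as 'pos' are simply
# not listed, and df itself is left untouched.
value_cols = ['hapX_My_Sum', 'hapY_My_Sum']

df_dict = defaultdict(dict)
for pi_index, group in df.groupby('PI_index'):
    # contig is the same on every row of a group, so keep one scalar value.
    df_dict[pi_index]['contig'] = group['contig'].iloc[0]
    # Each selected column becomes a list of that group's values.
    df_dict[pi_index].update(group[value_cols].to_dict('list'))

print(df_dict[726]['hapX_My_Sum'])
```

The keys stay as integers here; wrap pi_index in str() if string keys like '726' are needed downstream.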

Answer 2

Here is another way to think about it.

1. Convert your records into a list of dictionaries:

df_arrDict = df.to_dict('records')
> [{'PI_index': 726.0,
    'contig': 2.0,
    'hapX_My_Sum': 0.0,
    'hapX_Sp_Sum': 3.5,
    'hapY_My_Sum': 12.0,
    'pos': 16229767.0},
    {....
    ]

2. A defaultdict is handy for grouping data:

  from collections import  defaultdict
  by_PI = defaultdict(list)
  for df_dict in df_arrDict:
      feature = df_dict['PI_index']
      by_PI[feature].append(df_dict)

Finally, to convert back to a plain dictionary, you can use a dictionary comprehension such as the one below:

by_PI_dict = { int(key):val for key,val in by_PI.items()}
by_PI_dict[255]
[{'PI_index': 255.0,
    'contig': 2.0,
    'hapX_My_Sum': 12.0,
    'hapX_Sp_Sum': 0.0,
    'hapY_My_Sum': 0.0,
    'pos': 16230491.0},
 {'PI_index': 255.0,
    'contig': 2.0,
    'hapX_My_Sum': 12.0,
    'hapX_Sp_Sum': 0.0,
    'hapY_My_Sum': 0.0,
    'pos': 16230503.0},
 {'PI_index': 255.0,
   'contig': 2.0,
   'hapX_My_Sum': 11.0,
   'hapX_Sp_Sum': 0.0,
   'hapY_My_Sum': 1.0,
   'pos': 16232072.0},
 {'PI_index': 255.0,
   'contig': 2.0,
   'hapX_My_Sum': 11.0,
   'hapX_Sp_Sum': 0.0,
   'hapY_My_Sum': 1.0,
   'pos': 16232072.0}]
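
If the per-column lists from the question are needed rather than a list of row records, the records can be regrouped by column. A rough sketch, assuming a structure like by_PI_dict above (the shortened records and the name columns_by_PI are made up for illustration):

```python
from collections import defaultdict
from functools import reduce
from operator import mul

# Hypothetical, shortened stand-in for by_PI_dict: key -> list of row records.
by_PI_dict = {
    255: [
        {'PI_index': 255.0, 'contig': 2.0, 'hapX_My_Sum': 12.0},
        {'PI_index': 255.0, 'contig': 2.0, 'hapX_My_Sum': 12.0},
        {'PI_index': 255.0, 'contig': 2.0, 'hapX_My_Sum': 11.0},
    ],
}

# Turn each list of records into a dict of column -> list of values.
columns_by_PI = defaultdict(dict)
for key, records in by_PI_dict.items():
    for record in records:
        for column, value in record.items():
            columns_by_PI[key].setdefault(column, []).append(value)

# Downstream use as in the question: product of one column's values per key.
hapX_My_product = reduce(mul, columns_by_PI[255]['hapX_My_Sum'])
print(hapX_My_product)
```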
