pandas: create a long/tidy DataFrame from dictionary when values are sets or lists of variable length

A simple dictionary:

d = {'a': set([1,2,3]), 'b': set([3, 4])}

(The sets could be turned into lists if that matters.)

How can this be converted into a long/tidy DataFrame where each column is a variable and each row is an observation, i.e.:

  letter  value
0      a      1
1      a      2
2      a      3
3      b      3
4      b      4

The following works, but it is a bit cumbersome:

import pandas as pd

i = 0  # running row index ('id' would shadow the builtin)
tidy_d = {}
for l, vs in d.items():
    for v in vs:
        tidy_d[i] = {'letter': l, 'value': v}
        i += 1
pd.DataFrame.from_dict(tidy_d, orient='index')
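(A somewhat tidier loop-based spelling, for reference, since pd.DataFrame accepts a list of record dicts directly; records is just an illustrative name:

records = [{'letter': l, 'value': v} for l, vs in d.items() for v in vs]
df = pd.DataFrame(records)

Still a double loop, just with less bookkeeping.)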

Is there some pandas magic to do this? Something like:

pd.DataFrame([d]).T.reset_index(level=0).unnest()

where unnest obviously doesn't exist and comes from R.
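For what it's worth, pandas has since grown an equivalent: Series.explode / DataFrame.explode, added in 0.25, which behaves much like R's unnest. A minimal sketch with the dictionary from above:

import pandas as pd

d = {'a': {1, 2, 3}, 'b': {3, 4}}

# explode turns each element of a list-like into its own row,
# repeating the index entry (here, the dict key) for each element
df = pd.Series(d).explode().rename_axis('letter').reset_index(name='value')

Note the value column comes back as object dtype; append .astype({'value': int}) if an integer column is needed.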

Use numpy.repeat with chain.from_iterable:

from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({
    # repeat each key once per element in its value
    'letter': np.repeat(list(d.keys()), [len(v) for v in d.values()]),
    # flatten all the values into a single sequence
    'value': list(chain.from_iterable(d.values())),
})
print(df)
  letter  value
0      a      1
1      a      2
2      a      3
3      b      3
4      b      4

You can use a comprehension with itertools.chain and zip:

from itertools import chain

import pandas as pd

# k * len(v) repeats the single-character key once per element of v
keys, values = map(chain.from_iterable, zip(*((k * len(v), v) for k, v in d.items())))

df = pd.DataFrame({'letter': list(keys), 'value': list(values)})

print(df)

  letter  value
0      a      1
1      a      2
2      a      3
3      b      3
4      b      4

This can be rewritten in a more readable fashion:

zipper = zip(*((k * len(v), v) for k, v in d.items()))
keys, values = map(list, map(chain.from_iterable, zipper))

df = pd.DataFrame({'letter': keys, 'value': values})
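Note that k * len(v) only works because the keys here are single-character strings; for arbitrary keys, repeat the key inside a list instead (this is also what the timing code further down does):

keys, values = map(chain.from_iterable,
                   zip(*(([k] * len(v), v) for k, v in d.items())))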

Yet another one:

from collections import defaultdict

import pandas as pd

e = defaultdict(list)
for key, val in d.items():
    e["letter"] += [key] * len(val)  # repeat the key once per element
    e["value"] += list(val)
df = pd.DataFrame(e)

Something more "pandaic", inspired by this post:

pd.DataFrame.from_dict(d, orient='index') \
  .rename_axis('letter').reset_index() \
  .melt(id_vars=['letter'], value_name='value') \
  .drop('variable', axis=1) \
  .dropna()
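One caveat: from_dict(d, orient='index') pads the shorter rows with NaN, so value comes out as float even after dropna. A minimal tweak, assuming integer values are wanted:

pd.DataFrame.from_dict(d, orient='index') \
  .rename_axis('letter').reset_index() \
  .melt(id_vars=['letter'], value_name='value') \
  .drop('variable', axis=1) \
  .dropna() \
  .astype({'value': int})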

And some timings of the (slightly modified) answers:

import random
import timeit
from itertools import chain
import pandas as pd
print(pd.__version__)

dict_size = 1000000
randoms = [random.randint(0, 100) for __ in range(10000)]
max_list_size = 1000
d = {k: random.sample(randoms, random.randint(1, max_list_size)) for k in
     range(dict_size)}

def chain_():
    keys, values = map(chain.from_iterable,
                       zip(*(([k] * len(v), v) for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})

def melt_():
    (pd.DataFrame.from_dict(d, orient='index')
        .rename_axis('letter').reset_index()
        .melt(id_vars=['letter'], value_name='value')
        .drop('variable', axis=1).dropna())

setup = "from __main__ import chain_, melt_"
repeat = 3
numbers = 10
def timer(statement, _setup=''):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))

print('timing')
timer('chain_()')
timer('melt_()')

It seems melt is faster for max_list_size 100:

1.0.3
timing
246.71311019999996
204.33705529999997

and slower for max_list_size 1000:

2675.8446872
4565.838648400002

probably because of allocating memory for a DataFrame that is much bigger than actually needed.
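The overhead is easy to see on the intermediate frame itself (a quick sketch; wide is just an illustrative name):

wide = pd.DataFrame.from_dict(d, orient='index')
print(wide.shape)                 # (dict_size, max_list_size)
print(wide.isna().mean().mean())  # fraction of cells that are pure NaN padding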

A variation of the chain answer:

import itertools

def chain_2():
    keys, values = map(chain.from_iterable,
                       zip(*((itertools.repeat(k, len(v)), v) for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})

which doesn't seem to be any faster.

(Python 3.7.6)