pandas: create a long/tidy DataFrame from dictionary when values are sets or lists of variable length
A simple dictionary:
d = {'a': set([1,2,3]), 'b': set([3, 4])}
(the sets can be turned into lists if it matters)
How can this be converted into a long/tidy DataFrame in which each column is a variable and every observation is a row, i.e.:
  letter  value
0      a      1
1      a      2
2      a      3
3      b      3
4      b      4
The following works, but it is a bit cumbersome:
id = 0
tidy_d = {}
for l, vs in d.items():
    for v in vs:
        tidy_d[id] = {'letter': l, 'value': v}
        id += 1
pd.DataFrame.from_dict(tidy_d, orient='index')
Is there any pandas magic that does this? Something like:
pd.DataFrame([d]).T.reset_index(level=0).unnest()
where unnest obviously doesn't exist and comes from R.
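A note beyond the original question: pandas did eventually grow an `unnest` equivalent. Since pandas 0.25, `Series.explode` (and `DataFrame.explode`) turns each list-like element into its own row, which gives exactly the tidy shape asked for. A minimal sketch (sets are converted to sorted lists first, since their iteration order is arbitrary):

```python
import pandas as pd

d = {'a': {1, 2, 3}, 'b': {3, 4}}

# explode() emits one row per element of each list-like value;
# the original dict keys become the index, which we name 'letter'.
s = pd.Series({k: sorted(v) for k, v in d.items()}, name='value')
df = s.explode().rename_axis('letter').reset_index()
print(df)
```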
Use numpy.repeat with chain.from_iterable:
from itertools import chain

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'letter': np.repeat(list(d.keys()), [len(v) for v in d.values()]),
    'value': list(chain.from_iterable(d.values())),
})
print(df)
  letter  value
0      a      1
1      a      2
2      a      3
3      b      3
4      b      4
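The heavy lifting here is done by np.repeat's second argument, which accepts a per-element repeat count, so each key is repeated once per value in its set:

```python
import numpy as np

# Repeat 'a' three times and 'b' twice, matching the set sizes.
print(np.repeat(['a', 'b'], [3, 2]))  # → ['a' 'a' 'a' 'b' 'b']
```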
You can use itertools.chain and zip with a comprehension:
from itertools import chain
keys, values = map(chain.from_iterable, zip(*((k*len(v), v) for k, v in d.items())))
df = pd.DataFrame({'letter': list(keys), 'value': list(values)})
print(df)
  letter  value
0      a      1
1      a      2
2      a      3
3      b      3
4      b      4
This can be rewritten in a more readable fashion:
zipper = zip(*((k * len(v), v) for k, v in d.items()))
letters, values = map(list, map(chain.from_iterable, zipper))
df = pd.DataFrame({'letter': letters, 'value': values})
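To make the transpose step concrete, here is what the generator and zip(*...) produce on the toy dict (using lists instead of sets so the order is deterministic):

```python
from itertools import chain

d = {'a': [1, 2, 3], 'b': [3, 4]}

# Each pair is (the string key repeated len(v) times, the value list).
pairs = [(k * len(v), v) for k, v in d.items()]
print(pairs)            # [('aaa', [1, 2, 3]), ('bb', [3, 4])]

# zip(*pairs) transposes: all key strings together, all value lists together.
cols = list(zip(*pairs))
print(cols)             # [('aaa', 'bb'), ([1, 2, 3], [3, 4])]

# Chaining each side flattens it into one tidy column.
letters = list(chain.from_iterable(cols[0]))
values = list(chain.from_iterable(cols[1]))
print(letters, values)  # ['a', 'a', 'a', 'b', 'b'] [1, 2, 3, 3, 4]
```

Note that `k * len(v)` relies on the keys being strings; for non-string keys, `[k] * len(v)` or `itertools.repeat(k, len(v))` works instead.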
Yet another option, with collections.defaultdict:
from collections import defaultdict

e = defaultdict(list)
for key, val in d.items():
    e['letter'] += [key] * len(val)
    e['value'] += list(val)
df = pd.DataFrame(e)
More "pandaic", inspired by this post:
pd.DataFrame.from_dict(d, orient='index') \
    .rename_axis('letter').reset_index() \
    .melt(id_vars=['letter'], value_name='value') \
    .drop('variable', axis=1) \
    .dropna()
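To see why this route can allocate far more memory than needed (which is what the timings further down suggest), look at the intermediate frame that from_dict builds: it is as wide as the longest value list, NaN-padded for shorter rows, before melt reshapes it and dropna discards the padding. A small sketch:

```python
import pandas as pd

d = {'a': [1, 2, 3], 'b': [3, 4]}

# orient='index' makes one column per position; shorter rows are
# padded with NaN, so the frame is len(longest value) columns wide.
wide = pd.DataFrame.from_dict(d, orient='index')
print(wide)
```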
And some timings of the (slightly modified) answers:
import random
import timeit
from itertools import chain

import pandas as pd

print(pd.__version__)

dict_size = 1000000
randoms = [random.randint(0, 100) for __ in range(10000)]
max_list_size = 1000
d = {k: random.sample(randoms, random.randint(1, max_list_size))
     for k in range(dict_size)}

def chain_():
    keys, values = map(chain.from_iterable,
                       zip(*(([k] * len(v), v) for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})

def melt_():
    pd.DataFrame.from_dict(d, orient='index') \
        .rename_axis('letter').reset_index() \
        .melt(id_vars=['letter'], value_name='value') \
        .drop('variable', axis=1).dropna()

setup = """from __main__ import chain_, melt_"""
repeat = 3
numbers = 10

def timer(statement, _setup=''):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))

print('timing')
timer('chain_()')
timer('melt_()')
With max_list_size = 100, melt_ seems to be faster:
1.0.3
timing
246.71311019999996
204.33705529999997
and slower with max_list_size = 1000:
2675.8446872
4565.838648400002
probably because of allocating memory for a df much bigger than needed.
A variant of the chain answer:
import itertools

def chain_2():
    keys, values = map(chain.from_iterable,
                       zip(*((itertools.repeat(k, len(v)), v)
                             for k, v in d.items())))
    pd.DataFrame({'letter': list(keys), 'value': list(values)})

does not seem to be any faster.
(Python 3.7.6)