How can I convert a dict of arrays into a 'flattened' dataframe?
Suppose I have a dict of arrays, for example:
favourite_icecreams = {
    'Josh': ['vanilla', 'banana'],
    'Greg': ['chocolate'],
    'Sarah': ['mint', 'vanilla', 'mango']
}
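I want to convert it into a pandas dataframe with the columns "Flavour" and "Person". It should look like this:
Flavour | Person |
---|---|
vanilla | Josh |
banana | Josh |
chocolate | Greg |
mint | Sarah |
vanilla | Sarah |
mango | Sarah |
What is the most efficient way to do this?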
You can use a (generator) comprehension and then feed it to pd.DataFrame:
import pandas as pd
favourite_icecreams = {
    'Josh': ['vanilla', 'banana'],
    'Greg': ['chocolate'],
    'Sarah': ['mint', 'vanilla', 'mango']
}
data = ((flavour, person)
        for person, flavours in favourite_icecreams.items()
        for flavour in flavours)
df = pd.DataFrame(data, columns=('Flavour', 'Person'))
print(df)
# Flavour Person
# 0 vanilla Josh
# 1 banana Josh
# 2 chocolate Greg
# 3 mint Sarah
# 4 vanilla Sarah
# 5 mango Sarah
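You can do this entirely in pandas using DataFrame.from_dict and df.stack. Note that drop takes columns= here, since the positional axis argument was removed in pandas 2.0, and the dropna() call guards against the padding values that from_dict introduces for the shorter lists, in case stack keeps them:
In [453]: df = pd.DataFrame.from_dict(favourite_icecreams, orient='index').stack().dropna().reset_index().drop(columns='level_1')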
In [455]: df.columns = ['Person', 'Flavour']
In [456]: df
Out[456]:
Person Flavour
0 Josh vanilla
1 Josh banana
2 Greg chocolate
3 Sarah mint
4 Sarah vanilla
5 Sarah mango
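To see why the stack step is needed, it can help to look at the intermediate wide frame. The following is a minimal sketch, not part of the original answer: the names `wide`, `long_df` and the temporary `position` level are chosen here for illustration. It names the index levels up front so that the columns do not have to be renamed afterwards.

```python
import pandas as pd

favourite_icecreams = {
    'Josh': ['vanilla', 'banana'],
    'Greg': ['chocolate'],
    'Sarah': ['mint', 'vanilla', 'mango'],
}

# Wide intermediate: one row per person, one column per list position.
# Shorter lists are padded with missing values.
wide = pd.DataFrame.from_dict(favourite_icecreams, orient='index')
print(wide.shape)  # (3, 3)

# Stack the columns into rows, discard the padding, and name the levels.
long_df = (
    wide.stack()
    .dropna()
    .rename_axis(['Person', 'position'])  # 'position' is a throwaway name
    .reset_index(name='Flavour')
    .drop(columns='position')
)
print(long_df)
```

Because the wide frame has as many columns as the longest list, this approach may use noticeably more memory than the others when list lengths vary a lot.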
Another option is to extract the persons and flavours into separate lists, use numpy repeat on the person list, and finally create the DataFrame:
import numpy as np
import pandas as pd
from itertools import chain
person, flavour = zip(*favourite_icecreams.items())
lengths = list(map(len, flavour))
person = np.array(person).repeat(lengths)
flavour = chain.from_iterable(flavour)
pd.DataFrame({'person':person, 'flavour':flavour})
person flavour
0 Josh vanilla
1 Josh banana
2 Greg chocolate
3 Sarah mint
4 Sarah vanilla
5 Sarah mango
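The key step above is the repeat call, which duplicates each person once per flavour so that both columns end up the same length. A small standalone sketch of just that step (the variable names are illustrative):

```python
import numpy as np

people = ['Josh', 'Greg', 'Sarah']
counts = [2, 1, 3]  # number of flavours per person, in the same order

# Each element of `people` is repeated by the matching entry in `counts`.
repeated = np.repeat(people, counts)
print(repeated.tolist())
# ['Josh', 'Josh', 'Greg', 'Sarah', 'Sarah', 'Sarah']
```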
Another solution, using .explode():
df = pd.DataFrame(
    {
        "Person": favourite_icecreams.keys(),
        "Flavour": favourite_icecreams.values(),
    }
).explode("Flavour")
print(df)
Prints:
Person Flavour
0 Josh vanilla
0 Josh banana
1 Greg chocolate
2 Sarah mint
2 Sarah vanilla
2 Sarah mango
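Note that the index in this output repeats (0, 0, 1, 2, 2, 2), because explode keeps the original row label for every exploded element. If a fresh 0..n-1 index is wanted, one possible variant, assuming pandas 1.1 or later where explode accepts ignore_index, is sketched below. Wrapping the keys and values in list() is a defensive choice made here, not something the original answer requires.

```python
import pandas as pd

favourite_icecreams = {
    'Josh': ['vanilla', 'banana'],
    'Greg': ['chocolate'],
    'Sarah': ['mint', 'vanilla', 'mango'],
}

df = pd.DataFrame(
    {
        "Person": list(favourite_icecreams.keys()),
        "Flavour": list(favourite_icecreams.values()),
    }
).explode("Flavour", ignore_index=True)  # renumber rows 0..n-1

print(df)
```

On older pandas versions, calling .reset_index(drop=True) after .explode("Flavour") should give the same result.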