Using Pandas to merge 2 lists of dicts with common elements
So I have 2 lists of dicts:
list_yearly = [
    {'name':'john',
     'total_year': 107
    },
    {'name':'cathy',
     'total_year':124
    },
]
list_monthly = [
    {'name':'john',
     'month':'Jan',
     'total_month': 34
    },
    {'name':'cathy',
     'month':'Jan',
     'total_month':78
    },
    {'name':'john',
     'month':'Feb',
     'total_month': 73
    },
    {'name':'cathy',
     'month':'Feb',
     'total_month':46
    },
]
The goal is to get a final dataset that looks like this:
{'name':'john',
 'total_year': 107,
 'trend':[{'month':'Jan', 'total_month': 34},{'month':'Feb', 'total_month': 73}]
},
{'name':'cathy',
 'total_year':124,
 'trend':[{'month':'Jan', 'total_month': 78},{'month':'Feb', 'total_month': 46}]
},
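Since my dataset covers a large number of students for all 12 months of a particular year, I am using Pandas for the data munging. This is how I went about it:
First, combine both lists into a single DataFrame using the name key.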
In [5]: df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly))
In [6]: df
Out[6]:
    name  total_year month  total_month
0   john         107   Jan           34
1   john         107   Feb           73
2  cathy         124   Jan           78
3  cathy         124   Feb           46
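As a side note on the merge step: with no `on=` argument, `merge` joins on every column the two frames have in common, which here happens to be only `name`. A minimal sketch that spells the key out and asks pandas to check that each name appears once on the yearly side (the `validate` check is my addition, not something the question relies on):

```python
import pandas as pd

list_yearly = [
    {'name': 'john', 'total_year': 107},
    {'name': 'cathy', 'total_year': 124},
]
list_monthly = [
    {'name': 'john', 'month': 'Jan', 'total_month': 34},
    {'name': 'cathy', 'month': 'Jan', 'total_month': 78},
    {'name': 'john', 'month': 'Feb', 'total_month': 73},
    {'name': 'cathy', 'month': 'Feb', 'total_month': 46},
]

# Join explicitly on 'name'. validate='one_to_many' raises a MergeError
# if a name is repeated in the yearly frame (an added safety check).
df = pd.DataFrame(list_yearly).merge(
    pd.DataFrame(list_monthly),
    on='name',
    how='inner',
    validate='one_to_many',
)
print(df)
```

If another column ever ends up shared between the two lists, the explicit `on='name'` keeps the join from silently changing.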
Then create the trend column, holding a one-element list with a dict for each row:
In [7]: df['trend'] = df.apply(lambda x: [x[['month', 'total_month']].to_dict()], axis=1)
In [8]: df
Out[8]:
    name  total_year month  total_month  \
0   john         107   Jan           34
1   john         107   Feb           73
2  cathy         124   Jan           78
3  cathy         124   Feb           46

                                      trend
0  [{u'total_month': 34, u'month': u'Jan'}]
1  [{u'total_month': 73, u'month': u'Feb'}]
2  [{u'total_month': 78, u'month': u'Jan'}]
3  [{u'total_month': 46, u'month': u'Feb'}]
Then use the to_dict(orient='records') method on the selected columns to convert it back to a list of dicts:
In [9]: df[['name', 'total_year', 'trend']].to_dict(orient='records')
Out[9]:
[{'name': 'john',
  'total_year': 107,
  'trend': [{'month': 'Jan', 'total_month': 34}]},
 {'name': 'john',
  'total_year': 107,
  'trend': [{'month': 'Feb', 'total_month': 73}]},
 {'name': 'cathy',
  'total_year': 124,
  'trend': [{'month': 'Jan', 'total_month': 78}]},
 {'name': 'cathy',
  'total_year': 124,
  'trend': [{'month': 'Feb', 'total_month': 46}]}]
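As is evident, the final dataset is not quite what I want. Instead of 2 dicts that each contain both months, I get 4 dicts with the months kept separate. How can I fix this? I would prefer to fix it within Pandas itself rather than taking this final output and reducing it again to the desired state.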
You should actually be using groupby on name and total_year instead of apply (as the second step), and inside the groupby you can build the list you want. Example -
import pandas as pd

df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly))

def func(group):
    # Build one {'month', 'total_month'} dict per row of the group
    result = []
    for idx, row in group.iterrows():
        result.append({'month':row['month'],'total_month':row['total_month']})
    return result

result = df.groupby(['name','total_year']).apply(func).reset_index()
result.columns = ['name','total_year','trend']
result_dict = result.to_dict(orient='records')
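A possibly shorter variant of the same idea, offered as a sketch rather than a replacement: select only the two monthly columns before `apply`, let `to_dict(orient='records')` build the per-group list instead of looping with `iterrows`, and name the column directly with `reset_index(name='trend')`. The input data is repeated so the snippet runs on its own.

```python
import pandas as pd

list_yearly = [
    {'name': 'john', 'total_year': 107},
    {'name': 'cathy', 'total_year': 124},
]
list_monthly = [
    {'name': 'john', 'month': 'Jan', 'total_month': 34},
    {'name': 'cathy', 'month': 'Jan', 'total_month': 78},
    {'name': 'john', 'month': 'Feb', 'total_month': 73},
    {'name': 'cathy', 'month': 'Feb', 'total_month': 46},
]

df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly), on='name')

# For each (name, total_year) group, turn its month/total_month rows
# into a list of dicts; the result is a Series indexed by the group keys.
result = (
    df.groupby(['name', 'total_year'])[['month', 'total_month']]
      .apply(lambda g: g.to_dict(orient='records'))
      .reset_index(name='trend')
)
result_dict = result.to_dict(orient='records')
print(result_dict)
```

Selecting the columns before `apply` also means the grouping keys are not passed into the lambda, so they cannot leak into the inner dicts.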
Demo of the func version -
In [9]: df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly))
In [10]: df
Out[10]:
    name  total_year month  total_month
0   john         107   Jan           34
1   john         107   Feb           73
2  cathy         124   Jan           78
3  cathy         124   Feb           46
In [13]: def func(group):
   ....:     result = []
   ....:     for idx, row in group.iterrows():
   ....:         result.append({'month':row['month'],'total_month':row['total_month']})
   ....:     return result
   ....:
In [14]:
In [14]: result = df.groupby(['name','total_year']).apply(func).reset_index()
In [15]: result
Out[15]:
    name  total_year                                                  0
0  cathy         124  [{'month': 'Jan', 'total_month': 78}, {'month'...
1   john         107  [{'month': 'Jan', 'total_month': 34}, {'month'...
In [19]: result.columns = ['name','total_year','trend']
In [20]: result
Out[20]:
    name  total_year                                              trend
0  cathy         124  [{'month': 'Jan', 'total_month': 78}, {'month'...
1   john         107  [{'month': 'Jan', 'total_month': 34}, {'month'...
In [21]: result.to_dict(orient='records')
Out[21]:
[{'name': 'cathy',
  'total_year': 124,
  'trend': [{'month': 'Jan', 'total_month': 78},
            {'month': 'Feb', 'total_month': 46}]},
 {'name': 'john',
  'total_year': 107,
  'trend': [{'month': 'Jan', 'total_month': 34},
            {'month': 'Feb', 'total_month': 73}]}]
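Note that the demo output lists cathy before john, because groupby sorts the group keys by default, while the desired output in the question has john first. If the original order matters, passing `sort=False` to groupby should keep groups in order of first appearance. A small sketch with the same `func`, assuming the data from the question:

```python
import pandas as pd

list_yearly = [
    {'name': 'john', 'total_year': 107},
    {'name': 'cathy', 'total_year': 124},
]
list_monthly = [
    {'name': 'john', 'month': 'Jan', 'total_month': 34},
    {'name': 'cathy', 'month': 'Jan', 'total_month': 78},
    {'name': 'john', 'month': 'Feb', 'total_month': 73},
    {'name': 'cathy', 'month': 'Feb', 'total_month': 46},
]

df = pd.DataFrame(list_yearly).merge(pd.DataFrame(list_monthly), on='name')

def func(group):
    # One {'month', 'total_month'} dict per row of the group
    return [
        {'month': row['month'], 'total_month': row['total_month']}
        for _, row in group.iterrows()
    ]

# sort=False keeps the groups in the order their keys first appear in df
result = (
    df.groupby(['name', 'total_year'], sort=False)[['month', 'total_month']]
      .apply(func)
      .reset_index(name='trend')
)
print(result.to_dict(orient='records'))
```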
Within pandas, try:
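df1 = pd.DataFrame(list_yearly)
df2 = pd.DataFrame(list_monthly)
# Collect each name's rows as a list of dicts, then join onto the yearly totals.
# list(...) is needed on Python 3, where dict.values() returns a view.
df = df1.set_index('name').join(pd.DataFrame(df2.groupby('name').apply(
    lambda gp: list(gp.transpose().to_dict().values()))))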
Update: to drop name from the inner dicts and convert to a list of dicts:
df1 = pd.DataFrame(list_yearly)
df2 = pd.DataFrame(list_monthly)
keep_columns = [c for c in df2.columns if c != 'name']
# within pandas
df = (df1.set_index('name')
         .join(pd.DataFrame(df2.groupby('name').apply(
             lambda gp: list(gp[keep_columns].transpose().to_dict().values()))))
         .reset_index())
data = [row.to_dict() for _, row in df.iterrows()]
Then rename the 0 column to 'trend'.
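The rename step could look roughly like the sketch below. Two things here are my own substitutions rather than the answer's code: `to_dict(orient='records')` stands in for the `transpose().to_dict().values()` chain, and `to_frame()` stands in for wrapping the Series in `pd.DataFrame(...)`. Also note the column label is the integer 0, not the string '0'.

```python
import pandas as pd

list_yearly = [
    {'name': 'john', 'total_year': 107},
    {'name': 'cathy', 'total_year': 124},
]
list_monthly = [
    {'name': 'john', 'month': 'Jan', 'total_month': 34},
    {'name': 'cathy', 'month': 'Jan', 'total_month': 78},
    {'name': 'john', 'month': 'Feb', 'total_month': 73},
    {'name': 'cathy', 'month': 'Feb', 'total_month': 46},
]

df1 = pd.DataFrame(list_yearly)
df2 = pd.DataFrame(list_monthly)
keep_columns = [c for c in df2.columns if c != 'name']

# One list of {'month', 'total_month'} dicts per name (an unnamed Series)
per_name = df2.groupby('name')[keep_columns].apply(
    lambda gp: gp.to_dict(orient='records'))

df = (
    df1.set_index('name')
       .join(per_name.to_frame())       # unnamed Series becomes column 0
       .rename(columns={0: 'trend'})    # integer label 0, not the string '0'
       .reset_index()
)
data = df.to_dict(orient='records')
print(data)
```

Because the join is driven by the yearly frame's index, the records should come out in the order of list_yearly (john, then cathy).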