Generate multiple new pandas dataframes using lists and for loops
I have the following dataframe:
import pandas as pd
import numpy as np
from numpy import rec, nan

# Python 2 long-integer suffixes (e.g. 202001L) removed so this runs on Python 3
df1 = pd.DataFrame.from_records(rec.array([
    (202001, 2020, 'apples', 'CA', 100),
    (202002, 2020, 'apples', 'CA', 150),
    (202001, 2020, 'apples', 'USA', 400),
    (202002, 2020, 'apples', 'USA', 675),
    (202001, 2020, 'oranges', 'CA', 50),
    (202002, 2020, 'oranges', 'CA', 65),
    (202001, 2020, 'oranges', 'USA', 175),
    (202002, 2020, 'oranges', 'USA', 390)],
    dtype=[('yyyymm', '<i8'), ('year', '<i8'), ('prod', 'O'), ('country', 'O'), ('rev', '<i8')]))
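I need to:
a) Take df1 and split it by country, using the country names in df1 to create df2_CA and df2_USA.
b) Generate two more dataframes showing total annual sales by product (the example uses 2020, with only two months), using the suffix '_annual', so that we get df2_CA_annual and df2_USA_annual.
Final result:
Question: my actual use case has a dozen or so cuts, and I'd like to keep the code compact. I figured I could save time by using lists and for loops to create the final dataframes I want. How can I fix the code below?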
# 1. Get dfs by country: CA and USA...df2_(country):
for x in df1['country'].unique():
    locals()['df2_' + x] = df1[(df1['country'] == x)]

# 2. Take dfs from step 1, calculate total revenue by year. Create frames df2_(country)_annual:
mylist = [df2_CA, df2_USA]
for x in mylist['country'].unique():
    locals()['df2_' + x + '_annual'] = mylist[(mylist['country'] == x)]
    x = x.groupby(['year', 'prod', 'country']).sum()[["rev"]]
If all you need is a list of dataframes, then the following may help:
import pandas as pd
import numpy as np
from numpy import rec, nan

df1 = pd.DataFrame.from_records(rec.array([
    (202001, 2020, 'apples', 'CA', 100),
    (202002, 2020, 'apples', 'CA', 150),
    (202001, 2020, 'apples', 'USA', 400),
    (202002, 2020, 'apples', 'USA', 675),
    (202001, 2020, 'oranges', 'CA', 50),
    (202002, 2020, 'oranges', 'CA', 65),
    (202001, 2020, 'oranges', 'USA', 175),
    (202002, 2020, 'oranges', 'USA', 390)],
    dtype=[('yyyymm', '<i8'), ('year', '<i8'), ('prod', 'O'), ('country', 'O'), ('rev', '<i8')]))

final_df_list = list()
for col in df1.country.unique():
    # Boolean indexing keeps only this country's rows; df1.where() would instead
    # fill the other rows with NaN and turn the integer columns into floats.
    country_df = df1[df1.country == col]
    # Total revenue per year and product for this country
    final_df_list.append(country_df.groupby(['year', 'prod', 'country'])[["rev"]].sum())
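The same list can also be built without looping over unique(), since iterating over a groupby object yields one (key, sub-frame) pair per country. The following is only a minimal sketch of that idea; the name annual_list is made up for illustration, and df1 is rebuilt from a plain dictionary so that the snippet runs on its own.

```python
import pandas as pd

# Same data as df1 in the question, rebuilt from a dict for brevity
df1 = pd.DataFrame({
    "yyyymm": [202001, 202002, 202001, 202002, 202001, 202002, 202001, 202002],
    "year": [2020] * 8,
    "prod": ["apples"] * 4 + ["oranges"] * 4,
    "country": ["CA", "CA", "USA", "USA"] * 2,
    "rev": [100, 150, 400, 675, 50, 65, 175, 390],
})

# Iterating over a groupby yields (country, sub-frame) pairs, sorted by country.
# annual_list is a hypothetical name, not something from the original answer.
annual_list = [
    sub.groupby(["year", "prod", "country"])[["rev"]].sum()
    for _, sub in df1.groupby("country")
]
```

Because groupby sorts its keys by default, annual_list[0] holds the CA totals and annual_list[1] the USA totals.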
An alternative to @VMSMani's answer is to use a dictionary:
df_annual = {}
for c in df1['country'].unique():
    # Boolean indexing selects this country's rows without introducing NaN rows
    df_annual[c] = df1[df1['country'] == c].groupby(['year', 'prod', 'country'])[['rev']].sum()
The "only" difference is that all the dataframes are stored by key, so you can retrieve them later with df_annual[country], which I think keeps things tidier.
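If the names from the question (df2_CA, df2_CA_annual, and so on) matter, they can serve as dictionary keys instead of being written into locals(). What follows is a rough sketch under that assumption; the dictionary name frames is invented here, and df1 is rebuilt from a plain dictionary so the snippet is self-contained.

```python
import pandas as pd

# Same data as df1 in the question, rebuilt from a dict for brevity
df1 = pd.DataFrame({
    "yyyymm": [202001, 202002, 202001, 202002, 202001, 202002, 202001, 202002],
    "year": [2020] * 8,
    "prod": ["apples"] * 4 + ["oranges"] * 4,
    "country": ["CA", "CA", "USA", "USA"] * 2,
    "rev": [100, 150, 400, 675, 50, 65, 175, 390],
})

# frames is a hypothetical name; its keys mirror the variable names the asker wanted.
frames = {}
for country, sub in df1.groupby("country"):
    # Step 1: the per-country slice of df1
    frames[f"df2_{country}"] = sub
    # Step 2: total revenue per year and product for that country
    frames[f"df2_{country}_annual"] = (
        sub.groupby(["year", "prod"], as_index=False)["rev"].sum()
    )
```

Each frame is then available as, for example, frames["df2_USA_annual"], and adding another cut only means adding another key inside the loop.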