How do I reindex a MultiIndex with additional rows for only one index level?
I have the following DataFrame:
                         volume
month      source brand
2020-01-01 SA     BA          5
2020-02-01 SA     BA         10
2020-02-01 SA     BB          5
2020-01-01 SB     BC          5
2020-02-01 SB     BC         10
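I want to create a DataFrame/MultiIndex that contains a row for every date that appears in any record. I want to use fill_value=0 for the volume column.
However, I don't want to add any other combinations of index levels. For example, I don't want to add a row for an index containing Source SA and Brand BC, because that combination of the two columns never occurs in the data.
                         volume
month      source brand
2020-01-01 SA     BA          5
2020-02-01 SA     BA         10
2020-01-01 SA     BB          0   # Row to be added.
2020-02-01 SA     BB          5
2020-01-01 SB     BC          5
2020-02-01 SB     BC         10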
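I have done this without indexes, using window operations, but it is very slow (this df is quite large).
I tried to do it with the following approach, set up like this: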
df_dates = df.groupby(['month']).sum()  # df is the df with just a range index.
idx = df_b.index  # df_b is the existing df with MultiIndex and missing rows.
ilen = len(idx.levels)
# Take every index level except the last one, then append the dates.
new_index_cols = [idx.levels[i] for i in range(ilen - 1)]
new_index_cols.append(df_dates.index)
# index_columns_b is the list of index level names (not shown here).
new_index = pd.MultiIndex.from_product(
    new_index_cols,
    names=index_columns_b
)
df_b.reindex(new_index, fill_value=0)
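But I realized that MultiIndex.from_product produces every single combination of all the index columns, which is not what I want, and it also needs far more memory.
As I see it, I could manipulate the full index columns myself using index.get_level_values(i) and MultiIndex.from_arrays, but I was hoping to find a simpler process than that.
The process has to be generic, because I need to apply it to DataFrames that have different index columns but all share the same volume column and the month column in the index.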
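To put a number on the from_product concern raised above, here is a small sketch on the five sample rows. The variable names are chosen for illustration only; it compares the size of the full product of the index levels with the number of rows actually wanted (observed source/brand pairs times months).

```python
import pandas as pd

# The index of the sample frame from the question.
idx = pd.MultiIndex.from_tuples(
    [
        ("2020-01-01", "SA", "BA"),
        ("2020-02-01", "SA", "BA"),
        ("2020-02-01", "SA", "BB"),
        ("2020-01-01", "SB", "BC"),
        ("2020-02-01", "SB", "BC"),
    ],
    names=["month", "source", "brand"],
)

# Full product of all levels: 2 months x 2 sources x 3 brands.
product = pd.MultiIndex.from_product(idx.levels, names=idx.names)

# Only the (source, brand) pairs that occur in the data, times the months.
pairs = idx.droplevel("month").unique()
wanted = len(pairs) * len(idx.levels[0])

print(len(product), wanted)
```

Even on this tiny sample the product is twice as large as needed, and the gap grows with every additional source or brand.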
Given the DataFrame:
                         volume
month      source brand
2020-01-01 A      A         5.0
2020-02-01 A      A        10.0
2020-02-01 A      B         5.0
I believe you need one row for every unique combination of month, source and brand?
Does this help?
months = df.index.unique(level=0)
source = df.index.unique(level=1)
brands = df.index.unique(level=2)
df2 = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [months, source, brands])
).rename_axis(['month','source','brand'])
df2.merge(df, left_index=True, right_index=True, how='left').fillna(0)
This yields:
                         volume
month      source brand
2020-01-01 A      A         5.0
2020-01-01 A      B         0.0
2020-02-01 A      A        10.0
2020-02-01 A      B         5.0
Update 1
No, sorry, I was not clear; I will update the original question as well. I only want the dates filled in, not all combinations of source/brand, because otherwise I'd have billions of rows (hence my current error messages when running this). So if there is no Brand A in Source B, I only want to see one row per date for Brand A/Source A.
Try this change instead:
months = df.index.unique(level=0)
sourcebrand = df.groupby(level=['source','brand']).size().index
tuples = [(m,) + sb for sb in sourcebrand for m in months]
df2 = pd.DataFrame(index = pd.MultiIndex.from_tuples(tuples, names=['month','source','brand']))
df2.merge(df, left_index=True, right_index=True, how= 'left').fillna(0)
This yields:
                         volume
month      source brand
2020-01-01 SA     BA        5.0
2020-02-01 SA     BA       10.0
2020-01-01 SA     BB        0.0
2020-02-01 SA     BB        5.0
2020-01-01 SB     BC        5.0
2020-02-01 SB     BC       10.0
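The merge followed by fillna(0) turns volume into floats, as the 5.0 values show. A possible variant, sketched here on the assumption that the index is unique, builds the same tuples but calls reindex with fill_value=0 directly, which keeps the integer dtype. The sample frame is rebuilt inline so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the question's frame (index levels: month, source, brand).
df = pd.DataFrame(
    {
        "month": ["2020-01-01", "2020-02-01", "2020-02-01",
                  "2020-01-01", "2020-02-01"],
        "source": ["SA", "SA", "SA", "SB", "SB"],
        "brand": ["BA", "BA", "BB", "BC", "BC"],
        "volume": [5, 10, 5, 5, 10],
    }
).set_index(["month", "source", "brand"])

months = df.index.unique(level="month")
# Only the (source, brand) pairs that actually occur in the data.
pairs = df.index.droplevel("month").unique()

# One tuple per observed pair and month; unobserved pairs are never created.
full_index = pd.MultiIndex.from_tuples(
    [(m,) + pair for pair in pairs for m in months],
    names=df.index.names,
)

# reindex requires a unique index; missing rows get volume 0.
filled = df.reindex(full_index, fill_value=0)
print(filled)
```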
Update 2
When the index is not unique (e.g. two values for month=2020-01-01, source=SB, brand=BC, as in your own answer), you can easily sum them afterwards using:
.groupby(level=[0,1,2]).sum()
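As a small illustration of that summing step, the sketch below uses made-up rows (not taken from the thread) with one duplicated index entry and collapses them with the groupby call above.

```python
import pandas as pd

# Hypothetical frame with a duplicated (month, source, brand) entry.
dup = pd.DataFrame(
    {"volume": [5, 10, 7]},
    index=pd.MultiIndex.from_tuples(
        [
            ("2020-01-01", "SB", "BC"),
            ("2020-01-01", "SB", "BC"),  # duplicate of the row above
            ("2020-02-01", "SA", "BA"),
        ],
        names=["month", "source", "brand"],
    ),
)

# Group on all three index levels and add up the duplicates.
summed = dup.groupby(level=[0, 1, 2]).sum()
print(summed)
```

After this step the index is unique, so it can also be passed to reindex without raising.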
I eventually found my own solution, although it is more complicated than I would like. I wrapped it in a generic function:
import pandas as pd

def index_fill_missing(df, index_cols, fill_col, fill_value=0):
    """
    Finds all the unique values of the column 'fill_col' in df and
    returns a dataframe with an index based on index_cols + fill_col, where
    a new row is added for every index combination whose fill_col value did
    not previously exist in the dataframe.
    The additional values are set to the value of the parameter 'fill_value'.
    Parameters:
        df (pandas.DataFrame): the dataframe
        index_cols (list(str)): the list of column names to use in the index
        fill_col (str): the column name for which all values should appear
            in every single index.
        fill_value (any): the value to fill the metric columns in new rows.
    Returns:
        pandas.DataFrame: DataFrame with MultiIndex and additional rows.
    """
    # Get unique values for the fill_col.
    fill_val_list = df[fill_col].unique().tolist()
    # Create a dataframe with the reduced index and get a list of tuples
    # with the index values.
    df_i = df.set_index(index_cols)
    df_i_tup = df_i.index.unique().tolist()
    # Append the fill_col values to each of these index tuples.
    df_f_tup = []
    col_names = list(index_cols)
    col_names.append(fill_col)
    for tup in df_i_tup:
        for fill_val in fill_val_list:
            df_f_tup.append(tup + (fill_val,))
    # Create an index based on these tuples and reindex the dataframe.
    idx_f = pd.MultiIndex.from_tuples(df_f_tup, names=col_names)
    # We can only reindex if there are no duplicate index values,
    # hence the groupby with the sum function.
    df_g = df.groupby(by=col_names).sum()
    df_f = df_g.reindex(index=idx_f, fill_value=fill_value)
    return df_f
Creating the example DataFrame:
dates = ['2020-01-01', '2020-02-01',
         '2020-02-01',
         '2020-01-01', '2020-01-01']
brands = ['BA','BA','BB','BC','BC']
sources = ['SA', 'SA', 'SA', 'SB', 'SB']
volumes1 = [5, 10, 5, 5, 10]
volumes2 = [5, 10, 5, 5, 10]
df = pd.DataFrame(
    list(zip(dates, brands, sources, volumes1, volumes2)),
    columns=['month', 'brand', 'source', 'volume1', 'volume2']
)
df
Resulting output:
        month brand source  volume1  volume2
0  2020-01-01    BA     SA        5        5
1  2020-02-01    BA     SA       10       10
2  2020-02-01    BB     SA        5        5
3  2020-01-01    BC     SB        5        5
4  2020-01-01    BC     SB       10       10
And applying the function:
df2 = index_fill_missing(df, ['source', 'brand'], 'month')
df2
Resulting output:
                         volume1  volume2
source brand month
SA     BA    2020-01-01        5        5
             2020-02-01       10       10
       BB    2020-01-01        0        0
             2020-02-01        5        5
SB     BC    2020-01-01       15       15
             2020-02-01        0        0
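For comparison, a loop-free variant is sketched below. It is a different technique from the function above, not the author's method: after summing duplicates, the month level is pivoted into columns with unstack (filling the gaps with 0) and then moved back with stack. The frame is rebuilt inline with the duplicated SB/BC row for 2020-01-01, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "month": ["2020-01-01", "2020-02-01", "2020-02-01",
                  "2020-01-01", "2020-01-01"],
        "brand": ["BA", "BA", "BB", "BC", "BC"],
        "source": ["SA", "SA", "SA", "SB", "SB"],
        "volume1": [5, 10, 5, 5, 10],
        "volume2": [5, 10, 5, 5, 10],
    }
)

# Sum duplicates so that (source, brand, month) is unique.
grouped = df.groupby(["source", "brand", "month"]).sum()

# Pivot month into columns; missing (pair, month) cells become 0.
wide = grouped.unstack("month", fill_value=0)

# Move month back into the index: one row per observed pair and month.
filled = wide.stack("month")
print(filled)
```

Because only observed source/brand pairs ever appear as rows of the wide frame, no unwanted combinations are created; the intermediate frame has as many cells as the final result.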
I often find MultiIndexes more trouble than they are worth, so here is a 'straight', or at least traditional/relational, alternative to your index_fill_missing function.
Note: requires pandas >= 1.2 for .merge(..., how='cross')
Starting from the DataFrame in your most recent answer:
        month brand source  volume1  volume2
0  2020-01-01    BA     SA        5        5
1  2020-02-01    BA     SA       10       10
2  2020-02-01    BB     SA        5        5
3  2020-01-01    BC     SB        5        5
4  2020-01-01    BC     SB       10       10
The first step is to aggregate the values per month, summing any duplicate month/source/brand rows:
df = (df.groupby(['month','source','brand'], as_index=False)
        .agg({'volume1': 'sum', 'volume2': 'sum'}))
Create a 'basis' DataFrame that contains all months, crossed with all source/brand combinations that occur in the data:
months = pd.DataFrame(df.month.drop_duplicates())
source_brand_combinations = df[['source','brand']].drop_duplicates()
basis = months.merge(source_brand_combinations, how='cross')
Merge 'basis' with the source data to fill in the actual volumes where available, and use fillna(0) for the values that are not provided:
result = basis.merge( df, on=['month','source','brand'], how='left').fillna(0)
result[['volume1','volume2']] = result[['volume1','volume2']].astype(int)
        month source brand  volume1  volume2
0  2020-01-01     SA    BA        5        5
1  2020-01-01     SB    BC       15       15
2  2020-01-01     SA    BB        0        0
3  2020-02-01     SA    BA       10       10
4  2020-02-01     SB    BC        0        0
5  2020-02-01     SA    BB        5        5
...and if you want it with a MultiIndex:
result.set_index(['source','brand','month']).sort_index()
                         volume1  volume2
source brand month
SA     BA    2020-01-01        5        5
             2020-02-01       10       10
       BB    2020-01-01        0        0
             2020-02-01        5        5
SB     BC    2020-01-01       15       15
             2020-02-01        0        0
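Since the question asks for a generic process, the relational steps above could be wrapped in a helper along these lines. This is only a sketch: the function name fill_missing_relational and its parameters are hypothetical, it assumes every non-key column is numeric, and it needs pandas >= 1.2 for how='cross'.

```python
import pandas as pd


def fill_missing_relational(df, key_cols, fill_col, fill_value=0):
    """Hypothetical helper: one row per observed key combination and fill_col value."""
    all_cols = list(key_cols) + [fill_col]
    # Sum duplicates so each (keys, fill_col) combination appears once.
    summed = df.groupby(all_cols, as_index=False).sum()
    # Observed key combinations crossed with every fill_col value.
    combos = summed[list(key_cols)].drop_duplicates()
    fill_values = summed[[fill_col]].drop_duplicates()
    basis = combos.merge(fill_values, how="cross")
    # Left-merge the real data and fill the gaps in the value columns only.
    value_cols = [c for c in summed.columns if c not in all_cols]
    result = basis.merge(summed, on=all_cols, how="left")
    result[value_cols] = result[value_cols].fillna(fill_value)
    return result


df = pd.DataFrame(
    {
        "month": ["2020-01-01", "2020-02-01", "2020-02-01",
                  "2020-01-01", "2020-01-01"],
        "brand": ["BA", "BA", "BB", "BC", "BC"],
        "source": ["SA", "SA", "SA", "SB", "SB"],
        "volume1": [5, 10, 5, 5, 10],
        "volume2": [5, 10, 5, 5, 10],
    }
)

out = fill_missing_relational(df, ["source", "brand"], "month")
print(out)
```

The value columns come back as floats after the left merge; cast them with astype(int), as in the answer above, if integers are required.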