Reshaping data with dates as column values
I'm trying to reshape data with pandas and have been struggling to get it into the right format. Roughly, the data looks like this*:
df = pd.DataFrame({'PRODUCT':['1','2'],
'DESIGN_START':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17')],
'DESIGN_COMPLETE':[pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04')],
'PRODUCTION_START':[pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15')],
'PRODUCTION_COMPLETE':[np.nan,pd.Timestamp('2020-04-28')]})
print(df)
PRODUCT DESIGN_START DESIGN_COMPLETE PRODUCTION_START PRODUCTION_COMPLETE
0 1 2020-01-05 2020-01-22 2020-02-07 NaT
1 2 2020-01-17 2020-03-04 2020-03-15 2020-04-28
I'd like to reshape the data so that it looks like this:
reshaped_df = pd.DataFrame({'DATE':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17'),
pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04'),
pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15'),
np.nan,pd.Timestamp('2020-04-28')],
'STAGE':['design','design','design','design','production','production','production','production'],
'STATUS':['started','started','completed','completed','started','started','completed','completed']})
print(reshaped_df)
DATE STAGE STATUS
0 2020-01-05 design started
1 2020-01-17 design started
2 2020-01-22 design completed
3 2020-03-04 design completed
4 2020-02-07 production started
5 2020-03-15 production started
6 NaT production completed
7 2020-04-28 production completed
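How can I do this? Is there a better format to reshape it into?
Ultimately I'd like to do some group summaries on the data, such as counting the occurrences of each step, e.g.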
reshaped_df.groupby(['STAGE','STATUS'])['DATE'].count()
STAGE STATUS
design completed 2
started 2
production completed 1
started 2
Name: DATE, dtype: int64
Thanks
* The data actually contains many date start/stop columns for different stages of a manufacturing pipeline
Drop PRODUCT, change the columns to a MultiIndex, and stack them:
new_cols = pd.MultiIndex.from_product([['design', 'production'], ['started', 'completed']], names=['STAGE', 'STATUS'])
df.drop(columns='PRODUCT') \
.set_axis(new_cols, axis=1) \
.stack([0,1]) \
.groupby(['STAGE', 'STATUS']) \
.count()
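If only the counts are needed, a possible shortcut is to skip the stack and count non-null values column by column. This is a minimal sketch, not the answer's own code; it assumes the four date columns appear in exactly the order produced by from_product (design/started, design/completed, production/started, production/completed), since set_axis relabels by position:

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "PRODUCT": ["1", "2"],
    "DESIGN_START": pd.to_datetime(["2020-01-05", "2020-01-17"]),
    "DESIGN_COMPLETE": pd.to_datetime(["2020-01-22", "2020-03-04"]),
    "PRODUCTION_START": pd.to_datetime(["2020-02-07", "2020-03-15"]),
    "PRODUCTION_COMPLETE": [pd.NaT, pd.Timestamp("2020-04-28")],
})

# Assumption: column order matches the order of from_product below.
new_cols = pd.MultiIndex.from_product(
    [["design", "production"], ["started", "completed"]],
    names=["STAGE", "STATUS"],
)

# DataFrame.count() counts non-null cells per column, so NaT is skipped.
counts = df.drop(columns="PRODUCT").set_axis(new_cols, axis=1).count()
print(counts)
```

The production/completed entry comes out as 1 because the missing date is not counted.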
We can do pd.wide_to_long with stack, then reorder the df:
s=pd.wide_to_long(df,['DESIGN','PRODUCTION'],i='PRODUCT',j='STATUS',suffix=r'\w+',sep='_').\
stack(dropna=False).reset_index(level=[1,2]).sort_values('level_2').\
reset_index(drop=True).rename(columns={'level_2':'STAGE',0:'DATE'})
STATUS STAGE DATE
0 START DESIGN 2020-01-05
1 START DESIGN 2020-01-17
2 COMPLETE DESIGN 2020-01-22
3 COMPLETE DESIGN 2020-03-04
4 START PRODUCTION 2020-02-07
5 START PRODUCTION 2020-03-15
6 COMPLETE PRODUCTION NaT
7 COMPLETE PRODUCTION 2020-04-28
Melt it!!!
import pandas as pd
import numpy as np
df = pd.DataFrame({
'PRODUCT':['1','2'],
'DESIGN_START':[pd.Timestamp('2020-01-05'),pd.Timestamp('2020-01-17')],
'DESIGN_COMPLETE':[pd.Timestamp('2020-01-22'),pd.Timestamp('2020-03-04')],
'PRODUCTION_START':[pd.Timestamp('2020-02-07'),pd.Timestamp('2020-03-15')],
'PRODUCTION_COMPLETE':[np.nan,pd.Timestamp('2020-04-28')]
})
df = df.melt(id_vars=['PRODUCT'])
df_split = df['variable'].str.split('_', n=1, expand=True)
df['STAGE'] = df_split[0]
df['STATUS'] = df_split[1]
df.drop(columns=['variable'], inplace=True)
df = df.rename(columns={'value': 'DATE'})
print(df)
Output:
PRODUCT DATE STAGE STATUS
0 1 2020-01-05 DESIGN START
1 2 2020-01-17 DESIGN START
2 1 2020-01-22 DESIGN COMPLETE
3 2 2020-03-04 DESIGN COMPLETE
4 1 2020-02-07 PRODUCTION START
5 2 2020-03-15 PRODUCTION START
6 1 NaT PRODUCTION COMPLETE
7 2 2020-04-28 PRODUCTION COMPLETE
Muahahahaha!!! Feel the power of the melt!!!
Melt is basically an unpivot.
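To get from the melted frame to the grouped counts the question asks for, something along these lines should work. It is a sketch under the same sample data; the name long_df is just an illustrative choice, not from the answer:

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "PRODUCT": ["1", "2"],
    "DESIGN_START": pd.to_datetime(["2020-01-05", "2020-01-17"]),
    "DESIGN_COMPLETE": pd.to_datetime(["2020-01-22", "2020-03-04"]),
    "PRODUCTION_START": pd.to_datetime(["2020-02-07", "2020-03-15"]),
    "PRODUCTION_COMPLETE": [pd.NaT, pd.Timestamp("2020-04-28")],
})

# Wide -> long: one row per (product, original column name).
long_df = df.melt(id_vars="PRODUCT", var_name="variable", value_name="DATE")

# Split e.g. "DESIGN_START" into stage "DESIGN" and status "START".
parts = long_df["variable"].str.split("_", n=1, expand=True)
long_df["STAGE"] = parts[0]
long_df["STATUS"] = parts[1]
long_df = long_df.drop(columns="variable")

# count() ignores NaT, so unfinished steps are not counted.
counts = long_df.groupby(["STAGE", "STATUS"])["DATE"].count()
print(counts)
```

Keeping PRODUCT in the long frame also leaves room for per-product summaries later.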
Convert the columns to lowercase and split them on "_"; setting expand=True turns them into a MultiIndex:
df.columns = df.columns.str.lower().str.split('_',expand=True)
df.columns = df.columns.set_names(['stage','status'])
print(df)
product design production
NaN start complete start complete
0 1 2020-01-05 2020-01-22 2020-02-07 NaT
1 2 2020-01-17 2020-03-04 2020-03-15 2020-04-28
The next stage is a combination of stack, sort_values, droplevel, reset_index, and reindex:
res = (df
.stack([0,1])
.sort_values()
.droplevel(0)
.reset_index(name='Date')
.reindex(['Date','stage','status'],axis=1)
)
res
Date stage status
0 2020-01-05 design start
1 2020-01-17 design start
2 2020-01-22 design complete
3 2020-02-07 production start
4 2020-03-04 design complete
5 2020-03-15 production start
6 2020-04-28 production complete
If you are only interested in grouping and aggregating, you can skip the long route and go straight from the stack:
df.stack([0,1]).groupby(['stage','status']).count()
stage status
design complete 2
start 2
production complete 1
start 2
Name: Date, dtype: int64
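For reference, here is one way the whole MultiIndex route might look end to end. This is a sketch, and it makes one assumption not shown above: PRODUCT is moved into the index first, so that only the date columns get split and stacked. The explicit dropna() is there because, depending on the pandas version, stack may or may not drop missing values on its own (newer 2.x releases may also emit a FutureWarning about stack):

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "PRODUCT": ["1", "2"],
    "DESIGN_START": pd.to_datetime(["2020-01-05", "2020-01-17"]),
    "DESIGN_COMPLETE": pd.to_datetime(["2020-01-22", "2020-03-04"]),
    "PRODUCTION_START": pd.to_datetime(["2020-02-07", "2020-03-15"]),
    "PRODUCTION_COMPLETE": [pd.NaT, pd.Timestamp("2020-04-28")],
})

# Assumption: keep PRODUCT out of the columns before splitting them.
wide = df.set_index("PRODUCT")
wide.columns = wide.columns.str.lower().str.split("_", expand=True)
wide.columns = wide.columns.set_names(["stage", "status"])

# Stack both column levels into the index, giving one date per row.
stacked = wide.stack(["stage", "status"]).dropna()

res = (
    stacked
    .sort_values()
    .droplevel("PRODUCT")
    .reset_index(name="date")
    [["date", "stage", "status"]]
)
print(res)

# Counts per stage/status, taken directly from the stacked Series.
counts = stacked.groupby(level=["stage", "status"]).count()
print(counts)
```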
Update 2021/06/01:
You can use the pivot_longer function from pyjanitor to abstract the reshaping; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.rename(columns=str.lower).pivot_longer(
index="product",
names_sep="_",
names_to=("stage", "status"),
values_to="date",
)
product stage status date
0 1 design start 2020-01-05
1 2 design start 2020-01-17
2 1 design complete 2020-01-22
3 2 design complete 2020-03-04
4 1 production start 2020-02-07
5 2 production start 2020-03-15
6 1 production complete NaT
7 2 production complete 2020-04-28