How to subtract values within a hierarchical/multi-level index in Pandas within the same column

I'm trying to find the change in time within a particular session (my index). My dataframe looks like this:

                        time
sess_id     vis_id      

id1         vis_id1      t_0
            vis_id1      t_1
            vis_id1      t_2

id2         vis_id2      t_0
            vis_id2      t_1
            vis_id2      t_2

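I want to create a column called delta_t (change in time) that subtracts each timestamp from the next one within a session, where the last time in each session holds a filler character such as a dash or something similar: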

                        time      delta_t
sess_id     vis_id      

id1         vis_id1      t_0     (t_1 - t_0) 
            vis_id1      t_1     (t_2 - t_1)
            vis_id1      t_2         - 

id2         vis_id2      t_3     (t_4 - t_3)
            vis_id2      t_4     (t_5 - t_4)
            vis_id2      t_5         -

    
    

We can use groupby shift relative to level=0 (or level="sess_id") to get the value from the next row within each session, then subtract time from it:

df['delta_t'] = df.groupby(level='sess_id')['time'].shift(-1) - df['time']

Sample DataFrame and output:

                               time         delta_t
sess_id vis_id                                     
id1     vis_id1 2021-01-11 00:00:00 6 days 04:27:31
        vis_id1 2021-01-17 04:27:31 4 days 03:45:26
        vis_id1 2021-01-21 08:12:57             NaT
id2     vis_id2 2021-01-28 15:18:32 7 days 17:57:56
        vis_id2 2021-02-05 09:16:28 4 days 01:41:58
        vis_id2 2021-02-09 10:58:26             NaT
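To illustrate why the shift has to happen per session, here is a minimal sketch on a small made-up frame (the `demo` name, session labels, and timestamps are assumptions for illustration, not the asker's data). It compares the grouped shift with a plain, ungrouped shift:

```python
import pandas as pd

# Hypothetical two-session frame; the timestamps are invented for illustration.
idx = pd.MultiIndex.from_arrays(
    [['a', 'a', 'a', 'b', 'b'], ['v1', 'v1', 'v1', 'v2', 'v2']],
    names=['sess_id', 'vis_id'],
)
demo = pd.DataFrame(
    {'time': pd.to_datetime(['2021-01-01 00:00', '2021-01-01 00:10',
                             '2021-01-01 00:25', '2021-01-02 08:00',
                             '2021-01-02 09:30'])},
    index=idx,
)

# shift(-1) inside each session pulls the next timestamp up one row;
# the last row of a session has no successor, so it becomes NaT.
next_time = demo.groupby(level='sess_id')['time'].shift(-1)
demo['delta_t'] = next_time - demo['time']

# A plain (ungrouped) shift would instead leak the first timestamp of
# session 'b' into the last row of session 'a'.
leaky = demo['time'].shift(-1) - demo['time']

print(demo)
print(leaky)
```

With the grouped version, the last row of every session is NaT; with the ungrouped version, only the very last row of the whole frame is.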

We could also use groupby diff followed by groupby shift, but this involves two groupbys:

df['delta_t'] = (
    df.groupby(level='sess_id')['time'].diff()
        .groupby(level='sess_id').shift(-1)
)
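As a quick sanity check, the following sketch rebuilds the sample frame from this answer and confirms that the two variants give the same result. The variable names `one_groupby`, `two_groupbys`, and `same` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {'time': pd.to_datetime(['2021-01-11 00:00:00', '2021-01-17 04:27:31',
                             '2021-01-21 08:12:57', '2021-01-28 15:18:32',
                             '2021-02-05 09:16:28', '2021-02-09 10:58:26'])},
    index=pd.MultiIndex.from_arrays((['id1'] * 3 + ['id2'] * 3,
                                     ['vis_id1'] * 3 + ['vis_id2'] * 3),
                                    names=['sess_id', 'vis_id'])
)

# Variant 1: one groupby, shift the next timestamp up, then subtract.
one_groupby = df.groupby(level='sess_id')['time'].shift(-1) - df['time']

# Variant 2: diff gives (current - previous) per session; shifting that
# result up by one row per session turns it into (next - current).
two_groupbys = (
    df.groupby(level='sess_id')['time'].diff()
      .groupby(level='sess_id').shift(-1)
)

same = one_groupby.equals(two_groupbys)
print(same)
```

Since both produce the same column, the single-groupby version is probably the simpler choice.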

If '-' is needed in place of NaT, np.where can be used to convert the Timedelta values to strings and fill with '-':

# Calculate Delta
df['delta_t'] = df.groupby(level='sess_id')['time'].shift(-1) - df['time']
# Change dtype and add in '-' (stringify delta_t itself, not the time column)
df['delta_t'] = np.where(df['delta_t'].notna(), df['delta_t'].astype(str), '-')

Or we can convert to str and replace "NaT" with "-":

# Calculate Delta, convert to String, replace "NaT" with "-"
df['delta_t'] = (
        df.groupby(level='sess_id')['time'].shift(-1) - df['time']
).astype(str).replace('NaT', '-')

df:

                               time          delta_t
sess_id vis_id                                      
id1     vis_id1 2021-01-11 00:00:00  6 days 04:27:31
        vis_id1 2021-01-17 04:27:31  4 days 03:45:26
        vis_id1 2021-01-21 08:12:57                -
id2     vis_id2 2021-01-28 15:18:32  7 days 17:57:56
        vis_id2 2021-02-05 09:16:28  4 days 01:41:58
        vis_id2 2021-02-09 10:58:26                -
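One thing to keep in mind is that once '-' is filled in, delta_t holds strings rather than Timedelta values, so it can no longer be used for arithmetic. A possible sketch, assuming you want both, keeps the Timedelta column and adds a separate display column. The `delta_t_label` name is hypothetical, and `Series.where` is used here in place of `replace` to fill the placeholder:

```python
import pandas as pd

df = pd.DataFrame(
    {'time': pd.to_datetime(['2021-01-11 00:00:00', '2021-01-17 04:27:31',
                             '2021-01-21 08:12:57', '2021-01-28 15:18:32',
                             '2021-02-05 09:16:28', '2021-02-09 10:58:26'])},
    index=pd.MultiIndex.from_arrays((['id1'] * 3 + ['id2'] * 3,
                                     ['vis_id1'] * 3 + ['vis_id2'] * 3),
                                    names=['sess_id', 'vis_id'])
)

delta = df.groupby(level='sess_id')['time'].shift(-1) - df['time']

# Keep the real Timedelta values for later calculations...
df['delta_t'] = delta
# ...and build a separate string column with '-' wherever delta is missing.
df['delta_t_label'] = delta.astype(str).where(delta.notna(), '-')

# The Timedelta column still supports numeric work, e.g. seconds per gap.
seconds = df['delta_t'].dt.total_seconds()
print(df)
print(seconds)
```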

DataFrame constructor and imports:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'time': pd.to_datetime(['2021-01-11 00:00:00', '2021-01-17 04:27:31',
                             '2021-01-21 08:12:57', '2021-01-28 15:18:32',
                             '2021-02-05 09:16:28', '2021-02-09 10:58:26'])},
    index=pd.MultiIndex.from_arrays((['id1'] * 3 + ['id2'] * 3,
                                     ['vis_id1'] * 3 + ['vis_id2'] * 3),
                                    names=['sess_id', 'vis_id'])
)

@Henry Ecker beat me to it, but here's my answer:

import pandas as pd

t = [
    {'idx1': 'id1', 'idx2': 'vis_id1', 'val': 1},
    {'idx1': 'id1', 'idx2': 'vis_id2', 'val': 2},
    {'idx1': 'id1', 'idx2': 'vis_id3', 'val': 3},
    {'idx1': 'id2', 'idx2': 'vis_id4', 'val': 4},
    {'idx1': 'id2', 'idx2': 'vis_id5', 'val': 5},
    {'idx1': 'id2', 'idx2': 'vis_id6', 'val': 6},
]
df = pd.DataFrame(t).set_index(['idx1', 'idx2'])

# Within each idx1 group, sort by idx2, subtract each value from the
# next one, and pad the last row of the group with '-'
(df
 .reset_index()
 .groupby('idx1')
 .apply(lambda g: g
        .set_index('idx2')
        .sort_index()
        .pipe(lambda g:
              pd.Series(
                  data=g['val'].pipe(lambda s: s.iloc[1:].values - s.iloc[:-1].values).tolist() + ['-'],
                  index=g.index
              )
        )
 ))

Here's the output:

idx1  idx2   
id1   vis_id1    1
      vis_id2    1
      vis_id3     -
id2   vis_id4    1
      vis_id5    1
      vis_id6     -

That said, I think Henry's answer is more complete, because mine doesn't work if the IDs in the second level are the same, as they are in your example.
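For comparison, here is a minimal sketch of the groupby shift approach from the other answer applied to the numeric frame above. It rebuilds the same data directly with a MultiIndex. The `delta` name is illustrative, and the result keeps NaN rather than '-' in the last row of each group:

```python
import pandas as pd

df = pd.DataFrame(
    {'val': [1, 2, 3, 4, 5, 6]},
    index=pd.MultiIndex.from_arrays(
        [['id1'] * 3 + ['id2'] * 3,
         [f'vis_id{i}' for i in range(1, 7)]],
        names=['idx1', 'idx2'],
    ),
)

# Next value within each idx1 group minus the current value;
# the last row of each group has no successor and becomes NaN.
delta = df.groupby(level='idx1')['val'].shift(-1) - df['val']
print(delta)
```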

However, I would point out that, from a data-indexing standpoint, indexing multiple rows with the same (multi-level) id is bad practice.
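If a unique index is wanted, one possible sketch is to add a per-pair counter as a third index level. This assumes the row order within each session is meaningful, and the level name `step` is hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    {'time': pd.to_datetime(['2021-01-11 00:00:00', '2021-01-17 04:27:31',
                             '2021-01-21 08:12:57', '2021-01-28 15:18:32',
                             '2021-02-05 09:16:28', '2021-02-09 10:58:26'])},
    index=pd.MultiIndex.from_arrays((['id1'] * 3 + ['id2'] * 3,
                                     ['vis_id1'] * 3 + ['vis_id2'] * 3),
                                    names=['sess_id', 'vis_id'])
)

# Every (sess_id, vis_id) pair appears three times, so the index is not unique.
has_duplicates = not df.index.is_unique

# Number the rows within each (sess_id, vis_id) pair: 0, 1, 2, ...
step = df.groupby(level=['sess_id', 'vis_id']).cumcount().rename('step')

# Append the counter as a third index level to make every row addressable.
unique_df = df.set_index(step, append=True)

# The per-session delta works the same way on the unique index.
unique_df['delta_t'] = (
    unique_df.groupby(level='sess_id')['time'].shift(-1) - unique_df['time']
)
print(unique_df)
```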