Calculate diff() in Python on subsets of data within a DataFrame

Question

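I'm new to Python, coming from SAS. I want to compute a lagged variable between consecutive rows (a time difference, using diff()), but I want the calculation to restart each time a new individual is encountered. In SAS, this is done with dif() or lag() combined with a BY statement. Is there a similar way to do this in Python?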

Here is the data I want (note the missing value each time a new PIT value is encountered):

PIT Receiver    tottime     Lag
1   1   2015-01-21 12:00:00 
1   1   2015-01-21 12:00:05 5
1   1   2015-01-21 12:00:20 15
1   1   2015-01-21 12:00:30 10
1   1   2015-01-21 12:00:35 5
1   2   2015-01-22 12:00:35 86400
1   2   2015-01-22 12:00:50 15
1   2   2015-01-22 12:00:55 5
1   2   2015-01-22 12:01:05 10
1   2   2015-01-22 12:01:10 5
2   1   2015-01-12 12:01:10 
2   1   2015-01-12 12:01:15 5
2   2   2015-01-12 12:01:20 5
2   2   2015-01-12 12:01:25 5
2   2   2015-01-12 12:01:30 5

I tried it with this code:

import numpy as np
import pandas as pd

Clean['tottime'] = pd.to_datetime(Clean.tottime.values)   # Convert tottime to datetime values
tIndex = Clean.tottime.values                             # Vector of time values that will become part of a multi-index
arrays = [Clean.PIT.values, tIndex]                       # Both levels of the multi-index

index = pd.MultiIndex.from_arrays(arrays, names=['PIT', 'tottime'])   # Declare the multi-level index
Clean.index = index

Clean['lag'] = Clean.tottime.diff()                       # Difference in tottime between consecutive rows
Clean['lag'] = Clean['lag'] / np.timedelta64(1, 's')      # Convert 'lag' to a numeric (float64) value in seconds

But this produces something like the following (i.e., it works for the first rows, but then fails to recognize the new PIT value):

PIT Receiver    tottime    Lag
1   1   2015-01-21 12:00:00 
1   1   2015-01-21 12:00:05 5
1   1   2015-01-21 12:00:20 15
1   1   2015-01-21 12:00:30 10
1   1   2015-01-21 12:00:35 5
1   2   2015-01-22 12:00:35 86400
1   2   2015-01-22 12:00:50 15
1   2   2015-01-22 12:00:55 5
1   2   2015-01-22 12:01:05 10
1   2   2015-01-22 12:01:10 5
2   1   2015-01-12 12:01:10 -864000
2   1   2015-01-12 12:01:15 5
2   2   2015-01-12 12:01:20 5
2   2   2015-01-12 12:01:25 5
2   2   2015-01-12 12:01:30 5

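So it fails to reset at the new PIT, and I get a large negative number (10 days earlier). Eventually I'd like to be able to do this by both PIT and Receiver, but for now the challenge is to iterate this process over tottime, grouped by PIT. Any suggestions on how to do this?

Also, I suspect this is a subset of a common problem (by-group processing), but I don't know how to phrase the question in Python-speak, so I couldn't find it on Stack Overflow. Any guidance would be greatly appreciated.

Thanks!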

Answer 1

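One approach is to use pandas' groupby() functionality.

This is a slightly roundabout way of doing it, since I don't have your code, but you could try the following, assuming your DataFrame has the same format you showed, but without the Lag column.

First, create a function, diff_func, that will be applied to the groupby object.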

def diff_func(df):
    return df.diff()

Then use groupby():

# group_keys=False keeps the original row index so the result aligns with Clean
Clean['Lag'] = Clean.groupby('PIT', group_keys=False)['tottime'].apply(diff_func)

The line above groups Clean by the PIT column, tells pandas to apply the function to the tottime column, and then stores the result in a new column, Lag.
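
As a possible shortcut, a grouped column also exposes diff() directly, so the helper function may not be needed. The sketch below rebuilds a small frame modeled on the question's sample (the values and the column name Lag_by_receiver are illustrative assumptions, not the real dataset), computes the lag per PIT, and converts it to seconds with dt.total_seconds(). Passing a list of columns to groupby() is one way to restart the lag on both PIT and Receiver.

```python
import pandas as pd

# Illustrative sample modeled on the question's data (not the real dataset)
Clean = pd.DataFrame({
    "PIT": [1, 1, 1, 2, 2, 2],
    "Receiver": [1, 1, 2, 1, 2, 2],
    "tottime": pd.to_datetime([
        "2015-01-21 12:00:00",
        "2015-01-21 12:00:05",
        "2015-01-22 12:00:05",
        "2015-01-12 12:01:10",
        "2015-01-12 12:01:15",
        "2015-01-12 12:01:20",
    ]),
})

# diff() restarts within each PIT group; the first row of each group is NaT
lag = Clean.groupby("PIT")["tottime"].diff()

# Convert the timedeltas to float seconds (NaT becomes NaN)
Clean["Lag"] = lag.dt.total_seconds()

# Restart on every (PIT, Receiver) combination instead
Clean["Lag_by_receiver"] = (
    Clean.groupby(["PIT", "Receiver"])["tottime"].diff().dt.total_seconds()
)

print(Clean)
```

This assumes the rows are already sorted by time within each group; if not, sorting with sort_values(['PIT', 'tottime']) first would be needed for the differences to be meaningful.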

Answer 2

So you mean whenever your PIT differs from the previous row? That's simple:

df.loc[df.PIT != df.PIT.shift(1), 'Lag'] = 0
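
For context, this mask only helps once a plain, ungrouped diff() has already filled the Lag column. A minimal sketch of the whole sequence, assuming a frame named df that is already sorted by PIT and then by time (the sample values are made up). It fills the boundary rows with NaN rather than 0, since the question asked for missing values there.

```python
import numpy as np
import pandas as pd

# Made-up sample, already sorted by PIT and then by time
df = pd.DataFrame({
    "PIT": [1, 1, 1, 2, 2],
    "tottime": pd.to_datetime([
        "2015-01-21 12:00:00",
        "2015-01-21 12:00:05",
        "2015-01-21 12:00:20",
        "2015-01-12 12:01:10",
        "2015-01-12 12:01:15",
    ]),
})

# Plain diff over the whole column, in seconds; this ignores PIT boundaries
df["Lag"] = df["tottime"].diff().dt.total_seconds()

# True on the first row of each run of identical PIT values
new_pit = df["PIT"] != df["PIT"].shift(1)

# Overwrite the cross-PIT differences; replace np.nan with 0 to match the line above
df.loc[new_pit, "Lag"] = np.nan

print(df)
```

Unlike groupby(), this compares each row only with the one directly before it, so it relies on all rows for a given PIT being contiguous.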
