How to identify changes in a variable per person per time (in panel data)?

Question

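I have panel data (repeated observations per ID at different points in time). The data is unbalanced (there are gaps). I need to check, and possibly adjust, changes in a variable for each person over the years.

I tried two versions. First, a for-loop setup that accesses each person and then each of that person's years. Second, a one-line combination with groupby. Groupby looks more elegant to me. The main issue here is identifying the "next element". I assume that in a loop I could solve this with a counter.

Here is my MWE panel data: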

import pandas as pd
df = pd.DataFrame({'year': ['2003', '2004', '2005', '2006', '2007', '2008', '2009','2003', '2004', '2005', '2006', '2007', '2008', '2009'],
                   'id': ['1', '1', '1', '1', '1', '1', '1', '2', '2', '2', '2', '2', '2', '2'],
                   'money': ['15', '15', '15', '16', '16', '16', '16', '17', '17', '17', '18', '17', '17', '17']}).astype(int)
df

The time series for each person looks like this:

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

fig, ax = plt.subplots()

# Draw one line plus scatter markers per person
for i in df['id'].unique():
    person = df[df['id'] == i]
    person.plot.line(x='year', y='money', ax=ax, label='id = %s' % i)
    person.plot.scatter(x='year', y='money', ax=ax)

# Set the tick labels once, after all series are drawn
plt.xticks(np.unique(df['year']), rotation=45)

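Here is what I want to achieve: for each person (id), compare the values of the time series and drop every successor that differs from its precursor (identify the red circles). Then I would like to try different strategies to handle it:

Drop (very uncertain): if the successor differs, drop it
Smooth (absolute value): if the successor differs by (say) 1 unit, assign it the precursor's value
Smooth (relative value): if the successor differs by (say) 1%, assign it the precursor's value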
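
The comparison described above could be sketched as follows. This is only a minimal illustration, assuming the rows can be sorted by year within each id; the helper column `changed` and the shortened sample data are hypothetical and not part of the original post. It flags every observation whose value differs from the previous observation of the same person.

```python
import pandas as pd

# Shortened sample data (hypothetical, same layout as the MWE)
df = pd.DataFrame({
    'year': [2003, 2004, 2005, 2006, 2003, 2004, 2005, 2006],
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'money': [15, 15, 16, 16, 17, 18, 17, 17],
})

# Sort so that "previous row" means "previous observed year" for each person
df = df.sort_values(['id', 'year']).reset_index(drop=True)

# Difference to the previous observation of the same id (NaN for the first one)
diff = df.groupby('id')['money'].diff()

# 'changed' is a hypothetical helper column: True where the value differs
# from its precursor; the first observation per id counts as unchanged
df['changed'] = diff.fillna(0).ne(0)

# The flagged rows correspond to the "red circles"
flagged = df.loc[df['changed'], ['id', 'year', 'money']]
print(flagged)
```

Note that a one-off spike is flagged twice with this approach: once when the value jumps and once when it returns to the old level. With unbalanced data, "previous observation" means the previous observed year, which may lie more than one year back.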

Dropping

df['money_difference'] = df['money']-df.groupby('id')['money'].shift(1)
df_new = df.drop(df[df['money_difference'].abs()>0].index)

Idea for smoothing

# keep track of change of variable by person and time
df['money_difference'] = df['money']-df.groupby('id')['money'].shift(1)
# first element has no precursor, it will be NaN, replace this by 0
df = df.fillna(0)
# now: whenever change_of_variable exceeds a threshold, replace the value by its precursor - not working so far
df['money'] = np.where(abs(df['money_difference'])>=1, df['money'].shift(1), df['money'])

Answer 1

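To get the next observation for each person, you can combine groupby and shift, then subtract the current value. Group by id only: grouping by both year and id puts every row in its own group, so the shifted column would contain only NaN. This assumes the rows are sorted by year within each id:

df['money_difference'] = df.groupby('id')['money'].shift(-1) - df['money']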
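
The two smoothing strategies from the question could be sketched with a per-id shift as well. This is a sketch under stated assumptions, not a definitive implementation: the function name `smooth_by_precursor` and its parameters `abs_tol` and `rel_tol` are hypothetical, the threshold is read as "replace when the change reaches the threshold" (as in the question's code comment), and each value is compared with its original, unsmoothed precursor in a single pass.

```python
import numpy as np
import pandas as pd

def smooth_by_precursor(df, abs_tol=None, rel_tol=None):
    """Hypothetical helper: replace 'money' by its precursor within the same
    id when the absolute or relative change reaches the given threshold."""
    if (abs_tol is None) == (rel_tol is None):
        raise ValueError("pass exactly one of abs_tol or rel_tol")
    out = df.sort_values(['id', 'year']).copy()
    # Precursor within the same id; NaN for each person's first observation
    prev = out.groupby('id')['money'].shift(1)
    change = (out['money'] - prev).abs()
    if abs_tol is not None:
        exceeds = change >= abs_tol
    else:
        # Relative change; assumes the precursor is non-zero
        exceeds = (change / prev.abs()) >= rel_tol
    # Comparisons with NaN are False, so first observations are never replaced
    smoothed = np.where(exceeds, prev, out['money'])
    out['money'] = smoothed.astype(df['money'].dtype)
    return out

# Shortened sample data (hypothetical, same layout as the MWE)
df = pd.DataFrame({
    'year': [2003, 2004, 2005, 2006, 2003, 2004, 2005, 2006],
    'id': [1, 1, 1, 1, 2, 2, 2, 2],
    'money': [15, 15, 16, 16, 17, 18, 17, 17],
})

abs_smoothed = smooth_by_precursor(df, abs_tol=1)     # 1 unit
rel_smoothed = smooth_by_precursor(df, rel_tol=0.01)  # 1 percent
print(abs_smoothed)
```

Because every value is compared with the original precursor, a change is delayed by one period rather than removed: a spike moves one year forward instead of disappearing. Carrying the already smoothed value forward would need a sequential pass per id instead of a single vectorized shift.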
