Pandas time series manipulation with large panel data

Question

Here is my large panel dataset:

Date	x1	x2	x3
2017-07-20	50	60	Kevin
2017-07-21	51	80	Kevin
2016-05-23	100	200	Cathy
2016-04-20	20	20	Cathy
2019-01-02	50	60	Leo

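This dataset contains one billion rows. I want to compute the 1-day percentage difference between x1 and x2. Let t and t+1 denote today and tomorrow; the quantity I want is (x1_{t+1} - x2_t) / x2_t.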

First, I tried the quickest way to write it.

I created a nested list containing all the target values for each x3 group:

nested_list = []
flatten_list = []

for group in df.x3.unique():
    df_ = df[df.x3 == group]
    nested_list.append((df_.x1.shift(-1) / df_.x2) / df_.x2)
for lst in nested_list:
    for i in lst:
        flatten_list.append(i)

df["target"] = flatten_list

However, this method would literally take a year to run, which is not feasible.

I also tried the native pandas groupby method, hoping it would finish in a reasonable time, but it does not seem to work:

def target_calculation(x):
    target = (x.x1.shift(-1) - x.x2) / x.x2
    return target

df["target"] = df.groupby("x3")[["x1", "x2"]].apply(target_calculation)

How can I do this calculation without a for loop, or possibly vectorize the whole process?

Answer 1

You can groupby + shift "x1" and then subtract "x2" from it:

df['target'] = (df.groupby('x3')['x1'].shift(-1) - df['x2']) / df['x2']

Output:

         Date   x1   x2     x3  target
0  2017-07-20   50   60  Kevin   -0.15
1  2017-07-21   51   80  Kevin     NaN
2  2016-05-23  100  200  Cathy   -0.90
3  2016-04-20   20   20  Cathy     NaN
4  2019-01-02   50   60    Leo     NaN
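
To make the one-liner above easy to try, here is a minimal self-contained sketch that rebuilds the five sample rows from the question and applies the same groupby + shift step. The intermediate name `next_x1` is only introduced here for readability; it is not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Date": ["2017-07-20", "2017-07-21", "2016-05-23", "2016-04-20", "2019-01-02"],
    "x1": [50, 51, 100, 20, 50],
    "x2": [60, 80, 200, 20, 60],
    "x3": ["Kevin", "Kevin", "Cathy", "Cathy", "Leo"],
})

# x1 of the next row within the same x3 group.
# The last row of each group has no next row, so it becomes NaN.
next_x1 = df.groupby("x3")["x1"].shift(-1)

# (x1_{t+1} - x2_t) / x2_t, computed for all rows at once
df["target"] = (next_x1 - df["x2"]) / df["x2"]

print(df)
```

Because the grouped shift returns a Series aligned to the original index, the result can be combined with `df["x2"]` directly, with no Python-level loop over groups.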

Note

(df.groupby('x3')['x1'].shift(-1) / df['x2']) / df['x2']

produces output equivalent to flatten_list, but I believe that is a typo rather than the output you actually want.
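
One caveat about the shift-based approach: `shift(-1)` takes the next row in the current row order, not the next calendar date. In the sample data, Cathy's rows are listed newest first, so "t+1" would actually be an earlier date. The sketch below assumes you want each row compared with the next later observation in its group, so it sorts by `x3` and `Date` before shifting. The name `df_sorted` is hypothetical, and the sketch does not check that consecutive rows are exactly one day apart.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2017-07-20", "2017-07-21", "2016-05-23", "2016-04-20", "2019-01-02"],
    "x1": [50, 51, 100, 20, 50],
    "x2": [60, 80, 200, 20, 60],
    "x3": ["Kevin", "Kevin", "Cathy", "Cathy", "Leo"],
})

# Parse dates so sorting is chronological rather than lexical
df["Date"] = pd.to_datetime(df["Date"])

# Within each x3 group, order rows from oldest to newest
df_sorted = df.sort_values(["x3", "Date"])

# Next (later) observation of x1 within each group
next_x1 = df_sorted.groupby("x3")["x1"].shift(-1)

# Assignment aligns on the index, so each value lands on its original row
df["target"] = (next_x1 - df_sorted["x2"]) / df_sorted["x2"]

print(df)
```

With this ordering, Cathy's 2016-04-20 row gets a value and her 2016-05-23 row becomes NaN, the reverse of the unsorted result.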

Tags: python, dataframe, pandas, pandas-groupby