Pandas: Aggregating column slices as arrays

I have a Pandas DataFrame that looks like this:

                      Scaled
               Date 
2020-07-01 02:40:00 0.604511
2020-07-01 02:45:00 0.640577
2020-07-01 02:50:00 0.587683
2020-07-01 02:55:00 0.491515
....

I am trying to add a new column named X that should look like this, where each row holds the two preceding values as an array:

                      Scaled   X
               Date 
2020-07-01 02:40:00 0.604511 nan
2020-07-01 02:45:00 0.640577 nan
2020-07-01 02:50:00 0.587683 [0.604511 0.640577]
2020-07-01 02:55:00 0.491515 [0.640577 0.587683]
...

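I am trying to do this with the for loop below, but it does not work as expected, and I doubt it is the most elegant or efficient approach. Are there any suggestions for how to do this in pandas?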

window_size = 2
for i in range(window_size, df.shape[0]):
    df['X'][i] = df['Scaled'][i - window_size:window_size] 

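Your idea of using a for loop is on the right track.

First, you have to initialize the new column, which you can do with .apply() on the DataFrame.

Then you can iterate over the rows of the DataFrame with .iterrows(), building the required array as you go.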

import pandas as pd

df = pd.DataFrame(data={'Date': ['2020-07-01 02:40:00', '2020-07-01 02:45:00', '2020-07-01 02:50:00', '2020-07-01 02:55:00'], 'Scaled': [0.604511, 0.640577, 0.587683, 0.491515]})

# Initialize the new column with NaN; object dtype lets a cell hold a list
df['New_col'] = df['Scaled'].apply(lambda x: float("nan")).astype(object)

for i, val in df.iterrows():
  if i == 0 or i == 1:
    # The first two rows have no two preceding values
    scaled_a = None
    scaled_b = None
  else:
    scaled_a = float(df.at[i - 2, 'Scaled'])
    scaled_b = float(df.at[i - 1, 'Scaled'])
  # .at sets a single cell and avoids chained assignment
  df.at[i, 'New_col'] = [scaled_a, scaled_b]

Each row of the new column simply takes the values of the Scaled column at the two preceding indexes and stores them in a list. Hope this helps!

    Date                Scaled      New_col
0   2020-07-01 02:40:00 0.604511    [None, None]
1   2020-07-01 02:45:00 0.640577    [None, None]
2   2020-07-01 02:50:00 0.587683    [0.604511, 0.640577]
3   2020-07-01 02:55:00 0.491515    [0.640577, 0.587683]

The result should look like this. ^^
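For comparison, the loop in the question fails for two reasons: the column X does not exist before it is indexed, and the slice ends at window_size instead of at i, so it shrinks to nothing after the first iteration. A minimal sketch of a corrected loop follows, assuming the dates are the index as in the question; it keeps a plain NaN in the first rows, as the question's desired output shows.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the dates as the index
df = pd.DataFrame(
    {"Scaled": [0.604511, 0.640577, 0.587683, 0.491515]},
    index=pd.to_datetime([
        "2020-07-01 02:40:00", "2020-07-01 02:45:00",
        "2020-07-01 02:50:00", "2020-07-01 02:55:00",
    ]),
)
df.index.name = "Date"

window_size = 2
scaled = df["Scaled"].to_numpy()

# Start with NaN everywhere; rows without a full history stay NaN
x = [np.nan] * len(df)
for i in range(window_size, len(df)):
    # The slice must end at i (exclusive), not at window_size
    x[i] = scaled[i - window_size:i].tolist()

# object dtype lets each cell hold a list
df["X"] = pd.Series(x, index=df.index, dtype=object)
print(df)
```

Building the values in a plain Python list and assigning the column once avoids writing into the DataFrame cell by cell.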

Updated: same output. This is a pandas implementation. It uses numpy to build the lists from the DataFrame column, which is very efficient.

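import datetime as dt
import random
from decimal import Decimal

import numpy as np
import pandas as pd

d = list(pd.date_range(dt.datetime(2020,7,1), dt.datetime(2020,7,2), freq="15min"))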
df = pd.DataFrame({"Date":d, 
      "Scaled":[round(Decimal(random.uniform(0, 1)),6) for x in d]})


# generate two new arrays that are shifted version of *scaled*
a1 = np.roll(df["Scaled"],1)
a1[0:2] = None
a2 = np.roll(df["Scaled"],2)
a2[0:2] = None
# combine them into a list and put back into df
df['X'] = np.vstack((a2, a1)).T.tolist()

print(df[:10].to_string(index=False))

Output

               Date    Scaled                     X
2020-07-01 00:00:00  0.396534          [None, None]
2020-07-01 00:15:00  0.890777          [None, None]
2020-07-01 00:30:00  0.241534  [0.396534, 0.890777]
2020-07-01 00:45:00  0.800615  [0.890777, 0.241534]
2020-07-01 01:00:00  0.161382  [0.241534, 0.800615]
2020-07-01 01:15:00  0.727410  [0.800615, 0.161382]
2020-07-01 01:30:00  0.146833  [0.161382, 0.727410]
2020-07-01 01:45:00  0.925441  [0.727410, 0.146833]
2020-07-01 02:00:00  0.770211  [0.146833, 0.925441]
2020-07-01 02:15:00  0.310082  [0.925441, 0.770211]
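The same idea can be extended to any window size without one np.roll call per lag. The sketch below is one possible variant, not part of the answer above: it uses numpy's sliding_window_view (available from NumPy 1.20) and assumes the column holds plain floats. The helper name previous_windows is made up for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def previous_windows(values, window_size):
    """Return a list whose entry i holds the window_size values before position i."""
    values = np.asarray(values, dtype=float)
    # Too short for even one complete window followed by a row
    if len(values) <= window_size:
        return [np.nan] * len(values)
    # Each row of `windows` is one run of window_size consecutive values
    windows = sliding_window_view(values, window_size)
    # The last window has no following row to attach to, so drop it
    filled = windows[:-1].tolist()
    # The first window_size rows have no complete history
    return [np.nan] * window_size + filled


df = pd.DataFrame({"Scaled": [0.604511, 0.640577, 0.587683, 0.491515]})
df["X"] = previous_windows(df["Scaled"], window_size=2)
print(df)
```

Here the incomplete rows hold a single NaN rather than [None, None]; which of the two is preferable depends on how the column is consumed later.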

Here is a version without a for loop. First, create the DataFrame:

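from io import StringIO

import numpy as np
import pandas as pd

data = '''Date  Scaled
2020-07-01 02:40:00  0.604511
2020-07-01 02:45:00  0.640577
2020-07-01 02:50:00  0.587683
2020-07-01 02:55:00  0.491515
'''
# Columns are separated by two or more spaces; the date itself contains a single space
df = pd.read_csv(StringIO(data), sep=r'\s{2,}', engine='python')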

Next, use shift() to get the previous values; a lambda function either builds a 2-element list or yields a single NaN:

f = lambda a, b: np.nan if np.isnan(a) or np.isnan(b) else [a, b]

window_size = 2

t = (pd.concat([df['Scaled'].shift(window_size).rename('a'), 
                df['Scaled'].shift(window_size - 1).rename('b')], axis=1
          )
       .apply(lambda x: f(x['a'].round(6), x['b'].round(6)), axis=1))

print(t)

0                     NaN
1                     NaN
2    [0.604511, 0.640577]
3    [0.640577, 0.587683]
dtype: object
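The two shift() calls above are tied to a window of 2. A sketch of how the same approach might generalize to any window size is shown below; the helper name shifted_windows is hypothetical, and it assumes the input series itself contains no NaN (a row whose window includes a missing value is returned as NaN, as with the lambda above).

```python
import numpy as np
import pandas as pd


def shifted_windows(series, window_size):
    """Collect the window_size preceding values of each row into a list."""
    # One lagged copy per step back, ordered oldest first
    lagged = pd.concat(
        [series.shift(k) for k in range(window_size, 0, -1)], axis=1
    )
    # A row is complete only when none of its lagged values is missing
    complete = lagged.notna().all(axis=1)
    rows = lagged.to_numpy().tolist()
    result = [row if ok else np.nan for row, ok in zip(rows, complete)]
    return pd.Series(result, index=series.index, dtype=object)


scaled = pd.Series([0.604511, 0.640577, 0.587683, 0.491515], name="Scaled")
print(shifted_windows(scaled, window_size=2))
print(shifted_windows(scaled, window_size=3))
```

Because shift() only moves values without recomputing them, no rounding step is needed in this variant.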

To do this in pandas, you can use a list comprehension together with concat and shift:

window_size = 2
df['X'] = (pd.concat([df.Scaled.shift(-i) for i in range(window_size)], axis=1)
             .shift(window_size).values.tolist())

Out[213]:
     Scaled                               X
0  0.604511                      [nan, nan]
1  0.640577                      [nan, nan]
2  0.587683  [0.604511, 0.6405770000000001]
3  0.491515  [0.6405770000000001, 0.587683]