Python

Question

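I have an Excel template that tracks a battery's state of charge (hourly) based on a column of energy surplus or shortfall. It also tracks the charging and discharging of the battery.

I am trying to convert this to Python (using pandas) at a much larger scale (~40 million rows).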

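The logic is as follows:

If there is a surplus, charge.
If there is a shortfall, discharge.
The maximum state of charge (soc) can only be 80, and the minimum state of charge must be 0.
At most 20 can be charged or discharged in one hour.
We start with a soc of 80.

The data looks like this: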

hour | surplus_shortfall
------------------------
1      15
2     -84
3     -70
4     -60
5     -50
6     -30
7      10
8      36
9      45
10     60
11     22
12    -10
13    -23
14    -8

We can use np.where to create the max_charge and max_discharge columns; for example, data['max_charge'] = np.where(data['surplus_shortfall'] > 0, np.minimum(data['surplus_shortfall'], 20), 0)
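As a minimal sketch of that step, the snippet below builds the sample data and derives both rate-limited columns. It uses np.minimum (the element-wise minimum) because np.min reduces an array and would interpret a second positional argument as an axis. The constant name MAX_RATE is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

MAX_RATE = 20  # hypothetical name: most that can be charged or discharged in one hour

# Sample data from the question.
data = pd.DataFrame({
    "hour": range(1, 15),
    "surplus_shortfall": [15, -84, -70, -60, -50, -30, 10, 36, 45, 60, 22, -10, -23, -8],
})

s = data["surplus_shortfall"]
# Surplus hours: charge at most MAX_RATE; all other hours get 0.
data["max_charge"] = np.where(s > 0, np.minimum(s, MAX_RATE), 0)
# Shortfall hours: discharge at most MAX_RATE, stored as a positive amount; all other hours get 0.
data["max_discharge"] = np.where(s < 0, np.minimum(-s, MAX_RATE), 0)

print(data)
```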

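I also need columns that track the amount actually charged, actual_charge (because, recall, the soc cannot exceed 80), and the amount actually discharged, actual_discharge (because the soc cannot go below 0). Finally, I need columns for initial_soc and end_soc.

I will define the data points for the first row below.

For the first row, we can define the following: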

actual_charge will always be 0, data.loc[0, 'actual_charge'] = 0, because we start fully charged
actual_discharge will be data.loc[0, 'actual_discharge'] = np.where(data.loc[0, 'max_discharge'] == 0, 0, data.loc[0, 'max_discharge'])
initial_soc is defined as data.loc[0, 'initial_soc'] = 80
end_soc is defined as data.loc[0, 'end_soc'] = data.loc[0, 'initial_soc'] + data.loc[0, 'actual_charge'] - data.loc[0, 'actual_discharge']

Now, the resulting table looks like this:

hour | surplus_shortfall | initial_soc | max_charge | max_discharge | actual_charge | actual_discharge | end_soc
-----------------------------------------------------------------------------------------------------------------
1       15                 80            15            0               0              0                 80            
2      -84                               0             20
3      -70                               0             20
4      -60                               0             20
5      -50                               0             20
6      -30                               0             20
7       10                               10            0
8       36                               20            0
9       45                               20            0  
10      60                               20            0
11      22                               20            0
12     -10                               0             10
13     -23                               0             20
14     -8                                0             8

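What I want to do is fill in the rest of the rows in the same way. The problem is that initial_soc depends on the end_soc of the previous row.

If I had a pseudo-algorithm to do this, it would look like this: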

for i in range(len(data)):
    if i == 0:
        continue  # the first row was filled in by hand above
    # define initial_soc as the end_soc of the previous row
    data.loc[i, 'initial_soc'] = data.loc[i - 1, 'end_soc']
    # define actual_discharge (the soc cannot drop below 0)
    if data.loc[i, 'initial_soc'] != 0:
        data.loc[i, 'actual_discharge'] = min(data.loc[i, 'max_discharge'], data.loc[i, 'initial_soc'])
    else:
        data.loc[i, 'actual_discharge'] = 0
    # define actual_charge (the soc cannot exceed 80)
    if data.loc[i, 'initial_soc'] < 80:
        data.loc[i, 'actual_charge'] = min(data.loc[i, 'max_charge'], 80 - data.loc[i, 'initial_soc'])
    elif data.loc[i, 'initial_soc'] == 80:
        data.loc[i, 'actual_charge'] = 0
    # calculate end_soc
    data.loc[i, 'end_soc'] = data.loc[i, 'initial_soc'] + data.loc[i, 'actual_charge'] - data.loc[i, 'actual_discharge']

The resulting table would look like this:

hour | surplus_shortfall | initial_soc | max_charge | max_discharge | actual_charge | actual_discharge | end_soc
-----------------------------------------------------------------------------------------------------------------
1      15                  80            15           0               0               0                  80            
2      -84                 80            0            20              0               20                 60     
3      -70                 60            0            20              0               20                 40 
4      -60                 40            0            20              0               20                 20              
5      -50                 20            0            20              0               20                 0
6      -30                 0             0            20              0               0                  0 
7       10                 0             10           0               10              0                  10            
8       36                 10            20           0               20              0                  30            
9       45                 30            20           0               20              0                  50
10      60                 50            20           0               20              0                  70
11      22                 70            20           0               10              0                  80 
12     -10                 80            0            10              0               10                 70
13     -23                 70            0            20              0               20                 50
14     -8                  50            0            8               0               8                  42

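I am not married to having these exact columns. What really matters is tracking the SOC in some way and then knowing how much was actually charged or discharged each hour.

I tried to vectorize this using some combination of .cumsum() and .clip(), but had no luck.

Any ideas on how to solve this without a clunky loop (again, the 40 million rows I have would make that very tedious)?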
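To illustrate why a single cumulative sum followed by one clip does not reproduce the spreadsheet, here is a small sketch on the sample data. The names MAX_SOC, MAX_RATE, naive_soc and stepwise_soc are assumptions made for this example; the "naive" variant is only one plausible reading of the cumsum/clip attempt.

```python
import numpy as np

MAX_SOC, MAX_RATE = 80, 20  # hypothetical constant names
surplus_shortfall = np.array([15, -84, -70, -60, -50, -30, 10, 36, 45, 60, 22, -10, -23, -8])

# Hourly change the battery could accept if it were never full or empty.
deltas = np.clip(surplus_shortfall, -MAX_RATE, MAX_RATE)

# Naive attempt: accumulate every change first, then clip once at the end.
naive_soc = np.clip(MAX_SOC + np.cumsum(deltas), 0, MAX_SOC)

# Reference: clip after every single step, as the spreadsheet logic does.
stepwise_soc = []
soc = MAX_SOC
for d in deltas:
    soc = min(max(soc + d, 0), MAX_SOC)
    stepwise_soc.append(int(soc))

print(naive_soc.tolist())
print(stepwise_soc)
```

The two results already diverge in hour 2 (75 versus 60): the cumulative sum remembers the 15 units of surplus from hour 1 that the full battery could not actually store.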

Answer 1

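Well, I couldn't get rid of the for loop, because your calculation is sequential (each hour's SOC depends on the previous hour's clipped SOC), and vectorization is for code whose elements can be computed independently of one another. I think I was able to simplify (and maybe speed up? I haven't timed it) some of your code using native Pandas and NumPy methods, which meant I could get rid of all the if statements in your pseudocode. I also removed some columns that seemed entirely unnecessary (explanation at the bottom).

Here is my code: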

import numpy as np

max_soc = 80
max_charge_rate = 20

# df is assumed to already hold the hourly data with a "surplus_shortfall" column.

# I combined max_charge and max_discharge into 1 column.
# Positive indicates a charge value; negative indicates a discharge value.
df["max_charge_or_discharge"] = df["surplus_shortfall"].clip(lower=-max_charge_rate, upper=max_charge_rate)

# soc_values will eventually contain all the end_soc values.
# I am forgoing initial_soc. The leading max_soc is the starting SOC, not a row of df.
soc_values = [max_soc]

# soc_diffs is a combination of actual_charge and actual_discharge, again with charge values being positive
# and discharge values being negative.
soc_diffs = []
max_charge_or_discharge_np = df["max_charge_or_discharge"].to_numpy()

for i in range(len(df)):
    last_soc_val = soc_values[-1]
    # Apply this hour's rate-limited change, then keep the SOC within [0, max_soc].
    soc_val = np.clip(last_soc_val + max_charge_or_discharge_np[i], a_min=0, a_max=max_soc)

    soc_diffs.append(soc_val - last_soc_val)
    soc_values.append(soc_val)

# Add columns to df, dropping the starting SOC at position 0.
df["end_soc_values"] = soc_values[1:]
df["soc_diffs"] = soc_diffs
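As a self-contained check of this approach against the table in the question, the sketch below wraps the same loop in a function and runs it on the sample data. The function name track_soc and its parameters are hypothetical, and it swaps the scalar np.clip call for the built-in min/max on plain Python ints; whether that is faster at 40 million rows was not timed here.

```python
import pandas as pd

def track_soc(deltas, max_soc=80, start_soc=80):
    """Hypothetical helper: return (end_soc, soc_diffs) for rate-limited hourly changes."""
    end_soc, soc_diffs = [], []
    soc = start_soc
    for d in deltas:
        new_soc = min(max(soc + d, 0), max_soc)  # clamp to [0, max_soc] at every step
        soc_diffs.append(new_soc - soc)          # positive = charged, negative = discharged
        end_soc.append(new_soc)
        soc = new_soc
    return end_soc, soc_diffs

# Sample data from the question.
df = pd.DataFrame({
    "surplus_shortfall": [15, -84, -70, -60, -50, -30, 10, 36, 45, 60, 22, -10, -23, -8],
})
df["max_charge_or_discharge"] = df["surplus_shortfall"].clip(lower=-20, upper=20)

# .tolist() hands the loop plain Python ints rather than NumPy scalars.
df["end_soc_values"], df["soc_diffs"] = track_soc(df["max_charge_or_discharge"].tolist())
print(df)
```

On the sample data, end_soc_values matches the end_soc column of the expected table in the question.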

Here is a small bonus section on how to get the max_charge, max_discharge, actual_charge, and actual_discharge columns as they appear in your version of df, from the values in my final version of df:

max_charge = df["max_charge_or_discharge"].mask(df["max_charge_or_discharge"] < 0, 0)
max_discharge = df["max_charge_or_discharge"].mask(df["max_charge_or_discharge"] >= 0, 0) * -1

actual_charge = df["soc_diffs"].mask(df["soc_diffs"] < 0, 0)
actual_discharge = df["soc_diffs"].mask(df["soc_diffs"] >= 0, 0) * -1
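An equivalent way to split a signed column into its two non-negative parts is Series.clip(lower=0). The sketch below applies it to the soc_diffs values that the sample data produces; it is an alternative spelling of the same idea, not a change to the result.

```python
import pandas as pd

# soc_diffs for the sample data (positive = charge, negative = discharge).
soc_diffs = pd.Series([0, -20, -20, -20, -20, 0, 10, 20, 20, 20, 10, -10, -20, -8])

actual_charge = soc_diffs.clip(lower=0)        # keep positive parts, zero elsewhere
actual_discharge = (-soc_diffs).clip(lower=0)  # keep negative parts as positive amounts

print(pd.DataFrame({"actual_charge": actual_charge, "actual_discharge": actual_discharge}))
```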

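Reasons for combining/getting rid of columns:

In your version, max_charge = 0 whenever max_discharge != 0, and vice versa. That wastes space. Since letting positive values mean one thing and negative values mean another works perfectly well for the surplus_shortfall column, there is no reason the max_charge_or_discharge column can't follow the same logic.
I decided not to include initial_soc because that column is exactly the same as end_soc, just shifted by 1 row. That is 40 million extra values only because initial_soc has one additional 80 at the top. Given that each row's initial_soc is just that row's end_soc - soc_diffs, the initial_soc column doesn't even tell you anything new; it is unnecessary and wastes space.
Honestly, if you can note somewhere that the first row's initial_soc is 80, you don't even need the soc_diffs column.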

If df has no soc_diffs column, you can find soc_diffs like this:

end_soc_values_np = df["end_soc_values"].to_numpy()
end_soc_values_np = np.concatenate(([max_soc], end_soc_values_np))
soc_diffs = end_soc_values_np[1:] - end_soc_values_np[:-1]
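Both derived quantities can also be recovered on demand with built-in helpers. The sketch below uses Series.shift with fill_value for initial_soc and np.diff with prepend for soc_diffs, on the end-of-hour SOC values from the sample data; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

max_soc = 80
# End-of-hour SOC values for the sample data.
end_soc = pd.Series([80, 60, 40, 20, 0, 0, 10, 30, 50, 70, 80, 70, 50, 42])

# initial_soc is end_soc moved down one row, with the starting SOC filling the gap.
initial_soc = end_soc.shift(1, fill_value=max_soc)

# np.diff with prepend gives the same differences as concatenating and slicing.
soc_diffs = np.diff(end_soc.to_numpy(), prepend=max_soc)

print(initial_soc.tolist())
print(soc_diffs.tolist())
```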

Let me know if you have any questions.

Python - Vectorize an iterative algorithm for battery state of charge tracker

numpy

pandas