Python Pandas Dataframe - Nothing being returned from my function

Question

I have two dataframes:

energy_calculated (the time_stamp columns are formatted with only 3 decimal places, to make sure no hidden values are breaking the simple math):

    fl_key min_time_stamp   max_time_stamp      energy
0    10051 1614556800019.000 1614556807979.000   0.352
1    10051 1614556808019.000 1614556815979.000   0.275
2    10051 1614556816019.000 1614556823979.000   0.429
3    10051 1614556824019.000 1614556831979.000   0.406
4    10051 1614556832019.000 1614556839979.000   0.444
5    10051 1614556840019.000 1614556847979.000   0.348
6    10051 1614556848019.000 1614556855979.000   0.381
7    10051 1614556856019.000 1614556863979.000   0.456
8    10051 1614556864019.000 1614556871979.000   0.362
9    10051 1614556872019.000 1614556879979.000   0.465
10   10051 1614556880019.000 1614556887979.000   0.577
11   10051 1614556888019.000 1614556895979.000   0.305
12   10051 1614556896019.000 1614556903979.000   0.347
13   10051 1614556904019.000 1614556911979.000   0.246
14   10051 1614556912019.000 1614556919939.000   0.340

df_test:

      fl_Key  time_stamp        energy       install_prediction
1007   10051  1614556840299      -1                  -1
491    10051  1614556819659      -1                  -1
1944   10051  1614556877779      -1                  -1
2227   10051  1614556889099      -1                  -1
677    10051  1614556827099      -1                  -1
2944   10051  1614556917779      -1                  -1
799    10051  1614556831979      -1                  -1
2378   10051  1614556895139      -1                  -1
1877   10051  1614556875099      -1                  -1
487    10051  1614556819499      -1                  -1

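I am trying to take fl_Key and time_stamp from the df_test dataframe and use them to look up the 'energy' value in the energy_calculated dataframe. The fl_Key column must match the fl_key column exactly. The time_stamp column must fall between the min_time_stamp and max_time_stamp columns.

fl_Key and fl_key are named differently so I can keep track of which column comes from where.

I have a simple method (I put in the raised exceptions just to make sure it always finds a match):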

def integrateEnergyCalculationData(row, energy_calculations):
  energy_calculations = energy_calculations[(energy_calculations['fl_key'] == row.fl_Key) & (energy_calculations['min_time_stamp'] <= row.time_stamp) & (energy_calculations['max_time_stamp'] >= row.time_stamp)]

  if (len(energy_calculations) == 0):
    raise Exception("No energy data for: " + str(row.fl_Key) + ", " + str(row.time_stamp))
  elif (len(energy_calculations) >= 2):
    raise Exception("Too much energy data for: " + str(row.fl_Key) + ", " + str(row.time_stamp))

  return energy_calculations['energy']

I use apply() to hook them together:

df_test['energy'] = df_test[['time_stamp','fl_Key']].apply(integrateEnergyCalculationData, 1, args=(energy_calculated, ))

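What ends up happening is that the mapping is done for some rows but not for all of them.

My resulting df_test dataframe looks like this (I have a larger version of df_test, but I've cut it down to 10 rows to demonstrate the problem). I picked the 10 rows at random from the larger version, which is why the index numbers are out of order: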

       fl_Key    time_stamp            energy     install_prediction
1007    10051    1614556840299                          -1
491     10051    1614556819659    0.4291915384067029    -1
1944    10051    1614556877779                          -1
2227    10051    1614556889099                          -1
677     10051    1614556827099                          -1
2944    10051    1614556917779                          -1
799     10051    1614556831979                          -1
2378    10051    1614556895139                          -1
1877    10051    1614556875099                          -1
487     10051    1614556819499    0.4291915384067029    -1

What am I missing? Thanks.

Answer 1

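That's odd. I recreated your two dataframes on my side, ran your code, and reproduced your gaps. Then I put a pdb breakpoint right before your return statement: it is returning an object (a one-element Series), not a float! In fact, that is the case for every row. I changed the line to: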

return float(energy_calculations['energy'])

and got the complete dataframe.

    index   fl_Key  time_stamp  energy  install_prediction
0   1007    10051   1614556840299   0.348   -1
1   491     10051   1614556819659   0.429   -1
2   1944    10051   1614556877779   0.465   -1
3   2227    10051   1614556889099   0.305   -1
4   677     10051   1614556827099   0.406   -1
5   2944    10051   1614556917779   0.340   -1
6   799     10051   1614556831979   0.406   -1
7   2378    10051   1614556895139   0.305   -1
8   1877    10051   1614556875099   0.465   -1
9   487     10051   1614556819499   0.429   -1
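A minimal sketch of what the debugger shows, assuming a two-row subset of the question's energy_calculated data (the names matches, as_series and as_scalar are illustrative, not from the original code). Selecting a column from a filtered dataframe yields a Series even when only one row matches, so the original function hands apply() a Series per row instead of a number. Newer pandas releases warn when float() is called directly on a one-element Series, so the sketch pulls the element out with .iloc[0] first.

```python
import pandas as pd

# Two rows copied from the question's energy_calculated dataframe.
energy_calculated = pd.DataFrame({
    "fl_key": [10051, 10051],
    "min_time_stamp": [1614556816019.0, 1614556824019.0],
    "max_time_stamp": [1614556823979.0, 1614556831979.0],
    "energy": [0.429, 0.406],
})

# Same filter as in the question, for one lookup
# (fl_Key=10051, time_stamp=1614556819659).
matches = energy_calculated[
    (energy_calculated["fl_key"] == 10051)
    & (energy_calculated["min_time_stamp"] <= 1614556819659)
    & (energy_calculated["max_time_stamp"] >= 1614556819659)
]

as_series = matches["energy"]                  # what the original function returns
as_scalar = float(matches["energy"].iloc[0])   # a plain Python float

print(type(as_series).__name__)  # Series
print(type(as_scalar).__name__)  # float
```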

Since I don't have access to your dataframes, maybe some type weirdness is going on that you should fix.

Scratch that. You can achieve the same result with energy_calculations['energy'].values[0], without converting to float.
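Putting the .values[0] suggestion together end to end, here is a minimal sketch that assumes small subsets of the question's two dataframes. lookup_energy is an illustrative name for the question's function, and its two separate checks are collapsed into a single ValueError for brevity. Because the function now returns one scalar per row, apply() produces a Series that assigns cleanly to the energy column.

```python
import pandas as pd

# Three intervals copied from the question's energy_calculated dataframe.
energy_calculated = pd.DataFrame({
    "fl_key": [10051, 10051, 10051],
    "min_time_stamp": [1614556816019.0, 1614556824019.0, 1614556840019.0],
    "max_time_stamp": [1614556823979.0, 1614556831979.0, 1614556847979.0],
    "energy": [0.429, 0.406, 0.348],
})

# Four rows copied from the question's df_test, keeping the shuffled index.
df_test = pd.DataFrame(
    {
        "fl_Key": [10051, 10051, 10051, 10051],
        "time_stamp": [1614556840299, 1614556819659, 1614556827099, 1614556831979],
        "energy": [-1, -1, -1, -1],
        "install_prediction": [-1, -1, -1, -1],
    },
    index=[1007, 491, 677, 799],
)


def lookup_energy(row, energy_calculations):
    """Return the single matching energy value as a scalar, not a Series."""
    matches = energy_calculations[
        (energy_calculations["fl_key"] == row.fl_Key)
        & (energy_calculations["min_time_stamp"] <= row.time_stamp)
        & (energy_calculations["max_time_stamp"] >= row.time_stamp)
    ]
    if len(matches) != 1:
        raise ValueError(
            f"Expected one match for {row.fl_Key}, {row.time_stamp}; got {len(matches)}"
        )
    return matches["energy"].values[0]


df_test["energy"] = df_test[["time_stamp", "fl_Key"]].apply(
    lookup_energy, axis=1, args=(energy_calculated,)
)
print(df_test)
```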

Answer 2

Try using merge:

# energy_calculated is the interval dataframe from the question
df_new = energy_calculated.rename(columns={'fl_key': 'fl_Key'})\
                  .merge(df_test[['fl_Key', 'time_stamp']], on='fl_Key', how='left')

print(df_new.loc[df_new['time_stamp']\
      .between(df_new['min_time_stamp'], df_new['max_time_stamp']), 'energy'])

Output:
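To see what this prints on a concrete case, here is a minimal self-contained sketch. It assumes a three-row subset of the question's energy_calculated as the interval dataframe and four rows of df_test; the names in_range and result are illustrative. The merge pairs every interval with every df_test row that shares its fl_Key, and between() then keeps only the pairs whose time_stamp lies inside the interval.

```python
import pandas as pd

# Three intervals copied from the question's energy_calculated dataframe.
energy_calculated = pd.DataFrame({
    "fl_key": [10051, 10051, 10051],
    "min_time_stamp": [1614556816019.0, 1614556824019.0, 1614556840019.0],
    "max_time_stamp": [1614556823979.0, 1614556831979.0, 1614556847979.0],
    "energy": [0.429, 0.406, 0.348],
})

# Four rows copied from the question's df_test.
df_test = pd.DataFrame(
    {
        "fl_Key": [10051, 10051, 10051, 10051],
        "time_stamp": [1614556840299, 1614556819659, 1614556827099, 1614556831979],
    },
    index=[1007, 491, 677, 799],
)

# Every interval is paired with every df_test row that has the same fl_Key.
df_new = energy_calculated.rename(columns={"fl_key": "fl_Key"}).merge(
    df_test[["fl_Key", "time_stamp"]], on="fl_Key", how="left"
)

# Keep only the pairs whose time_stamp falls inside the interval (inclusive).
in_range = df_new["time_stamp"].between(
    df_new["min_time_stamp"], df_new["max_time_stamp"]
)
result = df_new.loc[in_range, ["time_stamp", "energy"]]
print(result)
```

Note that the merge builds one row per (interval, df_test row) pair within each fl_Key before filtering, so the intermediate dataframe can become large when both inputs are big.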
