Python: creating new "interpolated" rows based on a specific field in Pandas

Question

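I am working with basketball data and want to do something similar to resampling, but without a datetime index and without unique index values.

Here is a sample of the data I have:

As you may know, in basketball you can score at most 3 points at a time. However, my analysis requires every point to be recorded individually, so instead of a jump from 0 to 2 I would like a row at 0, a row at 1, and another row at 2, with the fields other than "score" and "opp_score" taking their values from the "2" row.

This is what I would like it to look like. I left the index blank on the new rows to make it easier to see what is happening:

All the interpolation techniques I have found assume that the rows already exist with NaN values, or that there is a datetime field to resample on, neither of which is the case here. I can see how to solve this with a clumsy loop that uses .shift() to check where the score jumps, but I suspect there is a better way that I am not seeing.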

Answer 1

Here is a way to do it without a clumsy loop. If you are familiar with R, especially dplyr/tidyr, you can do something similar in Python with datar, which is backed by pandas:

>>> from datar.all import (
...     f, lag, tribble, mutate, uncount, replace_na, pmax, 
...     group_by, rev, if_else, select, ends_with
... )
>>> 
>>> df = tribble(
...     f.period, f.elapsed, f.score, f.opp_score,
...     1,        0,         0,       0,
...     1,        32,        2,       0,
...     1,        72,        5,       0,
...     1,        127,       5,       3,
...     1,        148,       7,       3
... )
>>> (
...     df 
...     >> mutate(   #1
...         score_diff=f.score-lag(f.score), 
...         opp_score_diff=f.opp_score-lag(f.opp_score),
...         count=replace_na(pmax(f.score_diff, f.opp_score_diff), 1)
...     ) 
...     >> replace_na(0) #2
...     >> group_by(f.elapsed) #2
...     >> uncount(f.count, _id="id") #3
...     >> mutate( #4
...         id=rev(f.id)-1, 
...         score=if_else(f.score_diff>0, f.score-f.id, f.score),
...         opp_score=if_else(f.opp_score_diff>0, f.opp_score-f.id, f.opp_score)
...     )
...     >> select(~f.id, ~ends_with("_diff")) #5
... 
... )
    elapsed  opp_score  period   score
    <int64>    <int64> <int64> <int64>
0         0          0       1       0
1        32          0       1       1
2        32          0       1       2
3        72          0       1       3
4        72          0       1       4
5        72          0       1       5
6       127          1       1       5
7       127          2       1       5
8       127          3       1       5
9       148          3       1       6
10      148          3       1       7

[Groups: elapsed (n=5)]

Some explanation:

The first mutate() (#1) creates a df like this:

   period  elapsed   score  opp_score  score_diff  opp_score_diff     count
  <int64>  <int64> <int64>    <int64>   <float64>       <float64> <float64>
0       1        0       0          0         NaN             NaN       1.0
1       1       32       2          0         2.0             0.0       2.0
2       1       72       5          0         3.0             0.0       3.0
3       1      127       5          3         0.0             3.0       3.0
4       1      148       7          3         2.0             0.0       2.0

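This adds a few helper columns: count says how many rows each original row should expand into, and the *_diff columns act as indicators for filling in the scores later.

replace_na(0) and group_by(f.elapsed) (#2) replace NaN with 0 and group the df by the elapsed column, so that the later steps operate within each elapsed value.
uncount() (#3) expands the df according to the count column, which says how many copies of each row to produce. The id column numbers the copies within each original row. That gives a df like this: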

    elapsed  opp_score  opp_score_diff  period   score  score_diff      id
    <int64>    <int64>       <float64> <int64> <int64>   <float64> <int64>
0         0          0             0.0       1       0         0.0       1
1        32          0             0.0       1       2         2.0       1
2        32          0             0.0       1       2         2.0       2
3        72          0             0.0       1       5         3.0       1
4        72          0             0.0       1       5         3.0       2
5        72          0             0.0       1       5         3.0       3
6       127          3             3.0       1       5         0.0       1
7       127          3             3.0       1       5         0.0       2
8       127          3             3.0       1       5         0.0       3
9       148          3             0.0       1       7         2.0       1
10      148          3             0.0       1       7         2.0       2

[Groups: elapsed (n=5)]

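Now the task is to fill in the correct score and opp_score values. The idea is to "un-cumsum" the jumps. For example, opp_score jumps from 0 to 3, so after the existing 0 row we want the values 1, 2, and 3 on separate rows. Using the id column and the *_diff columns as indicators, the second mutate() (#4) does this: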

    elapsed  opp_score  opp_score_diff  period   score  score_diff      id
    <int64>    <int64>       <float64> <int64> <int64>   <float64> <int64>
0         0          0             0.0       1       0         0.0       0
1        32          0             0.0       1       1         2.0       1
2        32          0             0.0       1       2         2.0       0
3        72          0             0.0       1       3         3.0       2
4        72          0             0.0       1       4         3.0       1
5        72          0             0.0       1       5         3.0       0
6       127          1             3.0       1       5         0.0       2
7       127          2             3.0       1       5         0.0       1
8       127          3             3.0       1       5         0.0       0
9       148          3             0.0       1       6         2.0       1
10      148          3             0.0       1       7         2.0       0

[Groups: elapsed (n=5)]

Finally, select() (#5) drops the helper columns id and *_diff.
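
For readers who prefer to stay in plain pandas, the same steps can be sketched roughly as follows. This is an illustrative translation, not the answer's own code: the variable names work, count, back and out are assumptions, and Index.repeat together with groupby().cumcount() stands in for uncount() and its id column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "period": [1, 1, 1, 1, 1],
    "elapsed": [0, 32, 72, 127, 148],
    "score": [0, 2, 5, 5, 7],
    "opp_score": [0, 0, 0, 3, 3],
})

# Steps 1 and 2: per-row jumps; the first row has no predecessor, so its diff is 0.
work = df.assign(
    score_diff=df["score"].diff().fillna(0).astype(int),
    opp_score_diff=df["opp_score"].diff().fillna(0).astype(int),
)
# Number of copies per row: the larger of the two jumps, but at least one.
count = work[["score_diff", "opp_score_diff"]].max(axis=1).clip(lower=1)

# Step 3: repeat each row `count` times (assumed counterpart of uncount()).
out = work.loc[work.index.repeat(count)].copy()

# Step 4: within each original row, count down to 0 (e.g. 2, 1, 0) and
# subtract that from whichever score actually jumped.
back = out.groupby(level=0).cumcount(ascending=False).to_numpy()
out["score"] = out["score"] - np.where(out["score_diff"] > 0, back, 0)
out["opp_score"] = out["opp_score"] - np.where(out["opp_score_diff"] > 0, back, 0)

# Step 5: drop the helper columns and renumber the rows.
out = out.drop(columns=["score_diff", "opp_score_diff"]).reset_index(drop=True)
print(out)
```

Like the datar version, this sketch assumes scores never decrease from one row to the next.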

Answer 2

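You can use reindex with a suitable range and then interpolate.

Create a unique index from the total score. After reindexing with the range, backfill the elapsed column, since its values should not be interpolated.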

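import numpy as np

# Index by the combined score, then add an all-NaN row for every missing total.
df = df.set_index(df.score + df.opp_score)
df = df.reindex(np.arange(df.index.min(), df.index.max()+1))
# Backfill elapsed so it is copied from the next recorded row, not interpolated.
df['elapsed'] = df.elapsed.bfill()
df.interpolate().astype(int)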

Output:

    period  elapsed  score  opp_score
0        1        0      0          0
1        1       32      1          0
2        1       32      2          0
3        1       72      3          0
4        1       72      4          0
5        1       72      5          0
6        1      127      5          1
7        1      127      5          2
8        1      127      5          3
9        1      148      6          3
10       1      148      7          3

Code to generate the DataFrame used above:

import pandas as pd

df = pd.DataFrame({
    'period': [1,1,1,1,1],
    'elapsed': [0,32,72,127,148],
    'score': [0,2,5,5,7],
    'opp_score': [0,0,0,3,3]
})
df

Output:

   period  elapsed  score  opp_score
0       1        0      0          0
1       1       32      2          0
2       1       72      5          0
3       1      127      5          3
4       1      148      7          3
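
Putting the pieces of this answer together, a self-contained sketch might look like the following. The function name expand_points and the uniqueness check are assumptions added for illustration; the check is there because reindex raises an error when the index contains duplicates.

```python
import numpy as np
import pandas as pd


def expand_points(df):
    """Hypothetical helper: one row per point via reindex + interpolate."""
    total = df["score"] + df["opp_score"]
    if not total.is_unique:
        # reindex cannot work on an index with duplicate labels.
        raise ValueError("combined score must be unique per row")
    out = df.set_index(total)
    # One row for every total between the first and last recorded value.
    out = out.reindex(np.arange(out.index.min(), out.index.max() + 1))
    # elapsed is copied from the next recorded row instead of interpolated.
    out["elapsed"] = out["elapsed"].bfill()
    # Linear interpolation fills in the intermediate scores.
    return out.interpolate().astype(int)


df = pd.DataFrame({
    "period": [1, 1, 1, 1, 1],
    "elapsed": [0, 32, 72, 127, 148],
    "score": [0, 2, 5, 5, 7],
    "opp_score": [0, 0, 0, 3, 3],
})
result = expand_points(df)
print(result)
```

The interpolated values come out as whole numbers only when a single team's score changes between consecutive rows. If both scores changed at once, the interpolation would yield fractions that astype(int) silently truncates.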
