Is there a contiguous group labelling capability in Pandas

I've been puzzling over this problem for a while, and I'm wondering if there is some "Pandas-ish" way to get there. I have a simple DataFrame with two columns, PivotHigh and PivotLow, representing high and low values respectively. I need to "join the dots" in pairs, from each lowest point to the following highest point. It sounds simple, but there are some complications:

In the chart I have highlighted the "runs" in yellow, and the values that should be selected (the max/min) in green.

This is what I have been trying:

import pandas as pd

def find_swings(df_pivots,
                column_names={"PivotHigh": "PivotHigh", "PivotLow": "PivotLow"}):
    """
    Given a DataFrame of PivotHigh and PivotLow, return a DataFrame describing
    the pivots joined together.

    Parameters:
        df_pivots (DataFrame): Contains pivot information
            PivotHigh   - A series of pivot highs
            PivotLow    - A series of pivot lows

        column_names (dictionary): Specify the column names for the relevant columns

    Returns:
        DataFrame:
            StartBar        - Start Date of the Bar
            EndBar          - End Date of the Bar
            StartValue      - Starting value of the swing
            EndValue        - Ending value of the swing
            Diff            - EndValue - StartValue
            SwingType       - "Up" | "Down"
            N_Bars          - Number of Bars
            Gradient        - (EndValue-StartValue)/N_Bars
            Angle           - Angle of the Swing in Radians
            Magnitude       - Magnitude of the Swing: Sqrt(Diff^2 + N_Bars^2)

            (Next Phase)
            ClosePriceArea  - Close Price Area: The unsigned area enclosed by the swing
                                Lower means tight fit on trend
    """
    df_ret = pd.DataFrame()
    df = df_pivots.reindex()
    # Start from whichever pivot column is populated first.
    current_pivot = ("PivotLow"
                     if df["PivotLow"].first_valid_index() <= df["PivotHigh"].first_valid_index()
                     else "PivotHigh")
    alternate_pivot = "PivotHigh" if current_pivot == "PivotLow" else "PivotLow"

    while True:
        # The current run ends at the first row where the alternate pivot is populated.
        swing_end = df.loc[df[alternate_pivot].notnull()].index[0]
        df_trimmed = df.loc[:swing_end]
        if current_pivot == "PivotLow":
            df_ret["StartDate"] = df_trimmed.loc[df_trimmed[current_pivot].idxmin()]["Date"]
            #....
        else:
            df_ret["StartDate"] = df_trimmed.loc[df_trimmed[current_pivot].idxmax()]["Date"]
            #....

But from there things seemed to snowball out of control.
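(Aside: the derived columns listed in the docstring are plain arithmetic once the start/end pairs exist. A minimal sketch, using a hypothetical `pairs` DataFrame with made-up `StartValue`/`EndValue` values and a pre-computed `N_Bars` bar count, neither of which come from the code above:)

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per swing, with a pre-computed bar count.
pairs = pd.DataFrame({
    "StartValue": [7195.01, 7238.14],
    "EndValue":   [7238.14, 7112.27],
    "N_Bars":     [10, 14],
})

pairs["Diff"] = pairs["EndValue"] - pairs["StartValue"]
pairs["SwingType"] = np.where(pairs["Diff"] >= 0, "Up", "Down")
pairs["Gradient"] = pairs["Diff"] / pairs["N_Bars"]
# Angle of the swing in radians, treating N_Bars as the horizontal leg.
pairs["Angle"] = np.arctan2(pairs["Diff"], pairs["N_Bars"])
pairs["Magnitude"] = np.sqrt(pairs["Diff"] ** 2 + pairs["N_Bars"] ** 2)
```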

Is there a way to introduce a new sequential label column over the "runs", so that I can then perform max/min aggregations on them?
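(For reference, one common pandas idiom for exactly this kind of sequential run label is compare-shift-cumsum. A toy sketch on a made-up Series, not tied to the data below:)

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, np.nan, 3.0, 4.0, np.nan, 5.0])

# True wherever the "is-NaN" state flips relative to the previous row;
# cumsum turns each flip into a new sequential run label.
run_id = s.isna().ne(s.isna().shift()).cumsum()

# Aggregate per run; all-NaN runs aggregate to NaN and are dropped.
out = s.groupby(run_id).max().dropna()
```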

Or any other general advice/approaches?

Thanks.

The data:

Date,PivotHigh,PivotLow
2020-01-01 02:45:00,,7210.59
2020-01-01 04:30:00,,7195.01
2020-01-01 06:30:00,,7204.53
2020-01-01 07:15:00,7209.15,
2020-01-01 11:15:00,7215.0,
2020-01-01 11:30:00,,7198.99
2020-01-01 14:30:00,,7200.19
2020-01-01 15:30:00,7238.14,
2020-01-01 17:15:00,,7221.18
2020-01-01 18:15:00,,7219.01
2020-01-01 20:30:00,,7209.6
2020-01-01 21:00:00,7222.6,
2020-01-01 21:45:00,,7210.73
2020-01-02 00:45:00,7186.18,7179.77
2020-01-02 06:30:00,,7112.27
2020-01-02 10:30:00,7156.65,7139.93
2020-01-02 14:45:00,7137.23,
2020-01-02 19:45:00,6976.17,6965.0
2020-01-02 22:45:00,,6954.34
2020-01-03 04:30:00,7281.24,
2020-01-03 07:15:00,7220.0,
2020-01-03 10:15:00,7361.7,7316.31
2020-01-03 12:15:00,7339.31,7329.66
2020-01-03 13:45:00,7369.61,7319.0
2020-01-03 16:15:00,7401.2,
2020-01-03 16:45:00,,7358.18
2020-01-03 20:00:00,7349.44,
2020-01-03 21:15:00,,7319.45
2020-01-03 23:30:00,,7315.0
2020-01-04 00:00:00,7339.99,
2020-01-04 03:00:00,7332.89,
2020-01-04 05:30:00,7359.99,
2020-01-04 07:00:00,,7341.89
2020-01-04 07:15:00,7354.98,
2020-01-04 10:15:00,7357.39,7343.95
2020-01-04 11:45:00,,7319.19
2020-01-04 15:15:00,7317.58,
2020-01-04 15:30:00,,7304.6
2020-01-04 18:00:00,,7328.74
2020-01-04 18:30:00,7398.0,
2020-01-04 21:30:00,7358.51,7341.03
2020-01-05 02:00:00,7483.28,7441.4
2020-01-05 03:45:00,,7460.0
2020-01-05 04:00:00,7488.87,
2020-01-05 06:00:00,7478.24,
2020-01-05 07:45:00,,7462.22
2020-01-05 08:00:00,7475.0,
2020-01-05 13:00:00,7460.03,7446.36
2020-01-05 16:30:00,7479.52,
2020-01-05 17:45:00,,7460.0

Let's try cumsum on isna to identify the contiguous non-NaN blocks, then use groupby().ngroup() to give each block an ordinal label:

import numpy as np
import pandas as pd

# Assumes df has Date as its index and the columns PivotHigh, PivotLow.
new_df = pd.DataFrame()
for col, op in zip(df.columns, [np.max, np.min]):
    s = df[col]

    # Every NaN increments the counter, so each contiguous non-NaN run
    # sits inside its own cumsum block.
    blocks = s.isna().cumsum()
    # ngroup() renumbers the surviving (non-NaN) blocks 0, 1, 2, ...
    non_na_blocks = s.dropna().groupby(blocks).ngroup()
    new_df[col] = s.dropna().groupby(non_na_blocks).agg(op)

Output (first two blocks):

   PivotHigh  PivotLow
0    7215.00   7195.01
1    7238.14   7198.99
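Putting it together end to end, here is a self-contained sketch over the first few rows of the sample data. The final pairing step assumes the series starts with a low run, so row i simply pairs low-run i with high-run i; the real pairing logic would need to handle runs where both columns are populated on the same bar.

```python
import io

import numpy as np
import pandas as pd

# Illustrative subset of the sample data.
csv = """Date,PivotHigh,PivotLow
2020-01-01 02:45:00,,7210.59
2020-01-01 04:30:00,,7195.01
2020-01-01 06:30:00,,7204.53
2020-01-01 07:15:00,7209.15,
2020-01-01 11:15:00,7215.0,
2020-01-01 11:30:00,,7198.99
2020-01-01 14:30:00,,7200.19
2020-01-01 15:30:00,7238.14,
"""
df = pd.read_csv(io.StringIO(csv), parse_dates=["Date"]).set_index("Date")

# Label contiguous non-NaN runs per column and aggregate each run.
new_df = pd.DataFrame()
for col, op in zip(df.columns, ["max", "min"]):
    s = df[col]
    blocks = s.isna().cumsum()              # each NaN starts a new block
    non_na_blocks = s.dropna().groupby(blocks).ngroup()
    new_df[col] = s.dropna().groupby(non_na_blocks).agg(op)

# Join the dots: pair each low-run minimum with the next high-run maximum.
swings = pd.DataFrame({
    "StartValue": new_df["PivotLow"],
    "EndValue": new_df["PivotHigh"],
})
swings["Diff"] = swings["EndValue"] - swings["StartValue"]
```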