Create New Columns Using Multiple Conditions And Time Difference

Question

I have a tricky problem with the following DataFrame:

df = pd.DataFrame({'weight': [[200, 190, 188, 180, 170], 
                              [181, 175, 172, 165, 150]],
           'days_since_gym': [[0, 87, 174, 205, 279], 
                              [43, 171, 241, 273, 300]]})

print(df)
              weight               days_since_gym
0  [200, 190, 188, 180, 170]    [0, 91, 174, 205, 279]
1  [181, 175, 172, 165, 150]  [93, 171, 241, 273, 300]

I need to create 4 columns (0-90 days, 91-180 days, 181-270 days, 271-360 days) based on the following conditions:

1) If there are multiple weights in a specific time duration, get the maximum weight in that time duration column.

2) If no weight is present in that time duration, the value for that duration would be 0.

Expected output:

                      weight            days_since_gym  0-90  91-180  181-270  271-360
0  [200, 190, 188, 180, 170]    [0, 87, 174, 205, 279]   200     188      180      170
1  [181, 175, 172, 165, 150]  [93, 171, 241, 273, 300]     0     181      172      165

What is the smartest way to do this? Any suggestions would be greatly appreciated. Thanks!

Answer 1

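You can write a custom function that takes the list of weights, the list of days, a start day, and an end day, and then apply it row by row with the pandas apply function to create each new column. If you haven't used apply before, the basic structure looks like this: df.apply(lambda x: custom_function(...), axis=1). The argument axis=1 ensures that your custom function is applied to each row.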
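In case the apply pattern is unfamiliar, here is a minimal sketch on a made-up toy frame. The names toy and add_pair are hypothetical and exist only for illustration; they are not part of the question's data.

```python
import pandas as pd

# Toy frame, made up for illustration only (not the asker's data)
toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def add_pair(a, b):
    # Hypothetical helper: combines two values taken from the same row
    return a + b

# With axis=1, each row is passed to the lambda as a Series
toy["total"] = toy.apply(lambda row: add_pair(row["a"], row["b"]), axis=1)
print(toy["total"].tolist())
```

Each call sees a single row, so the function can read several columns of that row at once.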

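Since the names of the new columns are made up of the start and end days, you can loop over the start/end day pairs and build each column name from them.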
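As a small sketch of that idea, the start/end pairs and the column names can be derived from one list of boundaries instead of being typed out twice. The variable names here (edges, starts, ends, col_names) are my own, and the sketch assumes whole-number days so that each range starts one day after the previous one ends.

```python
# Upper boundaries of each range, preceded by the first start day
edges = [0, 90, 180, 270, 360]

# Each range after the first starts one day past the previous end
starts = [edges[0]] + [e + 1 for e in edges[1:-1]]
ends = edges[1:]

# Column names follow the "start-end" pattern used in the question
col_names = [f"{s}-{e}" for s, e in zip(starts, ends)]
print(list(zip(starts, ends)))
print(col_names)
```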

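I also noticed that in your question there seem to be some mismatches between the DataFrame you create and the desired output, so I used the values from the desired output as the DataFrame.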

import numpy as np
import pandas as pd

df = pd.DataFrame({'weight': [[200, 190, 188, 180, 170], 
                              [181, 175, 172, 165, 150]],
           'days_since_gym': [[0, 87, 174, 205, 279], 
                              [93, 171, 241, 273, 300]]})

def return_max_weight(weights, days, start_day, end_day):
    # indices of the measurements whose day falls within [start_day, end_day]
    days = np.array(days)
    weights_idx = list(np.where((days >= start_day) & (days <= end_day))[0])
    if len(weights_idx) == 0:
        return 0
    else:
        weight_between_start_and_end = [weights[idx] for idx in weights_idx]
        return max(weight_between_start_and_end)

for start_day, end_day in zip([0, 91, 181, 271],[90, 180, 270, 360]):
    col_name = f"{start_day}-{end_day}"
    df[col_name] = df[['weight','days_since_gym']].apply(
        lambda x: return_max_weight(x['weight'], x['days_since_gym'], start_day, end_day),
        axis=1
    )

Output:

>>> df
                      weight            days_since_gym  0-90  91-180  181-270  271-360
0  [200, 190, 188, 180, 170]    [0, 87, 174, 205, 279]   200     188      180      170
1  [181, 175, 172, 165, 150]  [93, 171, 241, 273, 300]     0     181      172      165
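If the frame is large, a possible alternative to calling apply once per range is to reshape to one row per measurement and let pandas do the bucketing. This is only a sketch, not the approach above. It assumes pandas 1.3 or later (for exploding two columns at once), lists of equal length in each row, and whole-number days. The names edges, labels, long, wide and result are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "weight": [[200, 190, 188, 180, 170], [181, 175, 172, 165, 150]],
    "days_since_gym": [[0, 87, 174, 205, 279], [93, 171, 241, 273, 300]],
})

# Bins are right-closed: (-1, 90], (90, 180], ... which matches
# 0-90, 91-180, ... as long as days are integers (assumption)
edges = [-1, 90, 180, 270, 360]
labels = ["0-90", "91-180", "181-270", "271-360"]

# One row per (weight, day) pair; the original row label is kept in the index
long = df.explode(["weight", "days_since_gym"])
long["weight"] = long["weight"].astype(int)
long["days_since_gym"] = long["days_since_gym"].astype(int)
long["bucket"] = pd.cut(long["days_since_gym"], bins=edges, labels=labels)

# Max weight per original row and bucket; empty buckets become NaN, then 0
wide = (
    long.groupby([long.index, "bucket"], observed=False)["weight"]
    .max()
    .unstack("bucket")
    .fillna(0)
    .astype(int)
)
wide.columns = wide.columns.astype(str)
wide.columns.name = None

result = df.join(wide)
print(result)
```

Passing observed=False keeps buckets that contain no measurement, which is what lets the second condition (0 when no weight is present) be filled in afterwards.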
