How to check a column for a str value, determine whether another column is less/greater than [x], and return a boolean in a newly made column

I have a dataframe that looks like this:

product duration
tire change 01:16:51
oil change 05:06:00
tire change 02:03:04
oil change 06:23:14
oil change 03:40:27

I want to create a new column that returns a boolean based on the two columns:

product duration duration_bool
tire change 01:16:51 True
oil change 01:06:00 True
tire change 04:03:04 False
oil change 02:23:14 False
oil change 03:40:27 False

Is this the correct way to actually use a function on a dataframe? I can't tell whether this really achieves my goal.

def sla_bool_checker(my_var):

    #check if product is a tire change, if it is, check if duration is under 4 hours and return the Boolean in the new column

    if df['product'] == 'tire change' :
        df['duration_bool'] = df['duration'] < pd.Timedelta(4, unit='h')

    #check if product is a oil change, if it is, check if duration is under 2 hours and return the Boolean

    elif df['product'] == 'oil change' :
        df['duration_bool'] < pd.Timedelta(2, unit='h')

I don't know what I'm missing, but this is the error the code raises:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Create a boolean array from your conditions and assign it to the new column.

df['duration'] = df['duration'].apply(pd.Timedelta) # make sure duration has a dtype of Timedelta

df['duration_bool'] = ((df['product'] == 'tire change') & (df['duration'] < pd.Timedelta(4, unit='h'))) | \
((df['product'] == 'oil change') & (df['duration'] < pd.Timedelta(2, unit='h')))

       product        duration  duration_bool
0  tire change 0 days 01:16:51           True
1   oil change 0 days 05:06:00          False
2  tire change 0 days 02:03:04           True
3   oil change 0 days 06:23:14          False
4   oil change 0 days 03:40:27          False

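What this means:

((df['product'] == 'tire change') & (df['duration'] < pd.Timedelta(4, unit='h'))) selects rows where the product is a tire change and the duration is less than 4 hours.

| (or)

((df['product'] == 'oil change') & (df['duration'] < pd.Timedelta(2, unit='h'))) selects rows where the product is an oil change and the duration is less than 2 hours.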
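To make the element-wise logic easier to follow, here is a minimal, self-contained sketch of the same mask built step by step. The intermediate names (is_tire, is_oil, under_4h, under_2h) are only illustrative, and the sample data is copied from the question's first table.

```python
import pandas as pd

# Sample data copied from the question's first table.
df = pd.DataFrame({
    "product": ["tire change", "oil change", "tire change", "oil change", "oil change"],
    "duration": ["01:16:51", "05:06:00", "02:03:04", "06:23:14", "03:40:27"],
})

# Convert the "HH:MM:SS" strings to Timedelta values so they can be compared.
df["duration"] = pd.to_timedelta(df["duration"])

# Each comparison yields a boolean Series with one value per row.
is_tire = df["product"] == "tire change"
is_oil = df["product"] == "oil change"
under_4h = df["duration"] < pd.Timedelta(4, unit="h")
under_2h = df["duration"] < pd.Timedelta(2, unit="h")

# & and | combine the Series row by row. A plain `if` on a whole Series
# asks for a single truth value instead, which is what raises the
# ValueError shown in the question.
df["duration_bool"] = (is_tire & under_4h) | (is_oil & under_2h)

print(df)
```

Rows whose product matches neither condition end up as False, because both halves of the | are False for them.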

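First, the durations in your two examples don't match, which makes it hard to compare the input with the expected output. Please check that next time. You can then use: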

df.loc[df["product"] == "tire change", "duration_bool"] = pd.to_timedelta(df["duration"]) < pd.Timedelta(4, unit="h")
df.loc[df["product"] == "oil change", "duration_bool"] = pd.to_timedelta(df["duration"]) < pd.Timedelta(2, unit="h")

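This sets duration_bool for the selected rows directly to the result of the comparison against pd.Timedelta(...), while pd.to_timedelta(...) makes sure the duration values are timedeltas that can be compared. This gives you: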

|    | product     | duration   | duration_bool   |
|---:|:------------|:-----------|:----------------|
|  0 | tire change | 01:16:51   | True            |
|  1 | oil change  | 01:06:00   | True            |
|  2 | tire change | 04:03:04   | False           |
|  3 | oil change  | 02:23:14   | False           |
|  4 | oil change  | 03:40:27   | False           |
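For anyone who wants to run this end to end, here is a small sketch of the same df.loc idea. It is a variation rather than the exact code above: it parses the durations once, starts the column at False so unlisted products get a definite value, and loops over a limits dict, which is a hypothetical helper introduced here for illustration. The data is copied from the question's second table.

```python
import pandas as pd

# Sample data copied from the question's second (expected-output) table.
df = pd.DataFrame({
    "product": ["tire change", "oil change", "tire change", "oil change", "oil change"],
    "duration": ["01:16:51", "01:06:00", "04:03:04", "02:23:14", "03:40:27"],
})

# Parse the duration strings once instead of once per assignment.
durations = pd.to_timedelta(df["duration"])

# Start from False so products without a rule get a definite value
# instead of NaN.
df["duration_bool"] = False

# Hypothetical per-product limits; extend the dict as needed.
limits = {
    "tire change": pd.Timedelta(4, unit="h"),
    "oil change": pd.Timedelta(2, unit="h"),
}

for product, limit in limits.items():
    mask = df["product"] == product
    # Only the rows selected by the mask are written.
    df.loc[mask, "duration_bool"] = durations[mask] < limit

print(df)
```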

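I figured out that I needed to add a return statement to def sla_bool_checker, and then use apply to apply the returned values to my dataframe. I still don't understand exactly how apply works, but it does work. I wish I could give a more in-depth explanation for anyone who needs it.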

I probably should have used np.where() (I'm still unclear on how to make it work), but @it_is_chris's answer actually worked well for me too! (Thanks, Chris.)
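For readers curious about the np.where() route, here is one possible sketch, not necessarily what the poster had in mind. It nests two np.where calls and, as an alternative, uses np.select; the variable names and the choice of False for unlisted products are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's second table.
df = pd.DataFrame({
    "product": ["tire change", "oil change", "tire change", "oil change", "oil change"],
    "duration": ["01:16:51", "01:06:00", "04:03:04", "02:23:14", "03:40:27"],
})
durations = pd.to_timedelta(df["duration"])

is_tire = df["product"] == "tire change"
is_oil = df["product"] == "oil change"
under_4h = durations < pd.Timedelta(4, unit="h")
under_2h = durations < pd.Timedelta(2, unit="h")

# np.where(condition, value_if_true, value_if_false), evaluated per row.
# The inner call handles oil changes; anything else falls back to False.
df["duration_bool"] = np.where(is_tire, under_4h, np.where(is_oil, under_2h, False))

# np.select expresses the same thing as a list of conditions and choices.
df["duration_bool_select"] = np.select([is_tire, is_oil], [under_4h, under_2h], default=False)

print(df)
```

np.select tends to read better than nested np.where once there are more than two product types.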

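Since then I've kept working on this, because I really wanted to figure out a way to do it with a function. It may not be ideal, but I learned a lot.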

Here is the code I used:

def sla_bool_checker(my_var):
    #check if product is a tire change, if it is, check if duration is under 4 hours and return the Boolean in new column
    if my_var['product'] == 'tire change' :
        return my_var['duration'] < pd.Timedelta(4, unit='h')
    #check if product is an oil change, if it is, check if duration is under 2 hours and return the Boolean
    elif my_var['product'] == 'oil change' :
        return my_var['duration'] < pd.Timedelta(2, unit='h')

Then I used:

df['duration_bool'] = df.apply(sla_bool_checker, axis=1)     
df

which results in:

product duration duration_bool
0 tire change 01:16:51 True
1 oil change 01:06:00 True
2 tire change 04:03:04 False
3 oil change 02:23:14 False
4 oil change 03:40:27 False
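As a rough illustration of why the apply version works where the original function did not: with axis=1, apply calls the function once per row and passes that row as a Series, so row['product'] is a single string and the if test has one unambiguous truth value. The sketch below assumes the duration column has already been converted to timedeltas, and the final return False for unlisted products is an added assumption, not part of the code above.

```python
import pandas as pd

# Sample data copied from the question's second table.
df = pd.DataFrame({
    "product": ["tire change", "oil change", "tire change", "oil change", "oil change"],
    "duration": ["01:16:51", "01:06:00", "04:03:04", "02:23:14", "03:40:27"],
})
df["duration"] = pd.to_timedelta(df["duration"])

def sla_bool_checker(row):
    # `row` is one row of the dataframe, so these are scalar comparisons.
    if row["product"] == "tire change":
        return row["duration"] < pd.Timedelta(4, unit="h")
    elif row["product"] == "oil change":
        return row["duration"] < pd.Timedelta(2, unit="h")
    # Assumption: any other product counts as not meeting the limit.
    return False

# One function call per row; the results are collected into a Series.
via_apply = df.apply(sla_bool_checker, axis=1)

# Roughly the same thing written as an explicit loop over rows.
via_loop = pd.Series(
    [sla_bool_checker(row) for _, row in df.iterrows()],
    index=df.index,
)

df["duration_bool"] = via_apply
print(df)
```

Because the function runs in Python once per row, this is usually slower on large frames than the boolean-mask approaches, though it can be easier to read.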