Complex Groupby Pandas Operation to Replace For Loops and If Statements

I have a complex groupby problem that I need some help with.

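I have the names of drivers, each of whom has driven several cars over time. Each time they turn on the car and drive, I capture the cycles and hours, which are transmitted remotely.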

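What I am trying to do is use grouping to see when a driver gets a new car. I am using Car_Cycles and Car_Hours to watch for a reset (a new car). The hours and cycles for each driver are tabulated in ascending order until there is a new car and the counters reset. I want to treat each car as its own sequence, but logically a car can only be identified by a cycle/hour reset.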

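I did this on the dataframe using a for loop with if statements, and processing takes several hours. I have several hundred thousand rows, each containing roughly 20 columns.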

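My data comes from sensors over a moderately reliable connection, so I want to filter using the following criterion: a new group is valid only when both Car_Hours and Car_Cycles are lower than the last row of the previous group for 2 consecutive rows. Using both outputs and checking two rows of change is enough to filter out all the erroneous data.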

If someone could show me how to solve for Car_Group quickly without my cumbersome for loops and if statements, I would greatly appreciate it.

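Also, for the very adventurous, I have added my original for loop with the if statements below. Note that I do some other data analysis/tracking within each group to look at other behavior of the car. If you dare to look at that code and show me an efficient Pandas replacement, all the more kudos.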

name  Car_Hours  Car_Cycles    Car_Group     DeltaH
jan   101         404              1            55
jan   102         405              1            55
jan   103         406              1            56
jan   104         410              1            55
jan   105         411              1            56
jan     0          10              2            55 
jan     1          12              2            58
jan     2          14              2            57
jan     3          20              2            59
jan     4          26              2            55
jan    10          36              2            56
jan    15          42              2            57
jan    27          56              2            57
jan   100          61              2            58 
jan   500          68              2            58
jan     2           4              3            56
jan     3          15              3            57
pete  190          21              1            54
pete  211          29              1            58
pete  212          38              1            55
pete  304          43              1            56
pete   14          20              2            57
pete   15          27              2            57 
pete   36          38              2            58
pete  103          47              2            55
mike 1500        2001              1            55
mike 1512        2006              1            59
mike 1513        2012              1            58  
mike 1515        2016              1            57
mike 1516        2020              1            55 
mike 1517        2024              1            57
..............

for i in range(len(file)):
    if i == 0:

        DeltaH_limit = 57

        car_thresholds = 0
        car_threshold_counts = 0
        car_change_true = 0         
        car_change_index_loc = i

        total_person_thresholds = 0
        person_alert_count = 0
        person_car_count = 1
        person_car_change_count = 0

        total_fleet_thresholds = 0
        fleet_alert_count = 0
        fleet_car_count = 1
        fleet_car_change_count = 0

        if  float(file['Delta_H'][i]) >= DeltaH_limit:
            car_threshold_counts += 1
            car_thresholds += 1
            total_person_thresholds += 1
            total_fleet_thresholds += 1


    elif i == 1:
        if  float(file['Delta_H'][i]) >= DeltaH_limit:
            car_threshold_counts += 1
            car_thresholds += 1
            total_person_thresholds += 1
            total_fleet_thresholds += 1

    elif i > 1:
        if file['name'][i] == file['name'][i-1]: #is same person?
            if  float(file['Delta_H'][i]) >= DeltaH_limit:
                car_threshold_counts += 1
                car_thresholds += 1
                total_person_thresholds += 1
                total_fleet_thresholds += 1
            else:
                car_threshold_counts = 0
            if car_threshold_counts == 3:
                car_threshold_counts += 1
                person_alert_count += 1
                fleet_alert_count += 1

            #Car Change??  Compare cycles and hours to look for reset
            if i+1 < len(file):
                if file['name'][i] == file['name'][i+1] == file['name'][i-1]:
                    if int(file['Car_Cycles'][i]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i]) < int(file['Car_Hours'][i-1]):
                        if int(file['Car_Cycles'][i+1]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i+1]) < int(file['Car_Hours'][i-1]):

                            car_thresholds = 0
                            car_change_true = 1
                            car_threshold_counts = 0

                            old_pump_first_flight = car_change_index_loc
                            car_change_index_loc = i
                            old_pump_last_flight = i-1

                            person_car_count += 1
                            person_car_change_count += 1                                

                            fleet_car_count += 1
                            fleet_car_change_count += 1



                            print(i,  ' working hard!')

                        else:
                            car_change_true = 0
                    else:
                        car_change_true = 0
                else:
                    car_change_true = 0
            else:
                car_change_true = 0

        else:  # different person: reset the per-car and per-person counters
            car_thresholds = 0              
            car_threshold_counts = 0
            car_change_index_loc = i                
            car_change_true = 0         

            total_person_thresholds = 0
            person_alert_count = 0
            person_car_count = 1
            person_car_change_count = 0


            if  float(file['Delta_H'][i]) >= DeltaH_limit:
                car_threshold_counts += 1
                car_thresholds += 1
                total_person_thresholds += 1
                total_fleet_thresholds += 1

    file.loc[i, 'car_thresholds'] = car_thresholds
    file.loc[i, 'car_threshold_counts'] = car_threshold_counts
    file.loc[i, 'car_change_true'] = car_change_true    
    file.loc[i, 'car_change_index_loc'] = car_change_index_loc  

    file.loc[i, 'total_person_thresholds'] = total_person_thresholds
    file.loc[i, 'person_alert_count'] = person_alert_count
    file.loc[i, 'person_car_count'] = person_car_count
    file.loc[i, 'person_car_change_count'] = person_car_change_count

    file.loc[i, 'Total_Fleet_Thresholds'] = total_fleet_thresholds
    file.loc[i, 'Fleet_Alert_Count'] = fleet_alert_count
    file.loc[i, 'fleet_car_count'] = fleet_car_count
    file.loc[i, 'fleet_car_change_count'] = fleet_car_change_count

IIUC, all we need to do is reproduce Car_Group, and we can take advantage of a few tricks:

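def twolow(s):
    # True where this row AND the next row are both below the previous row,
    # i.e. the drop persists for two consecutive readings.
    return (s < s.shift()) & (s.shift(-1) < s.shift())

new_hour = twolow(df["Car_Hours"])
new_cycle = twolow(df["Car_Cycles"])
# A change of name always starts a new group.
new_name = df["name"] != df["name"].shift()
name_group = new_name.cumsum()
new_cargroup = new_name | (new_hour & new_cycle)
# Running count of group starts over the whole frame ...
cargroup_without_reset = new_cargroup.cumsum()
# ... then offset so that the numbering restarts at 1 for each name.
cargroup = (cargroup_without_reset -
            cargroup_without_reset.groupby(name_group).transform("min") + 1)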

Trick #1: if you want to find out where a transition occurs, compare something to a shifted version of itself.
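
A minimal sketch of trick #1 on a toy Series (the values are invented for illustration and are not taken from the data above): comparing a Series with its shifted copy flags every row that is lower than the row before it.

```python
import pandas as pd

# Toy readings, invented for illustration: the counter resets at position 3.
s = pd.Series([101, 102, 103, 0, 1, 2])

# s.shift() moves every value down one row, so row i is compared with row i-1.
# The first row has no predecessor (NaN), and any comparison with NaN is False.
dropped = s < s.shift()

print(dropped.tolist())
# [False, False, False, True, False, False]
```

twolow applies the same comparison twice, once for the current row and once for the next row via shift(-1), so a single bad reading does not count as a reset.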

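Trick #2: if you have a True wherever a new group begins, then taking the cumulative sum of it gives you a series in which every group has an integer associated with it.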
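
A small sketch of trick #2, again on made-up flags rather than the real data: the cumulative sum of a boolean "group starts here" Series yields one integer label per group.

```python
import pandas as pd

# True marks the first row of each group (hypothetical flags for illustration).
starts = pd.Series([True, False, False, True, False, True])

# Each True adds 1 to the running total, so all rows up to the next True
# share the same integer label.
group_id = starts.cumsum()

print(group_id.tolist())
# [1, 1, 1, 2, 2, 3]
```

Subtracting the per-name minimum and adding 1, as in the last step above, makes the labels restart at 1 for each driver.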

The above gives me

>>> cargroup.head(10)
0    1
1    1
2    1
3    1
4    1
5    2
6    2
7    2
8    2
9    2
dtype: int32
>>> (cargroup == df.Car_Group).all()
True
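
Once Car_Group exists, some of the running counters from the original loop can probably be vectorized in the same spirit. The following is only a sketch under assumptions: the frame is a small made-up sample, the column consecutive_hits is a hypothetical name, and the counters follow my reading of the loop (a running count of Delta_H >= DeltaH_limit per car and per person, plus the length of the current run of consecutive hits). It does not try to reproduce every edge case of the loop, such as the alert bookkeeping at three consecutive hits.

```python
import pandas as pd

# Small made-up frame; Car_Group is assumed to be already computed as above.
df = pd.DataFrame({
    "name":      ["jan", "jan", "jan", "jan", "jan", "pete", "pete"],
    "Car_Group": [1, 1, 1, 2, 2, 1, 1],
    "Delta_H":   [55, 57, 58, 59, 55, 57, 58],
})
DeltaH_limit = 57  # same threshold as in the loop

# 1 where the reading reaches the threshold, 0 otherwise.
hits = (df["Delta_H"] >= DeltaH_limit).astype(int)

# Running count of threshold hits per car and per person.
df["car_thresholds"] = hits.groupby([df["name"], df["Car_Group"]]).cumsum()
df["total_person_thresholds"] = hits.groupby(df["name"]).cumsum()

# Length of the current run of consecutive hits within a car:
# a new run starts whenever the hit flag flips or the car changes
# (tricks #1 and #2 again).
car_id = df.groupby(["name", "Car_Group"], sort=False).ngroup()
run_id = ((hits != hits.shift()) | (car_id != car_id.shift())).cumsum()
df["consecutive_hits"] = hits.groupby(run_id).cumsum()

print(df)
```

Each counter is a grouped cumulative sum, so the whole frame is processed in a few vectorized passes instead of one Python-level iteration per row.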