Complex Groupby Pandas 操作替换 For 循环和 If 语句
Complex Groupby Pandas Operation to Replace For Loops and If Statements
我有一组复杂的团体问题需要帮助。
我有 driver 的名字,随着时间的推移,他们每个人都开过好几辆车。每次他们打开汽车并开车时,我都会捕获周期和小时数,这些信息会被远程传输。
我想做的是使用分组来查看 driver 何时获得新车。
我正在使用 Car_Cycles 和 Car_Hours 来监视重置(新车)。每个 driver 的小时数和周期按升序制成表格,直到有新车并重新设置。我想把每辆车做成一个序列,但逻辑上只能通过cycle/hour重置来识别汽车。
我使用带有 if 语句的 for 循环在数据帧上执行此操作,处理时间需要几个小时。我有几十万行,每行包含大约 20 列。
我的数据来自通过中等可靠连接的传感器,因此我想使用以下条件进行过滤:只有当 Car_Hours 和 Car_Cycles 都小于前一个时,新组才有效连续 2 行的组的最后一行。使用两个输出并检查两行更改足以过滤所有错误数据。
如果有人能告诉我如何在不使用繁琐的 for 循环和 if 语句的情况下快速解决 Car_Group,我将不胜感激。
此外,对于那些非常有冒险精神的人,我在下面添加了带有 if 语句的原始 for 循环。请注意,我在每个组中做了一些其他数据 analysis/tracking 以查看汽车的其他行为。如果您敢于查看该代码并向我展示一个有效的 Pandas 替代品,那就更值得称赞了。
name Car_Hours Car_Cycles Car_Group DeltaH
jan 101 404 1 55
jan 102 405 1 55
jan 103 406 1 56
jan 104 410 1 55
jan 105 411 1 56
jan 0 10 2 55
jan 1 12 2 58
jan 2 14 2 57
jan 3 20 2 59
jan 4 26 2 55
jan 10 36 2 56
jan 15 42 2 57
jan 27 56 2 57
jan 100 61 2 58
jan 500 68 2 58
jan 2 4 3 56
jan 3 15 3 57
pete 190 21 1 54
pete 211 29 1 58
pete 212 38 1 55
pete 304 43 1 56
pete 14 20 2 57
pete 15 27 2 57
pete 36 38 2 58
pete 103 47 2 55
mike 1500 2001 1 55
mike 1512 2006 1 59
mike 1513 2012 1 58
mike 1515 2016 1 57
mike 1516 2020 1 55
mike 1517 2024 1 57
..............
for i in range(len(file)):
if i == 0:
DeltaH_limit = 57
car_thresholds = 0
car_threshold_counts = 0
car_threshold_counts = 0
car_change_true = 0
car_change_index_loc = i
total_person_thresholds = 0
person_alert_count = 0
person_car_count = 1
person_car_change_count = 0
total_fleet_thresholds = 0
fleet_alert_count = 0
fleet_car_count = 1
fleet_car_change_count = 0
if float(file['Delta_H'][i]) >= DeltaH_limit:
car_threshold_counts += 1
car_thresholds += 1
total_person_thresholds += 1
total_fleet_thresholds += 1
elif i == 1:
if float(file['Delta_H'][i]) >= DeltaH_limit:
car_threshold_counts += 1
car_thresholds += 1
total_person_thresholds += 1
total_fleet_thresholds += 1
elif i > 1:
if file['name'][i] == file['name'][i-1]: #is same person?
if float(file['Delta_H'][i]) >= DeltaH_limit:
car_threshold_counts += 1
car_thresholds += 1
total_person_thresholds += 1
total_fleet_thresholds += 1
else:
car_threshold_counts = 0
if car_threshold_counts == 3:
car_threshold_counts += 1
person_alert_count += 1
fleet_alert_count += 1
#Car Change?? Compare cycles and hours to look for reset
if i+1 < len(file):
if file['name'][i] == file['name'][i+1] == file['name'][i-1]:
if int(file['Car_Cycles'][i]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i]) < int(file['Car_Hours'][i-1]):
if int(file['Car_Cycles'][i+1]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i]) < int(file['Car_Hours'][i-1]):
car_thresholds = 0
car_change_true = 1
car_threshold_counts = 0
car_threshold_counts = 0
old_pump_first_flight = car_change_index_loc
car_change_index_loc = i
old_pump_last_flight = i-1
person_car_count += 1
person_car_change_count += 1
fleet_car_count += 1
fleet_car_change_count += 1
print(i, ' working hard!')
else:
car_change_true = 0
else:
car_change_true = 0
else:
car_change_true = 0
else:
car_change_true = 0
else: #new car
car_thresholds = 0
car_threshold_counts = 0
car_threshold_counts = 0
car_change_index_loc = i
car_change_true = 0
total_person_thresholds = 0
person_alert_count = 0
person_car_count = 1
person_car_change_count = 0
if float(file['Delta_H'][i]) >= DeltaH_limit:
car_threshold_counts += 1
car_thresholds += 1
total_person_thresholds += 1
total_fleet_thresholds += 1
file.loc[i, 'car_thresholds'] = car_thresholds
file.loc[i, 'car_threshold_counts'] = car_threshold_counts
file.loc[i, 'car_threshold_counts'] = car_threshold_counts
file.loc[i, 'car_change_true'] = car_change_true
file.loc[i, 'car_change_index_loc'] = car_change_index_loc
file.loc[i, 'total_person_thresholds'] = total_person_thresholds
file.loc[i, 'person_alert_count'] = person_alert_count
file.loc[i, 'person_car_count'] = person_car_count
file.loc[i, 'person_car_change_count'] = person_car_change_count
file.loc[i, 'Total_Fleet_Thresholds'] = total_fleet_thresholds
file.loc[i, 'Fleet_Alert_Count'] = fleet_alert_count
file.loc[i, 'fleet_car_count'] = fleet_car_count
file.loc[i, 'fleet_car_change_count'] = fleet_car_change_count
IIUC,我们需要做的就是重现Car_Group,我们可以利用一些技巧:
def twolow(s):
return (s < s.shift()) & (s.shift(-1) < s.shift())
new_hour = twolow(df["Car_Hours"])
new_cycle = twolow(df["Car_Cycles"])
new_name = df["name"] != df["name"].shift()
name_group = new_name.cumsum()
new_cargroup = new_name | (new_hour & new_cycle)
cargroup_without_reset = new_cargroup.cumsum()
cargroup = (cargroup_without_reset -
cargroup_without_reset.groupby(name_group).transform(min) + 1)
技巧 #1:如果您想找出转换发生的位置,请将某物与其自身的转换版本进行比较。
技巧 #2:如果您在每个新组开始的地方有一个 True,当您对其求和时,您会得到一个序列,其中每个组都有一个与之关联的整数。
以上给了我
>>> cargroup.head(10)
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
dtype: int32
>>> (cargroup == df.Car_Group).all()
True
我有一组复杂的团体问题需要帮助。
我有 driver 的名字,随着时间的推移,他们每个人都开过好几辆车。每次他们打开汽车并开车时,我都会捕获周期和小时数,这些信息会被远程传输。
我想做的是使用分组来查看 driver 何时获得新车。 我正在使用 Car_Cycles 和 Car_Hours 来监视重置(新车)。每个 driver 的小时数和周期按升序制成表格,直到有新车并重新设置。我想把每辆车做成一个序列,但逻辑上只能通过cycle/hour重置来识别汽车。
我使用带有 if 语句的 for 循环在数据帧上执行此操作,处理时间需要几个小时。我有几十万行,每行包含大约 20 列。
我的数据来自通过中等可靠连接的传感器,因此我想使用以下条件进行过滤:只有当 Car_Hours 和 Car_Cycles 都小于前一个时,新组才有效连续 2 行的组的最后一行。使用两个输出并检查两行更改足以过滤所有错误数据。
如果有人能告诉我如何在不使用繁琐的 for 循环和 if 语句的情况下快速解决 Car_Group,我将不胜感激。
此外,对于那些非常有冒险精神的人,我在下面添加了带有 if 语句的原始 for 循环。请注意,我在每个组中做了一些其他数据 analysis/tracking 以查看汽车的其他行为。如果您敢于查看该代码并向我展示一个有效的 Pandas 替代品,那就更值得称赞了。
name Car_Hours Car_Cycles Car_Group DeltaH
jan 101 404 1 55
jan 102 405 1 55
jan 103 406 1 56
jan 104 410 1 55
jan 105 411 1 56
jan 0 10 2 55
jan 1 12 2 58
jan 2 14 2 57
jan 3 20 2 59
jan 4 26 2 55
jan 10 36 2 56
jan 15 42 2 57
jan 27 56 2 57
jan 100 61 2 58
jan 500 68 2 58
jan 2 4 3 56
jan 3 15 3 57
pete 190 21 1 54
pete 211 29 1 58
pete 212 38 1 55
pete 304 43 1 56
pete 14 20 2 57
pete 15 27 2 57
pete 36 38 2 58
pete 103 47 2 55
mike 1500 2001 1 55
mike 1512 2006 1 59
mike 1513 2012 1 58
mike 1515 2016 1 57
mike 1516 2020 1 55
mike 1517 2024 1 57
..............
for i in range(len(file)):
if i == 0:
DeltaH_limit = 57
car_thresholds = 0
car_threshold_counts = 0
car_threshold_counts = 0
car_change_true = 0
car_change_index_loc = i
total_person_thresholds = 0
person_alert_count = 0
person_car_count = 1
person_car_change_count = 0
total_fleet_thresholds = 0
fleet_alert_count = 0
fleet_car_count = 1
fleet_car_change_count = 0
if float(file['Delta_H'][i]) >= DeltaH_limit:
car_threshold_counts += 1
car_thresholds += 1
total_person_thresholds += 1
total_fleet_thresholds += 1
elif i == 1:
if float(file['Delta_H'][i]) >= DeltaH_limit:
car_threshold_counts += 1
car_thresholds += 1
total_person_thresholds += 1
total_fleet_thresholds += 1
elif i > 1:
if file['name'][i] == file['name'][i-1]: #is same person?
if float(file['Delta_H'][i]) >= DeltaH_limit:
car_threshold_counts += 1
car_thresholds += 1
total_person_thresholds += 1
total_fleet_thresholds += 1
else:
car_threshold_counts = 0
if car_threshold_counts == 3:
car_threshold_counts += 1
person_alert_count += 1
fleet_alert_count += 1
#Car Change?? Compare cycles and hours to look for reset
if i+1 < len(file):
if file['name'][i] == file['name'][i+1] == file['name'][i-1]:
if int(file['Car_Cycles'][i]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i]) < int(file['Car_Hours'][i-1]):
if int(file['Car_Cycles'][i+1]) < int(file['Car_Cycles'][i-1]) and int(file['Car_Hours'][i]) < int(file['Car_Hours'][i-1]):
car_thresholds = 0
car_change_true = 1
car_threshold_counts = 0
car_threshold_counts = 0
old_pump_first_flight = car_change_index_loc
car_change_index_loc = i
old_pump_last_flight = i-1
person_car_count += 1
person_car_change_count += 1
fleet_car_count += 1
fleet_car_change_count += 1
print(i, ' working hard!')
else:
car_change_true = 0
else:
car_change_true = 0
else:
car_change_true = 0
else:
car_change_true = 0
else: #new car
car_thresholds = 0
car_threshold_counts = 0
car_threshold_counts = 0
car_change_index_loc = i
car_change_true = 0
total_person_thresholds = 0
person_alert_count = 0
person_car_count = 1
person_car_change_count = 0
if float(file['Delta_H'][i]) >= DeltaH_limit:
car_threshold_counts += 1
car_thresholds += 1
total_person_thresholds += 1
total_fleet_thresholds += 1
file.loc[i, 'car_thresholds'] = car_thresholds
file.loc[i, 'car_threshold_counts'] = car_threshold_counts
file.loc[i, 'car_threshold_counts'] = car_threshold_counts
file.loc[i, 'car_change_true'] = car_change_true
file.loc[i, 'car_change_index_loc'] = car_change_index_loc
file.loc[i, 'total_person_thresholds'] = total_person_thresholds
file.loc[i, 'person_alert_count'] = person_alert_count
file.loc[i, 'person_car_count'] = person_car_count
file.loc[i, 'person_car_change_count'] = person_car_change_count
file.loc[i, 'Total_Fleet_Thresholds'] = total_fleet_thresholds
file.loc[i, 'Fleet_Alert_Count'] = fleet_alert_count
file.loc[i, 'fleet_car_count'] = fleet_car_count
file.loc[i, 'fleet_car_change_count'] = fleet_car_change_count
IIUC,我们需要做的就是重现Car_Group,我们可以利用一些技巧:
def twolow(s):
return (s < s.shift()) & (s.shift(-1) < s.shift())
new_hour = twolow(df["Car_Hours"])
new_cycle = twolow(df["Car_Cycles"])
new_name = df["name"] != df["name"].shift()
name_group = new_name.cumsum()
new_cargroup = new_name | (new_hour & new_cycle)
cargroup_without_reset = new_cargroup.cumsum()
cargroup = (cargroup_without_reset -
cargroup_without_reset.groupby(name_group).transform(min) + 1)
技巧 #1:如果您想找出转换发生的位置,请将某物与其自身的转换版本进行比较。
技巧 #2:如果您在每个新组开始的地方有一个 True,当您对其求和时,您会得到一个序列,其中每个组都有一个与之关联的整数。
以上给了我
>>> cargroup.head(10)
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 2
9 2
dtype: int32
>>> (cargroup == df.Car_Group).all()
True