Start/end comparison
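I'm having trouble counting, for each row, the total number of record IDs whose start and end times are consecutive. Consecutive means that a row starts before the previous row ends and Name == Name. Record IDs 1-3 are consecutive because they overlap and have consecutive start/end datetimes.
I only want to show TRUE when the total number of consecutive conflicts is >= 3, and FALSE otherwise.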
import pandas as pd
import io
#SAMPLE DATA 1 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/20/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/20/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/20/20 10:20 AM;10/20/20 11:00 AM
4;SMITH, JOHN ;10/20/20 1:00 PM;10/20/20 2:15 PM
5;SMITH, JOHN;10/20/20 2:00 PM;10/20/20 4:00 PM
"""),sep=';')
# SAMPLE DATA 2 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/4/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/4/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/4/20 11:20 AM;10/20/20 12:00 PM
4;SMITH, JOHN ;10/4/20 1:00 PM;10/20/20 2:15 PM
5;SMITH, JOHN;10/4/20 3:15 PM;10/20/20 4:00 PM
"""),sep=';')
df['Start'] = df['Record Start']
df['End'] = df['Record End']
df['Start'] = pd.to_datetime(df['Start'], errors='coerce')
df['End'] = pd.to_datetime(df['End'], errors='coerce')
df['overlap?'] = False
print(df)
Expected Output for Sample Data 1:
Record ID Record Name ... overlap? total records consec >=3?
0 1 SMITH, JOHN ... True True
1 2 SMITH, JOHN ... True True
2 3 SMITH, JOHN ... True True
3 4 SMITH, JOHN ... True False
4 5 SMITH, JOHN ... True False
Expected Output for Sample Data 2:
Record ID Record Name ... overlap? total records consec >=3?
0 1 SMITH, JOHN ... True False
1 2 SMITH, JOHN ... True False
2 3 SMITH, JOHN ... True False
3 4 SMITH, JOHN ... False False
4 5 SMITH, JOHN ... True False
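This gives false positives. It just groups by name and date and counts the number of overlaps, but it does not check whether those overlaps are consecutive.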
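For reference, one way to test whether overlaps are consecutive, rather than only counting them per name and date, is to compare each row with the row before it for the same name. This is a minimal sketch, not a full solution: the column names `Start` and `End`, the helper column `continues_prev`, and the times (modelled on Sample Data 1, with record 4 assumed to start at 1:00 PM) are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data shaped like Sample Data 1 (times are illustrative assumptions).
df = pd.DataFrame({
    "Record Name": ["SMITH, JOHN"] * 5,
    "Start": pd.to_datetime([
        "2020-10-20 08:00", "2020-10-20 09:20", "2020-10-20 10:20",
        "2020-10-20 13:00", "2020-10-20 14:00",
    ]),
    "End": pd.to_datetime([
        "2020-10-20 09:30", "2020-10-20 10:30", "2020-10-20 11:00",
        "2020-10-20 14:15", "2020-10-20 16:00",
    ]),
})

# Rows must be ordered by name and start time before comparing neighbours.
df = df.sort_values(["Record Name", "Start"]).reset_index(drop=True)

# End of the previous row for the same name (NaT for the first row of each name).
prev_end = df.groupby("Record Name")["End"].shift()

# A row continues a chain when it starts before the previous row ends.
# Comparisons against NaT evaluate to False, so first rows never continue a chain.
df["continues_prev"] = df["Start"] < prev_end

print(df[["Start", "End", "continues_prev"]])
```

Here records 2 and 3 continue the chain started by record 1, record 4 starts a new chain, and record 5 continues it. Counting the length of each chain then tells whether it reaches 3.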
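Update:
Based on the suggested answer, I added the following code to the end of it to get a True or False value when the count is consecutive (for anyone interested). Unfortunately, the solution does not work for Sample Data 3 (see the corresponding section).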
prev = -1
consecutive = []
for i, v in enumerate(df['Count'].values):
    if v <= prev:
        if prev >= 3:
            consecutive += prev * [True]
        else:
            consecutive += prev * [False]
    elif len(df) == i + 1:
        if prev >= 3:
            consecutive += v * [True]
        else:
            consecutive += v * [False]
    prev = v
df[['Consecutive']] = consecutive
# SAMPLE DATA 3 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/5/20 7:47 AM;10/5/20 8:05 AM
2;SMITH, JOHN;10/5/20 11:43 AM;10/5/20 1:26 AM
3;SMITH, JOHN;10/5/20 12:48 AM;10/5/20 1:31 PM
4;SMITH, JOHN ;10/5/20 2:50 PM;10/5/20 5:00 PM
"""),sep=';')
Current Output:
Event ID Name Event Date ... End2 overlap Count
0 1 SMITH, JOHN 2021-10-05 ... 2021-10-05 08:05:00 False 1
1 2 SMITH, JOHN 2021-10-05 ... 2021-10-05 13:26:00 True 2
2 3 SMITH, JOHN 2021-10-05 ... 2021-10-05 13:31:00 True 3
3 4 SMITH, JOHN 2021-10-05 ... 2021-10-05 17:53:00 False 4
Expected Output:
Event ID Name Event Date ... End2 overlap Count
0 1 SMITH, JOHN 2021-10-05 ... 2021-10-05 08:05:00 False 1
1 2 SMITH, JOHN 2021-10-05 ... 2021-10-05 13:26:00 True 1
2 3 SMITH, JOHN 2021-10-05 ... 2021-10-05 13:31:00 True 2
3 4 SMITH, JOHN 2021-10-05 ... 2021-10-05 17:53:00 False 1
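Reasoning for the expected output:
- Event 1 does not conflict with any other event. Count = 1 (counting starts at 1) and overlap = False.
- Events 2 and 3 overlap each other. The count for Event ID 2 resets to 1 and the count for Event ID 3 is 2. Overlap = True for both.
- Event 4 does not overlap with any other event. The count resets to 1. Overlap = False.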
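The resetting count described above could be sketched roughly as follows. This is only an illustration under assumptions: the times are hypothetical values chosen to match the expected output (the sample CSV mixes AM and PM), and the helper column `chain` is a name introduced here.

```python
import pandas as pd

# Hypothetical times chosen to match the expected output above (an assumption).
df = pd.DataFrame({
    "Name": ["SMITH, JOHN"] * 4,
    "Start": pd.to_datetime(["2021-10-05 07:47", "2021-10-05 11:43",
                             "2021-10-05 12:48", "2021-10-05 14:50"]),
    "End": pd.to_datetime(["2021-10-05 08:05", "2021-10-05 13:26",
                           "2021-10-05 13:31", "2021-10-05 17:00"]),
})
df = df.sort_values(["Name", "Start"]).reset_index(drop=True)

# End of the previous row for the same name (NaT on the first row of a name).
prev_end = df.groupby("Name")["End"].shift()

# A new chain starts wherever the row does NOT begin before the previous end.
new_chain = ~(df["Start"] < prev_end)
df["chain"] = new_chain.cumsum()

# Count restarts at 1 inside every chain; overlap is True for chains of 2 or more.
df["Count"] = df.groupby("chain").cumcount() + 1
df["overlap"] = df.groupby("chain")["chain"].transform("size") > 1

print(df[["Start", "End", "overlap", "Count"]])
```

Note that this only compares each row with the row immediately before it, so a long record that contains several later records (as in Sample Data 4) would need the running maximum of the end times instead.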
Sample Data 4
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/4/20 12:00 AM;10/4/20 7:00 PM
2;SMITH, JOHN;10/4/20 8:00 AM;10/4/20 9:00 AM
3;SMITH, JOHN;10/4/20 10:00 AM;10/4/20 11:00 AM
4;SMITH, JOHN ;10/4/20 4:30 PM;10/4/20 5:00 PM
"""),sep=';')
Current Output:
Record Start Record End overlap Count
0 2021-10-04 02:00:00 2021-10-04 19:53:00 True 1
1 2021-10-04 08:05:00 2021-10-04 08:47:00 True 2
2 2021-10-04 09:55:00 2021-10-04 10:36:00 True 1
3 2021-10-04 13:19:00 2021-10-04 14:15:00 True 1
4 2021-10-04 16:39:00 2021-10-04 17:07:00 True 1
Expected Output:
Record Start Record End overlap Count
0 2021-10-04 02:00:00 2021-10-04 19:53:00 True 1
1 2021-10-04 08:05:00 2021-10-04 08:47:00 True 2
2 2021-10-04 09:55:00 2021-10-04 10:36:00 True 3
3 2021-10-04 13:19:00 2021-10-04 14:15:00 True 4
4 2021-10-04 16:39:00 2021-10-04 17:07:00 True 5
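Use dataframe.loc to get the current row and the previous row. If they overlap, add 1 to the previous row's Count; otherwise, set the count to 1. Then filter the dataframe for all rows whose count is 3 or more. You could also compute a running total by name and date.
I use timedelta a lot in my solution. I take total_seconds between the start and end datetimes, divide by 60 to get the number of minutes, and add each minute offset to the start time to build a list of datetimes at one-minute intervals.
apply creates the minute intervals between the start and end datetimes.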
import io
from datetime import datetime, timedelta

import pandas as pd

df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/20/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/20/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/20/20 10:20 AM;10/20/20 11:00 AM
4;COOPER, ALLEN;10/20/20 1:00 PM;10/20/20 2:15 PM
5;PEREZ, HILL;10/20/20 3:15 PM;10/20/20 4:00 PM
6;SMITH, JOHN;10/4/21 8:00 AM;10/20/21 9:30 AM
7;SMITH, JOHN;10/4/21 9:20 AM;10/20/21 10:30 AM
8;SMITH, JOHN;10/4/21 11:20 AM;10/20/21 12:00 PM
9;SMITH, JOHN ;10/4/21 1:00 PM;10/20/21 2:15 PM
10;SMITH, JOHN;10/4/21 3:15 PM;10/20/21 4:00 PM
"""),sep=';')
# Parse the start/end columns into pandas datetimes
df['Record Start'] = pd.to_datetime(df['Record Start'])
df['Record End'] = pd.to_datetime(df['Record End'])

# Helper that builds a datetime on the same calendar day as `date` (not used below)
def create_datetime(date, hour, minute, second):
    month = date.month
    day = date.day
    year = date.year
    return datetime(year=year, month=month, day=day, hour=hour, minute=minute, second=second, microsecond=0)

# List every whole minute from start to end, both endpoints included
def get_minutes(row):
    start = row['Record Start']
    end = row['Record End']
    results = [start + timedelta(minutes=x) for x in range(0, round((end - start).total_seconds() // 60) + 1)]
    return results

df['minutes'] = df.apply(get_minutes, axis=1)

# Minutes shared by two lists
def intersection(lst1, lst2):
    return list(set(lst1) & set(lst2))

prev_row = None
for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'Count'] = 1
    else:
        prev_row = df.iloc[index - 1]
    if prev_row is not None:
        if prev_row['Record Name'] == row['Record Name']:
            count = prev_row['Count']
            lst1 = row['minutes']
            lst2 = prev_row['minutes']
            # Shared minutes mean the two records overlap, so extend the run
            if len(intersection(lst1, lst2)) > 0:
                df.loc[index, 'Count'] = count + 1
            else:
                df.loc[index, 'Count'] = 1
        else:
            df.loc[index, 'Count'] = 1
#print(df[df['Count']>=3])
print(df)
Output:
Record ID Record Name Record Start Record End Count
0 1 SMITH, JOHN 10/20/20 8:00 AM 10/20/20 9:30 AM 1.0
1 2 SMITH, JOHN 10/20/20 9:20 AM 10/20/20 10:30 AM 2.0
2 3 SMITH, JOHN 10/20/20 10:20 AM 10/20/20 11:00 AM 3.0
3 4 COOPER, ALLEN 10/20/20 1:00 PM 10/20/20 2:15 PM 1.0
4 5 PEREZ, HILL 10/20/20 3:15 PM 10/20/20 4:00 PM 1.0
5 6 SMITH, JOHN 10/4/21 8:00 AM 10/20/21 9:30 AM 1.0
6 7 SMITH, JOHN 10/4/21 9:20 AM 10/20/21 10:30 AM 2.0
7 8 SMITH, JOHN 10/4/21 11:20 AM 10/20/21 12:00 PM 3.0
8 9 SMITH, JOHN 10/4/21 1:00 PM 10/20/21 2:15 PM 1.0
9 10 SMITH, JOHN 10/4/21 3:15 PM 10/20/21 4:00 PM 1.0
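As a side note, the minute-by-minute intersection above can probably be replaced by a direct comparison of the endpoints, which avoids building long lists for records that span many days. The function name `intervals_overlap` is hypothetical, and the sketch assumes closed intervals (touching endpoints count as overlapping, as they do when both minute lists include their endpoints).

```python
import pandas as pd


def intervals_overlap(start_a, end_a, start_b, end_b):
    """Return True when [start_a, end_a] and [start_b, end_b] share any instant."""
    # Two closed intervals overlap exactly when each one starts
    # no later than the other one ends.
    return start_a <= end_b and start_b <= end_a


# Illustrative values taken from records 1, 2 and 4 of the data above.
a = (pd.Timestamp("2020-10-20 08:00"), pd.Timestamp("2020-10-20 09:30"))
b = (pd.Timestamp("2020-10-20 09:20"), pd.Timestamp("2020-10-20 10:30"))
c = (pd.Timestamp("2020-10-20 13:00"), pd.Timestamp("2020-10-20 14:15"))

print(intervals_overlap(*a, *b))  # records 1 and 2 share 9:20 to 9:30
print(intervals_overlap(*a, *c))  # records 1 and 4 do not touch
```

Inside the loop, a check like this could stand in for `len(intersection(lst1, lst2)) > 0`, with the caveat that it compares exact timestamps rather than whole minutes.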
I finally found a solution that works. Here it is for anyone who finds this in the future:
df = pd.read_excel(r'PATH\FILE')
# Create new columns for Start/End values
df['Start'] = df['Record Start']
df['End'] = df['Record End']
# Convert to pandas datetime
df['Start'] = pd.to_datetime(df['Start'], errors='coerce')
df['End'] = pd.to_datetime(df['End'], errors='coerce')
# set static values
nest = []
flat = []
# Find overlapping events
df['overlap'] = False
for i, row in df.iterrows():
    if i in flat:
        continue
    start, end = row["Start"], row["End"]
    flag = True
    counter = 0
    while flag:
        counter += 1
        res = df.loc[(df['Name'] == row['Name']) &
                     (((df["Start"] >= start) & (df["Start"] <= end)) |
                      ((df["End"] >= start) & (df["End"] <= end)) |
                      ((end >= df["Start"]) & (end <= df["End"])) |
                      ((start >= df["Start"]) & (start <= df["End"])))].index.tolist()
        resbkup = res
        res += [i]
        temp_df = df.loc[res]
        temp_start = temp_df['Start'].min()
        temp_end = temp_df['End'].max()
        if counter > 50:
            print("True -- ", start, end, resbkup)
            print("temp", temp_start, temp_end)
            print(flag, res)
        if ((temp_start == start) and (temp_end == end)):
            flag = False
        else:
            start, end = temp_start, temp_end
    res = list(set(res))
    res.sort()
    nest.append(res)
    flat += [j for j in res]
for i, n in enumerate(nest):
    if len(n) > 1:
        df.loc[n, 'overlapIndex'] = "OverLap" + str(int(i+1))
        df.loc[n, 'overlap'] = True
    else:
        df.loc[n, 'overlap'] = False
    if len(n) >= 3:
        print(n)
        df.loc[n, 'Consecutive'] = True
    else:
        df.loc[n, 'Consecutive'] = False
print(df)
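For larger files, the same grouping can likely be done without the nested loops by sorting and tracking the latest end time seen so far for each name. This is a sketch under assumptions, not a drop-in replacement: the data is hypothetical (modelled on Sample Data 4, plus one unrelated name), the helper column `group` is a name introduced here, and touching intervals are treated as overlapping to mirror the inclusive comparisons in the loop above.

```python
import pandas as pd

# Hypothetical records: one long record that contains three shorter ones,
# plus a single record for another name (values are illustrative assumptions).
df = pd.DataFrame({
    "Name": ["SMITH, JOHN"] * 4 + ["COOPER, ALLEN"],
    "Start": pd.to_datetime(["2020-10-04 00:00", "2020-10-04 08:00",
                             "2020-10-04 10:00", "2020-10-04 16:30",
                             "2020-10-04 09:00"]),
    "End": pd.to_datetime(["2020-10-04 19:00", "2020-10-04 09:00",
                           "2020-10-04 11:00", "2020-10-04 17:00",
                           "2020-10-04 09:30"]),
})
df = df.sort_values(["Name", "Start"]).reset_index(drop=True)

# Latest end time among the earlier rows of the same name (NaT for the first row).
prev_max_end = df.groupby("Name")["End"].transform(lambda s: s.cummax().shift())

# A row opens a new group unless it starts at or before that running maximum.
new_group = ~(df["Start"] <= prev_max_end)
df["group"] = new_group.cumsum()

# Group size drives both flags.
size = df.groupby("group")["group"].transform("size")
df["overlap"] = size > 1
df["Consecutive"] = size >= 3

print(df[["Name", "Start", "End", "overlap", "Consecutive"]])
```

Using the running maximum rather than only the previous row's end is what keeps the 4:30 PM record in the same group as the record that runs from midnight to 7:00 PM.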