Start end comparison

Question

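I am having trouble counting, for each row, the total number of record IDs whose start and end times are consecutive. A row is consecutive when it starts before the previous row ends and Name == Name. Record IDs 1-3 are consecutive because they overlap and have consecutive start/end datetimes.

I only want to show TRUE when the total number of consecutive conflicts is >= 3, and FALSE otherwise.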

import pandas as pd
import io

#SAMPLE DATA 1 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/20/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/20/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/20/20 10:20 AM;10/20/20 11:00 AM
4;SMITH, JOHN ;10/20/20 1:00 AM;10/20/20 2:15 PM
5;SMITH, JOHN;10/20/20 2:00 PM;10/20/20 4:00 PM
"""),sep=';')

# SAMPLE DATA 2 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/4/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/4/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/4/20 11:20 AM;10/20/20 12:00 PM
4;SMITH, JOHN ;10/4/20 1:00 PM;10/20/20 2:15 PM
5;SMITH, JOHN;10/4/20 3:15 PM;10/20/20 4:00 PM
"""),sep=';')

df['Start'] = df['Record Start']
df['End'] = df['Record End']

df['Start'] = pd.to_datetime(df['Start'], errors='coerce')
df['End'] = pd.to_datetime(df['End'], errors='coerce')

df['overlap?'] =  False


print(df)

Expected Output for Sample Data 1:
   Record ID     Record Name  ... overlap? total records consec >=3?
0          1     SMITH, JOHN  ...     True                    True
1          2     SMITH, JOHN  ...     True                    True
2          3     SMITH, JOHN  ...     True                    True
3          4    SMITH, JOHN   ...     True                    False
4          5     SMITH, JOHN  ...     True                    False



Expected Output for Sample Data 2:

   Record ID   Record Name  ... overlap? total records consec >=3?
0          1   SMITH, JOHN  ...     True                    False
1          2   SMITH, JOHN  ...     True                    False
2          3   SMITH, JOHN  ...     True                    False
3          4  SMITH, JOHN   ...    False                    False
4          5   SMITH, JOHN  ...     True                    False

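This gives false positives. It just groups by name and date and counts the number of overlaps, but it does not check whether those overlaps are consecutive.

Update: Based on the suggested answer, I added the following code to the end of it to get a True or False value when the count is consecutive (for anyone interested). Unfortunately, the solution does not work for Sample Data 3 (see the corresponding section).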

prev= -1
consecutive = []
for i, v in enumerate(df['Count'].values):
    if v <= prev:
        if prev >= 3:
            consecutive += prev * [True]
        else:
            consecutive += prev * [False]
    elif len(df) == i + 1:
        if prev >= 3:
            consecutive += v * [True]
        else:
            consecutive += v * [False]
    prev = v

df[['Consecutive']] = consecutive


# SAMPLE DATA 3 DF
df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/5/20 7:47 AM;10/5/20 8:05 AM
2;SMITH, JOHN;10/5/20 11:43 AM;10/5/20 1:26 AM
3;SMITH, JOHN;10/5/20 12:48 AM;10/5/20 1:31 PM
4;SMITH, JOHN ;10/5/20 2:50 PM;10/5/20 5:00 PM
"""),sep=';')

Current Output: 

Event ID         Name Event Date  ...                End2 overlap Count
0         1  SMITH, JOHN 2021-10-05  ... 2021-10-05 08:05:00   False     1
1         2  SMITH, JOHN 2021-10-05  ... 2021-10-05 13:26:00    True     2
2         3  SMITH, JOHN 2021-10-05  ... 2021-10-05 13:31:00    True     3
3         4  SMITH, JOHN 2021-10-05  ... 2021-10-05 17:53:00   False     4

Expected Output:

Event ID         Name Event Date  ...                End2 overlap Count
0         1  SMITH, JOHN 2021-10-05  ... 2021-10-05 08:05:00   False     1
1         2  SMITH, JOHN 2021-10-05  ... 2021-10-05 13:26:00    True     1
2         3  SMITH, JOHN 2021-10-05  ... 2021-10-05 13:31:00    True     2
3         4  SMITH, JOHN 2021-10-05  ... 2021-10-05 17:53:00   False     1

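Reasoning for the expected output:

Event 1 does not conflict with any other event. Count = 1 (starting from 1) and overlap = False.
Events 2 and 3 overlap each other. The count for Event ID 2 is reset to 1 and the count for Event ID 3 is set to 2. Overlap = True for both.
Event 4 does not overlap any other event. The count is reset to 1. Overlap = False.

Sample Data 4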

df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/4/20 12:00 AM;10/4/20 7:00 PM
2;SMITH, JOHN;10/4/20 8:00 AM;10/4/20 9:00 AM
3;SMITH, JOHN;10/4/20 10:00 AM;10/4/20 11:00 AM
4;SMITH, JOHN ;10/4/20 4:30 PM;10/4/20 5:00 PM
"""),sep=';')

Current Output:

         Record Start          Record End  overlap  Count
0 2021-10-04 02:00:00 2021-10-04 19:53:00     True      1
1 2021-10-04 08:05:00 2021-10-04 08:47:00     True      2
2 2021-10-04 09:55:00 2021-10-04 10:36:00     True      1
3 2021-10-04 13:19:00 2021-10-04 14:15:00     True      1
4 2021-10-04 16:39:00 2021-10-04 17:07:00     True      1

Expected Output:

         Record Start          Record End  overlap  Count
0 2021-10-04 02:00:00 2021-10-04 19:53:00     True      1
1 2021-10-04 08:05:00 2021-10-04 08:47:00     True      2
2 2021-10-04 09:55:00 2021-10-04 10:36:00     True      3
3 2021-10-04 13:19:00 2021-10-04 14:15:00     True      4
4 2021-10-04 16:39:00 2021-10-04 17:07:00     True      5

Answer 1

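Use dataframe.loc to get the current row and the previous row. If the dates are equal, add 1 to the previous row's Count value; otherwise set the count to 1. Then filter the DataFrame for all rows with a count of at least 3. You could also compute a running total by name and date.

I use timedelta a lot in my solution. I take total_seconds between the start and end datetimes, divide by 60 to get the number of minutes, and add each minute offset to the start time to build a list of datetimes at one-minute intervals.

apply builds the minute intervals between the start and end datetimes.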

import io
from datetime import datetime, timedelta

import pandas as pd

df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/20/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/20/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/20/20 10:20 AM;10/20/20 11:00 AM
4;COOPER, ALLEN;10/20/20 1:00 PM;10/20/20 2:15 PM
5;PEREZ, HILL;10/20/20 3:15 PM;10/20/20 4:00 PM
6;SMITH, JOHN;10/4/21 8:00 AM;10/20/21 9:30 AM
7;SMITH, JOHN;10/4/21 9:20 AM;10/20/21 10:30 AM
8;SMITH, JOHN;10/4/21 11:20 AM;10/20/21 12:00 PM
9;SMITH, JOHN ;10/4/21 1:00 PM;10/20/21 2:15 PM
10;SMITH, JOHN;10/4/21 3:15 PM;10/20/21 4:00 PM
"""),sep=';')

df['Record Start']=pd.to_datetime(df['Record Start'])
df['Record End']=pd.to_datetime(df['Record End'])
def create_datetime(date,hour,minute,second):
    month=date.month
    day=date.day
    year=date.year
    return datetime(year=year,month=month,day=day,hour=hour,minute=minute,second=second,microsecond=0)
def get_minutes(row):
    start=row['Record Start']
    end = row['Record End']

    results=[start + timedelta(minutes=x) for x in range(0, round((end-start).total_seconds()//60)+1)]
    
    #for item in results:
    #    print(item)
    #sys.exit()
    return results

df['minutes'] = df.apply(get_minutes, axis=1)

def intersection(lst1, lst2):
    return list(set(lst1) & set(lst2))

prev_row=None
for index,row in df.iterrows():
    if index==0:
        df.loc[index,'Count']=1
    else:
        prev_row=df.iloc[index-1]
        
    if prev_row is not None:
        if prev_row['Record Name']==row['Record Name']:
            count=prev_row['Count']
            lst1=row['minutes']
            lst2=prev_row['minutes']
            if len(intersection(lst1,lst2))>0:
                df.loc[index,'Count']=count+1
            else:
                df.loc[index,'Count']=1
        else:
            df.loc[index,'Count']=1
        
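# print(df[df['Count'] >= 3])  # uncomment to keep only runs of 3 or more
# Print once, after the loop has filled in every Count value
print(df)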

Output:

   Record ID    Record Name       Record Start         Record End  Count
0          1    SMITH, JOHN   10/20/20 8:00 AM   10/20/20 9:30 AM    1.0
1          2    SMITH, JOHN   10/20/20 9:20 AM  10/20/20 10:30 AM    2.0
2          3    SMITH, JOHN  10/20/20 10:20 AM  10/20/20 11:00 AM    3.0
3          4  COOPER, ALLEN   10/20/20 1:00 PM   10/20/20 2:15 PM    1.0
4          5    PEREZ, HILL   10/20/20 3:15 PM   10/20/20 4:00 PM    1.0
5          6    SMITH, JOHN    10/4/21 8:00 AM   10/20/21 9:30 AM    1.0
6          7    SMITH, JOHN    10/4/21 9:20 AM  10/20/21 10:30 AM    2.0
7          8    SMITH, JOHN   10/4/21 11:20 AM  10/20/21 12:00 PM    3.0
8          9   SMITH, JOHN     10/4/21 1:00 PM   10/20/21 2:15 PM    1.0
9         10    SMITH, JOHN    10/4/21 3:15 PM   10/20/21 4:00 PM    1.0
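
The same previous-row comparison can probably be done without building per-minute lists. The following is only a sketch, not the code from this answer: it assumes the rows are already ordered by start time within each name, it treats a row as continuing a run when it starts at or before the previous row's end, and the names `continues`, `run_id` and the `Consecutive` column are introduced here for illustration.

```python
import io

import pandas as pd

df = pd.read_csv(io.StringIO("""
Record ID;Record Name;Record Start;Record End
1;SMITH, JOHN;10/20/20 8:00 AM;10/20/20 9:30 AM
2;SMITH, JOHN;10/20/20 9:20 AM;10/20/20 10:30 AM
3;SMITH, JOHN;10/20/20 10:20 AM;10/20/20 11:00 AM
4;COOPER, ALLEN;10/20/20 1:00 PM;10/20/20 2:15 PM
5;PEREZ, HILL;10/20/20 3:15 PM;10/20/20 4:00 PM
"""), sep=";")

# Explicit format so the two-digit year and AM/PM are parsed unambiguously
fmt = "%m/%d/%y %I:%M %p"
df["Record Start"] = pd.to_datetime(df["Record Start"], format=fmt)
df["Record End"] = pd.to_datetime(df["Record End"], format=fmt)

# A row continues a run when it has the same name as the previous row
# and starts no later than the previous row ends
same_name = df["Record Name"].eq(df["Record Name"].shift())
overlaps_prev = df["Record Start"].le(df["Record End"].shift())
continues = same_name & overlaps_prev

# Every row that does not continue a run opens a new one
run_id = (~continues).cumsum()

# Position inside the run (1-based), and whether the run reaches 3 rows
df["Count"] = df.groupby(run_id).cumcount() + 1
df["Consecutive"] = df.groupby(run_id)["Count"].transform("max").ge(3)

print(df[["Record ID", "Record Name", "Count", "Consecutive"]])
```

Like the loop above, this only looks at the immediately preceding row, so a long record that spans several later, non-adjacent records is not chained into one run.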

Answer 2

I finally found a solution that works. Here it is for anyone who needs it in the future:

df = pd.read_excel(r'PATH\FILE')   

# Create new columns for Start/End values
df['Start'] = df['Record Start']
df['End'] = df['Record End']

# Convert to pandas datetime
df['Start'] = pd.to_datetime(df['Start'], errors='coerce')
df['End'] = pd.to_datetime(df['End'], errors='coerce')

# set static values
nest = []
flat = []

# Find overlapping events
df['overlap'] = False

for i, row in df.iterrows():
    if i in flat:
        continue
    start, end = row["Start"], row["End"]
    flag = True
    counter = 0

    while flag:
        counter += 1
        res = df.loc[(df['Name'] == row['Name']) &
                     (((df["Start"] >= start) & (df["Start"] <= end)) |
                      ((df["End"] >= start) & (df["End"] <= end)) |
                      ((end >= df["Start"]) & (end <= df["End"])) |
                      ((start >= df["Start"]) & (start <= df["End"])))].index.tolist()
        resbkup = list(res)  # copy, so the debug print shows the matches before i is appended
        res += [i]
        temp_df = df.loc[res]
        temp_start = temp_df['Start'].min()
        temp_end = temp_df['End'].max()
        if counter > 50:
            print("True -- ",start,end, resbkup)
            print("temp", temp_start,  temp_end)
            print(flag, res)
        if ((temp_start == start) and (temp_end == end)):
            flag = False
        else:
            start, end = temp_start, temp_end

    res = list(set(res))
    res.sort()
    nest.append(res)
    flat += [j for j in res]

for i, n in enumerate(nest):
    if len(n) >1:
        df.loc[n, 'overlapIndex'] = "OverLap" +str(int(i+1))
        df.loc[n, 'overlap'] = True
    else:
        df.loc[n, 'overlap'] = False
    if len(n) >= 3:
        print(n)
        df.loc[n, 'Consecutive'] = True
    else:
        df.loc[n, 'Consecutive'] = False
        
print(df)
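
For comparison, the grouping that the loop above builds (rows chained together through any shared overlap) can be approximated with a sort and a running maximum. This is a sketch under assumptions, not the author's method: it reuses the `Name`, `Start` and `End` column names from the code above, the first four rows mirror Sample Data 4, the fifth row is a made-up non-overlapping record, and the `cluster` and `Count` columns are hypothetical helpers.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["SMITH, JOHN"] * 5,
    "Start": pd.to_datetime([
        "2020-10-04 00:00", "2020-10-04 08:00", "2020-10-04 10:00",
        "2020-10-04 16:30", "2020-10-04 20:00",  # last row is hypothetical
    ]),
    "End": pd.to_datetime([
        "2020-10-04 19:00", "2020-10-04 09:00", "2020-10-04 11:00",
        "2020-10-04 17:00", "2020-10-04 21:00",
    ]),
})

# Order rows so that every possible overlap partner comes earlier
df = df.sort_values(["Name", "Start"]).reset_index(drop=True)

# Latest end time among all earlier rows with the same name
prev_max_end = df.groupby("Name")["End"].transform(lambda s: s.cummax().shift())

# A new cluster starts on the first row of a name, or when a row
# starts after everything before it has already ended
new_cluster = prev_max_end.isna() | (df["Start"] > prev_max_end)
df["cluster"] = new_cluster.cumsum()

# Flags derived from the cluster size, plus a 1-based position in the cluster
size = df.groupby("cluster")["cluster"].transform("size")
df["overlap"] = size > 1
df["Consecutive"] = size >= 3
df["Count"] = df.groupby("cluster").cumcount() + 1

print(df)
```

Comparing each start against the running maximum of earlier end times, rather than only the previous row's end, is what lets a day-long record pull the shorter records inside it into the same cluster.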

python

pandas

loops

python-datetime