How to create an algorithm that helps me improve results and automate the process?

Question

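I already posted about this problem here, and since then I have been trying to find a solution that helps me optimize my results. In the previous post, Yaloa understood what I wanted, but sadly I keep hitting dead ends. (My previous post)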

What I want is to improve my results so that I can visualize them. Here is my dataframe:

ID           TimeandDate        Date       Time
10   2020-08-07 07:40:09  2022-08-07   07:40:09
10   2020-08-07 08:50:00  2022-08-07   08:50:00
10   2020-08-07 12:40:09  2022-08-07   12:40:09
10   2020-08-08 07:40:09  2022-08-08   07:40:09
10   2020-08-08 17:40:09  2022-08-08   17:40:09
12   2020-08-07 08:03:09  2022-08-07   08:03:09
12   2020-08-07 10:40:09  2022-08-07   10:40:09
12   2020-08-07 14:40:09  2022-08-07   14:40:09
12   2020-08-07 16:40:09  2022-08-07   16:40:09
13   2020-08-07 09:22:45  2022-08-07   09:22:45
13   2020-08-07 17:57:06  2022-08-07   17:57:06
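
To try the answers on this data, the table can be rebuilt as a small dataframe. This is only a sketch: it assumes Date and Time are stored as strings, it copies the Date column exactly as printed (2022, although TimeandDate shows 2020), and it leaves out TimeandDate, which the answers drop anyway.

```python
import pandas as pd

# (ID, Date, Time) copied from the table above; all values kept as strings.
records = [
    (10, "2022-08-07", "07:40:09"),
    (10, "2022-08-07", "08:50:00"),
    (10, "2022-08-07", "12:40:09"),
    (10, "2022-08-08", "07:40:09"),
    (10, "2022-08-08", "17:40:09"),
    (12, "2022-08-07", "08:03:09"),
    (12, "2022-08-07", "10:40:09"),
    (12, "2022-08-07", "14:40:09"),
    (12, "2022-08-07", "16:40:09"),
    (13, "2022-08-07", "09:22:45"),
    (13, "2022-08-07", "17:57:06"),
]
df = pd.DataFrame(records, columns=["ID", "Date", "Time"])

# Number of punches per (ID, Date); an odd count means a missed check-out.
punches_per_day = df.groupby(["ID", "Date"]).size()
print(punches_per_day)
```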

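First, all the data is collected from a time clock. I want to create a new dataframe with two new columns. The first is df["Check-in"]. As you can see, my data has no indicator showing when an ID checked in, so I assume that the first time for each ID is a check-in and the next row is the check-out, which goes into df["Check-out"]. If a check-in has no check-out time, it must be recorded as the check-out in place of the previous check-out of the same day (sometimes an ID forgets to check out), because the number of check-in and check-out rows must be the same: there cannot be 2 check-ins and 3 check-outs.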

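What have I tried? I need a better result because what I tried is not the best solution: I took the min of the time for each ID as the check-in and the max as the check-out, without adding the two columns, and then computed the time difference. Now imagine ID=13 comes in at 07:40:09 and leaves at 08:40:09, then returns later that day at 19:20:00 and leaves 10 minutes later at 19:30:00. With my approach it would show that he worked about 12 hours (11:49:51), while he really worked 1 hour and 10 minutes.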
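
The gap between the two approaches can be checked with a few lines. This is a minimal sketch using the four hypothetical punches from the scenario above; the variable names are illustrative only.

```python
import pandas as pd

# The four punches from the ID=13 scenario, as durations since midnight.
t = pd.to_timedelta(["07:40:09", "08:40:09", "19:20:00", "19:30:00"])

# min/max approach: a single span from the first to the last punch.
span = t.max() - t.min()

# Pairing approach: sum (out - in) over consecutive (in, out) pairs.
paired = sum((out - inn for inn, out in zip(t[::2], t[1::2])), pd.Timedelta(0))

print(span)    # 0 days 11:49:51
print(paired)  # 0 days 01:10:00
```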

Desired result

ID         Date   Check-in    Check-out
10   2020-08-07   07:40:09     12:40:09
10   2020-08-08   07:40:09     17:40:09
12   2020-08-07   08:03:09     10:40:09
12   2020-08-07   14:40:09     16:40:09 
13   2020-08-07   09:22:45     17:57:06

Thanks in advance

Answer 1

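It took me a while to understand your question properly. One approach is to group df by ID and Date and check the number of rows in each sub-df. If the number of rows is odd, drop the next-to-last row. Finally, create your check-in and check-out columns (ffill the check-in and dropna to remove the duplicate entries):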

import pandas as pd

df = df.drop('TimeandDate', axis=1)
pieces = []  # one small frame per (ID, Date); DataFrame.append was removed in pandas 2.0

for (emp_id, date), subdf in df.groupby(['ID', 'Date']):
    subdf = subdf.reset_index(drop=True)

    # odd number of punches (a check-out was missed): drop the next-to-last row
    nb_checks = len(subdf.index)
    if nb_checks % 2 and nb_checks > 1:
        subdf.iloc[-2, :] = subdf.iloc[-1, :]
        subdf = subdf.head(-1).copy()

    # even rows are check-ins, odd rows are check-outs
    subdf['Check-in'] = subdf.loc[::2, 'Time']
    subdf['Check-out'] = subdf.loc[1::2, 'Time']
    subdf['Check-in'] = subdf['Check-in'].ffill()

    pieces.append(subdf.drop('Time', axis=1).dropna())

df_output = pd.concat(pieces, ignore_index=True)
print(df_output)

Output:

   ID        Date  Check-in Check-out
0  10  2022-08-07  07:40:09  12:40:09
1  10  2022-08-08  07:40:09  17:40:09
2  12  2022-08-07  08:03:09  10:40:09
3  12  2022-08-07  14:40:09  16:40:09
4  13  2022-08-07  09:22:45  17:57:06

Edit/Note: if a given ID/Date has only one entry, it will not appear in your final result. (You have to handle the nb_checks == 1 case separately.)
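
One way to pick out those single-entry days so they can be handled separately is sketched below. The sample frame is made up for illustration, and turning a lone punch into a zero-length shift is an assumed convention, not something this answer prescribes.

```python
import pandas as pd

# Small illustrative frame: ID 10 has two punches, ID 13 has a single punch.
df = pd.DataFrame({
    "ID": [10, 10, 13],
    "Date": ["2022-08-07", "2022-08-07", "2022-08-07"],
    "Time": ["07:40:09", "12:40:09", "09:22:45"],
})

# Size of each (ID, Date) group, broadcast back onto the rows.
group_size = df.groupby(["ID", "Date"])["Time"].transform("size")

# Rows that are the only punch of their day.
lone = df[group_size == 1]

# Assumed convention: a lone punch is both check-in and check-out.
lone_shifts = lone.assign(
    **{"Check-in": lone["Time"], "Check-out": lone["Time"]}
).drop(columns="Time")
print(lone_shifts)
```

The resulting rows could then be concatenated with the output of the loop above using pd.concat.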

Answer 2

Here is a solution that, while not as efficient as Tranbi's, handles the single-row case by making the check-in and check-out times the same.

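import pandas as pd

df = df.drop(["TimeandDate"], axis=1)
# Sort by Time as well, since the pairing below relies on chronological order.
df = df.sort_values(by=["ID", "Date", "Time"], axis=0)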

unique_ids = df["ID"].unique()

# Rows will eventually become the final df.
rows = []

for id in unique_ids:
    id_df = df[df["ID"] == id]
    unique_dates = id_df["Date"].unique()

    for date in unique_dates:
        id_date_df = id_df[id_df["Date"] == date]
        length = len(id_date_df)

        # Case where there are an even number of rows for that ID and date combination.
        if length % 2 == 0:
            for i in range(0, length, 2):
                checkin_time = id_date_df["Time"].iloc[i]
                checkout_time = id_date_df["Time"].iloc[i + 1]

                row = [id, date, checkin_time, checkout_time]
                rows.append(row)

        # Odd number of rows, more than 2 rows.
        elif length > 2:
            for i in range(0, length-3, 2):
                checkin_time = id_date_df["Time"].iloc[i]
                checkout_time = id_date_df["Time"].iloc[i + 1]

                row = [id, date, checkin_time, checkout_time]
                rows.append(row)

            # The last row checkin-checkout combo comes from the 3rd-from-last and last rows.
            rows.append([id, date, id_date_df["Time"].iloc[length-3], id_date_df["Time"].iloc[length-1]])

        # 1 row only. Dealt with by making checkin and checkout times the same.
        else:
            rows.append([id, date, id_date_df["Time"].iloc[0], id_date_df["Time"].iloc[0]])

# The final dataframe.
df_fin = pd.DataFrame(rows, columns=["ID", "Date", "CheckinTime", "CheckoutTime"])

You can most likely replace some of the for loops with a groupby, as in Tranbi's answer, to make this more efficient.
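
A rough sketch of what that could look like: the three cases are kept, but the nested loops and boolean filters are replaced by a single groupby. The helper pair_times and the sample frame are hypothetical names and data introduced only for this illustration.

```python
import pandas as pd

def pair_times(times):
    """Pair one day's punches using the same three cases as the loops above."""
    n = len(times)
    if n == 1:
        # Single punch: check-in and check-out are the same.
        return [(times[0], times[0])]
    if n % 2:
        # Odd count: regular pairs first, then 3rd-from-last with last.
        regular = list(zip(times[:n - 3:2], times[1:n - 3:2]))
        return regular + [(times[n - 3], times[n - 1])]
    # Even count: consecutive (in, out) pairs.
    return list(zip(times[::2], times[1::2]))

df = pd.DataFrame({
    "ID": [10, 10, 10, 12, 12, 12, 12, 13],
    "Date": ["2022-08-07"] * 8,
    "Time": ["07:40:09", "08:50:00", "12:40:09",
             "08:03:09", "10:40:09", "14:40:09", "16:40:09",
             "09:22:45"],
})

rows = [
    (emp_id, date, checkin, checkout)
    for (emp_id, date), group in df.groupby(["ID", "Date"])
    for checkin, checkout in pair_times(group["Time"].tolist())
]
df_fin = pd.DataFrame(rows, columns=["ID", "Date", "CheckinTime", "CheckoutTime"])
print(df_fin)
```

groupby splits the frame once instead of filtering it again for every ID and date, which is where most of the saving should come from.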

Answer 3

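Assuming your initial dataframe is df:

Group the data by "ID" and "Date", then take the even-indexed values (i.e. 0, 2, ...). Convert the result to a dataframe with .to_frame(), rename the column, and drop the unneeded index level (the original index).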

import numpy as np
import pandas as pd

df2 = df.groupby(["ID", "Date"]).apply(lambda x: x["Time"].iloc[::2]).to_frame().rename(columns={"Time": "Check-in"}).droplevel(2)
#Out: 
#               Check-in
#ID Date                
#10 2022-08-07  07:40:09
#   2022-08-07  12:40:09
#   2022-08-08  07:40:09
#12 2022-08-07  08:03:09
#   2022-08-07  14:40:09
#13 2022-08-07  09:22:45

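Build a second frame the same way from the odd-indexed values, then merge the two. Note that the (ID, Date) index is not unique, so the merge pairs every check-in with every check-out of the same day, as the four rows for ID 12 show.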

df2 = df2.merge(df.groupby(["ID", "Date"]).apply(lambda x: x["Time"].iloc[1::2]).to_frame().rename(columns={"Time": "Check-out"}).droplevel(2),
                left_index=True, right_index=True, how="left")
#Out: 
#               Check-in Check-out
#ID Date                          
#10 2022-08-07  07:40:09  08:50:00
#   2022-08-07  12:40:09  08:50:00
#   2022-08-08  07:40:09  17:40:09
#12 2022-08-07  08:03:09  10:40:09
#   2022-08-07  08:03:09  16:40:09
#   2022-08-07  14:40:09  10:40:09
#   2022-08-07  14:40:09  16:40:09
#13 2022-08-07  09:22:45  17:57:06

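Where the check-out is earlier than the check-in (which happens when the numbers of ins and outs are unequal), set the check-out to the check-in time, so that the shift counts as 0 working time.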

df2["Check-out"] = np.where(df2["Check-out"] < df2["Check-in"], df2["Check-in"], df2["Check-out"])
#Out: 
#               Check-in Check-out
#ID Date                          
#10 2022-08-07  07:40:09  08:50:00
#   2022-08-07  12:40:09  12:40:09
#   2022-08-08  07:40:09  17:40:09
#12 2022-08-07  08:03:09  10:40:09
#   2022-08-07  08:03:09  16:40:09
#   2022-08-07  14:40:09  14:40:09
#   2022-08-07  14:40:09  16:40:09
#13 2022-08-07  09:22:45  17:57:06

Compute the difference between check-in and check-out.

df2["Time_in"] = pd.to_datetime(df2["Check-out"], format='%H:%M:%S') - pd.to_datetime(df2["Check-in"], format='%H:%M:%S')

Sum per ID and per day to get the total hours worked.

logged_hours = df2.groupby(["ID", "Date"])["Time_in"].sum()

#Out: 
#ID  Date      
#10  2022-08-07   0 days 01:09:51
#    2022-08-08   0 days 10:00:00
#12  2022-08-07   0 days 13:14:00
#13  2022-08-07   0 days 08:34:21
#Name: Time_in, dtype: timedelta64[ns]
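
The 13:14:00 reported for ID 12 includes the cross-paired rows 08:03:09 to 16:40:09 and 14:40:09 to 10:40:09, whereas the two actual shifts add up to 04:37:00. A possible alternative, sketched here under the same zero-length rule for an unmatched check-in, pairs punches by their position within the day instead of merging on (ID, Date). The columns shift and role are names made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [10, 10, 10, 12, 12, 12, 12],
    "Date": ["2022-08-07"] * 7,
    "Time": ["07:40:09", "08:50:00", "12:40:09",
             "08:03:09", "10:40:09", "14:40:09", "16:40:09"],
})

# Position of each punch within its (ID, Date) group: 0, 1, 2, ...
pos = df.groupby(["ID", "Date"]).cumcount()

# Shift number (0, 0, 1, 1, ...) and role (even = check-in, odd = check-out).
df = df.assign(shift=pos // 2,
               role=(pos % 2).map({0: "Check-in", 1: "Check-out"}))

# One row per (ID, Date, shift), one column per role.
pairs = df.pivot(index=["ID", "Date", "shift"], columns="role", values="Time")

# Unmatched check-in: count it as a zero-length shift.
pairs["Check-out"] = pairs["Check-out"].fillna(pairs["Check-in"])

worked = pd.to_timedelta(pairs["Check-out"]) - pd.to_timedelta(pairs["Check-in"])
logged_hours = worked.groupby(level=["ID", "Date"]).sum()
print(logged_hours)
```

Because each punch gets its own (ID, Date, shift) key, a check-in can only ever meet the check-out that directly follows it.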

Tags: python, algorithm, dataframe, pandas, data-science