Get the first and last value of a dataframe column with respect to another column

I'm a Python beginner, and I want to get the first and last values of the date column for each mac_address. For example:

I have already sorted my dataframe by mac_address and then by date with the following line:

df = df.sort_values(by=['mac_address', 'date'], ascending=(True, True)) 

The data looks like this:

         router        mac_address      date
589455  15001391    00:00:34:1a:03:e8   2021-01-01 22:09:34
590067  17091211    00:00:34:1a:03:e8   2021-01-01 22:10:54
590136  17091236    00:00:34:1a:03:e8   2021-01-01 22:11:04
.....
.....
.....
635434  15001391    00:00:78:01:0d:11   2021-01-02 00:14:54
636479  17091211    00:00:78:01:0d:11   2021-01-02 00:16:17
949873  17091172    00:00:af:82:56:93   2021-01-02 11:26:39
950699  17091251    00:00:af:82:56:93   2021-01-02 11:27:59
950700  17091253    00:00:af:82:56:93   2021-01-02 11:28:59
950702  17091257    00:00:af:82:56:93   2021-01-02 11:29:59
950703  17091258    00:00:af:82:56:93   2021-01-02 11:30:59
619384  17091174    00:01:09:d2:09:e0   2021-01-01 23:34:32
365351  17091211    00:01:d2:7c:4e:32   2021-01-01 14:27:58
109858  17091236    00:02:75:86:4e:34   2021-01-01 05:50:47
110281  17091211    00:02:75:86:4e:34   2021-01-01 05:50:54

Note: the date column has the format "2021-01-01 05:50:54", and the number of times each mac address appears varies.

I want two outputs like these:

First output:

    589455  15001391    00:00:34:1a:03:e8   2021-01-01 22:09:34
    590136  17091236    00:00:34:1a:03:e8   2021-01-01 22:11:04
    635434  15001391    00:00:78:01:0d:11   2021-01-02 00:14:54
    636479  17091211    00:00:78:01:0d:11   2021-01-02 00:16:17
    .....
    .....
    949873  17091172    00:00:af:82:56:93   2021-01-02 11:26:39
    950703  17091258    00:00:af:82:56:93   2021-01-02 11:30:59
    619384  17091174    00:01:09:d2:09:e0   2021-01-01 23:34:32
    365351  17091211    00:01:d2:7c:4e:32   2021-01-01 14:27:58

Second output: only the mac addresses that have both a first and a last value, leaving out any mac_address that appears only once:

    589455  15001391    00:00:34:1a:03:e8   22:09:34
    590136  17091236    00:00:34:1a:03:e8   22:11:04
    635434  15001391    00:00:78:01:0d:11   00:14:54
    636479  17091211    00:00:78:01:0d:11   00:16:17
    .....
    .....
    949873  17091172    00:00:af:82:56:93   11:26:39
    950703  17091258    00:00:af:82:56:93   11:30:59

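I don't know whether I'm overcomplicating this or whether the task is easier than it looks, but I've spent 48 hours on it without any useful result. Can you help me? Thank you very much.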

For the first output, you can use .groupby on mac_address and then keep the "first" and "last" values:

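x = (
    df.groupby("mac_address")
    .agg(["first", "last"])    # first and last value of every column, per mac_address
    .stack()                   # move the first/last column level into the row index
    .reset_index()
    .drop(columns="level_1")   # drop the column holding the "first"/"last" labels
)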

print(x.drop_duplicates(keep="first"))

Prints:

          mac_address    router                date
0   00:00:34:1a:03:e8  15001391 2021-01-01 22:09:34
1   00:00:34:1a:03:e8  17091236 2021-01-01 22:11:04
2   00:00:78:01:0d:11  15001391 2021-01-02 00:14:54
3   00:00:78:01:0d:11  17091211 2021-01-02 00:16:17
4   00:00:af:82:56:93  17091172 2021-01-02 11:26:39
5   00:00:af:82:56:93  17091258 2021-01-02 11:30:59
6   00:01:09:d2:09:e0  17091174 2021-01-01 23:34:32
8   00:01:d2:7c:4e:32  17091211 2021-01-01 14:27:58
10  00:02:75:86:4e:34  17091236 2021-01-01 05:50:47
11  00:02:75:86:4e:34  17091211 2021-01-01 05:50:54

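For the second output, drop every row that has a duplicate. A mac address that appears only once produces two identical rows (its first and last are the same), so keep=False removes both: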

print(x.drop_duplicates(keep=False))

Prints:

          mac_address    router                date
0   00:00:34:1a:03:e8  15001391 2021-01-01 22:09:34
1   00:00:34:1a:03:e8  17091236 2021-01-01 22:11:04
2   00:00:78:01:0d:11  15001391 2021-01-02 00:14:54
3   00:00:78:01:0d:11  17091211 2021-01-02 00:16:17
4   00:00:af:82:56:93  17091172 2021-01-02 11:26:39
5   00:00:af:82:56:93  17091258 2021-01-02 11:30:59
10  00:02:75:86:4e:34  17091236 2021-01-01 05:50:47
11  00:02:75:86:4e:34  17091211 2021-01-01 05:50:54
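
One thing to keep in mind with the approach above: "first" and "last" are computed per column and skip missing values, and the original row index is lost. As a possible alternative, here is a minimal sketch that selects whole rows by their position inside each group. The sample frame is a hand-copied subset of the question's data, and the names from_start, from_end, group_size, first_last and first_last_multi are only illustrative.

```python
import pandas as pd

# Small sample copied from the question (already sorted by mac_address, date).
df = pd.DataFrame(
    {
        "router": [15001391, 17091211, 17091236, 17091174, 17091236, 17091211],
        "mac_address": [
            "00:00:34:1a:03:e8",
            "00:00:34:1a:03:e8",
            "00:00:34:1a:03:e8",
            "00:01:09:d2:09:e0",
            "00:02:75:86:4e:34",
            "00:02:75:86:4e:34",
        ],
        "date": pd.to_datetime(
            [
                "2021-01-01 22:09:34",
                "2021-01-01 22:10:54",
                "2021-01-01 22:11:04",
                "2021-01-01 23:34:32",
                "2021-01-01 05:50:47",
                "2021-01-01 05:50:54",
            ]
        ),
    },
    index=[589455, 590067, 590136, 619384, 109858, 110281],
)

g = df.groupby("mac_address", sort=False)

# Position of each row inside its group, counted from the start and from the end.
from_start = g.cumcount()
from_end = g.cumcount(ascending=False)

# Number of rows in the group each row belongs to.
group_size = g["mac_address"].transform("size")

# First output: first and last row of every group (single-row groups appear once).
is_edge = (from_start == 0) | (from_end == 0)
first_last = df[is_edge]

# Second output: the same rows, but only for groups with more than one row.
first_last_multi = df[is_edge & (group_size > 1)]

print(first_last)
print(first_last_multi)
```

Because the rows are selected with a boolean mask rather than aggregated, the original index labels (589455, 590136, and so on) are preserved, which matches the layout of the desired output in the question.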

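Since your data is already sorted by mac address and date, you don't need groupby at all. A row is the first of its group when its mac_address differs from the previous row's, and the last when it differs from the next row's: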

df1 = df.loc[(df['mac_address'].ne(df['mac_address'].shift())) | 
             (df['mac_address'].ne(df['mac_address'].shift(-1)))]

First output:

>>> df1
          router        mac_address                 date
589455  15001391  00:00:34:1a:03:e8  2021-01-01 22:09:34
590136  17091236  00:00:34:1a:03:e8  2021-01-01 22:11:04
635434  15001391  00:00:78:01:0d:11  2021-01-02 00:14:54
636479  17091211  00:00:78:01:0d:11  2021-01-02 00:16:17
949873  17091172  00:00:af:82:56:93  2021-01-02 11:26:39
950703  17091258  00:00:af:82:56:93  2021-01-02 11:30:59
619384  17091174  00:01:09:d2:09:e0  2021-01-01 23:34:32
365351  17091211  00:01:d2:7c:4e:32  2021-01-01 14:27:58
109858  17091236  00:02:75:86:4e:34  2021-01-01 05:50:47
110281  17091211  00:02:75:86:4e:34  2021-01-01 05:50:54

Second output:

>>> df1.loc[df1.duplicated('mac_address', keep=False)]
          router        mac_address                 date
589455  15001391  00:00:34:1a:03:e8  2021-01-01 22:09:34
590136  17091236  00:00:34:1a:03:e8  2021-01-01 22:11:04
635434  15001391  00:00:78:01:0d:11  2021-01-02 00:14:54
636479  17091211  00:00:78:01:0d:11  2021-01-02 00:16:17
949873  17091172  00:00:af:82:56:93  2021-01-02 11:26:39
950703  17091258  00:00:af:82:56:93  2021-01-02 11:30:59
109858  17091236  00:02:75:86:4e:34  2021-01-01 05:50:47
110281  17091211  00:02:75:86:4e:34  2021-01-01 05:50:54
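
For anyone who wants to try the shift comparison end to end, here is a minimal, self-contained sketch on a hand-copied subset of the question's data. It also shows one way to display only the time of day, as in the second desired output of the question. It assumes the date column has a datetime dtype; the names is_first, is_last and df2 are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "router": [17091236, 17091211, 15001391, 17091211, 17091236, 17091174],
        "mac_address": [
            "00:02:75:86:4e:34",
            "00:02:75:86:4e:34",
            "00:00:34:1a:03:e8",
            "00:00:34:1a:03:e8",
            "00:00:34:1a:03:e8",
            "00:01:09:d2:09:e0",
        ],
        "date": pd.to_datetime(
            [
                "2021-01-01 05:50:47",
                "2021-01-01 05:50:54",
                "2021-01-01 22:09:34",
                "2021-01-01 22:10:54",
                "2021-01-01 22:11:04",
                "2021-01-01 23:34:32",
            ]
        ),
    },
    index=[109858, 110281, 589455, 590067, 590136, 619384],
)

# The shift comparison only works on sorted data, so sort first.
df = df.sort_values(by=["mac_address", "date"])

mac = df["mac_address"]
is_first = mac.ne(mac.shift())    # differs from the previous row: group starts here
is_last = mac.ne(mac.shift(-1))   # differs from the next row: group ends here

# First output: first and last row of every mac_address.
df1 = df.loc[is_first | is_last]

# Second output: a row that is both first and last belongs to a single-row
# group, so exclude those rows.
df2 = df.loc[(is_first | is_last) & ~(is_first & is_last)].copy()

# Show only the time of day (this turns the column into strings).
df2["date"] = df2["date"].dt.strftime("%H:%M:%S")

print(df1)
print(df2)
```

Note that strftime converts the column to plain strings, so it is probably best applied as a last step, after any date arithmetic is done.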