Get the first and last value of a dataframe column with respect to another column
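I'm a Python beginner. For each mac_address, I want to get the first and last values of the date column.
I have already sorted my dataframe by mac_address and date with the following line:
df = df.sort_values(by=['mac_address', 'date'], ascending=(True, True))
The data is: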
router mac_address date
589455 15001391 00:00:34:1a:03:e8 2021-01-01 22:09:34
590067 17091211 00:00:34:1a:03:e8 2021-01-01 22:10:54
590136 17091236 00:00:34:1a:03:e8 2021-01-01 22:11:04
.....
.....
.....
635434 15001391 00:00:78:01:0d:11 2021-01-02 00:14:54
636479 17091211 00:00:78:01:0d:11 2021-01-02 00:16:17
949873 17091172 00:00:af:82:56:93 2021-01-02 11:26:39
950699 17091251 00:00:af:82:56:93 2021-01-02 11:27:59
950700 17091253 00:00:af:82:56:93 2021-01-02 11:28:59
950702 17091257 00:00:af:82:56:93 2021-01-02 11:29:59
950703 17091258 00:00:af:82:56:93 2021-01-02 11:30:59
619384 17091174 00:01:09:d2:09:e0 2021-01-01 23:34:32
365351 17091211 00:01:d2:7c:4e:32 2021-01-01 14:27:58
109858 17091236 00:02:75:86:4e:34 2021-01-01 05:50:47
110281 17091211 00:02:75:86:4e:34 2021-01-01 05:50:54
Note: the date column has the format "2021-01-01 05:50:54", and the number of times each MAC address appears varies.
I would like two outputs like these:
First output:
589455 15001391 00:00:34:1a:03:e8 2021-01-01 22:09:34
590136 17091236 00:00:34:1a:03:e8 2021-01-01 22:11:04
635434 15001391 00:00:78:01:0d:11 2021-01-02 00:14:54
636479 17091211 00:00:78:01:0d:11 2021-01-02 00:16:17
.....
.....
949873 17091172 00:00:af:82:56:93 2021-01-02 11:26:39
950703 17091258 00:00:af:82:56:93 2021-01-02 11:30:59
619384 17091174 00:01:09:d2:09:e0 2021-01-01 23:34:32
365351 17091211 00:01:d2:7c:4e:32 2021-01-01 14:27:58
Second output: only the MAC addresses that have both a first and a last value, ignoring any mac_address that appears only once:
589455 15001391 00:00:34:1a:03:e8 22:09:34
590136 17091236 00:00:34:1a:03:e8 22:11:04
635434 15001391 00:00:78:01:0d:11 00:14:54
636479 17091211 00:00:78:01:0d:11 00:16:17
.....
.....
949873 17091172 00:00:af:82:56:93 11:26:39
950703 17091258 00:00:af:82:56:93 11:30:59
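I don't know whether I'm overcomplicating this or whether the task is simpler than it looks, but I've spent 48 hours on it without any useful result. Can you help me? Thank you very much.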
For the first output, you can .groupby on mac_address and then aggregate with "first" and "last":
x = (
    df.groupby("mac_address")
    .agg(["first", "last"])
    .stack()
    .reset_index()
    .drop(columns="level_1")
)
print(x.drop_duplicates(keep="first"))
Prints:
mac_address router date
0 00:00:34:1a:03:e8 15001391 2021-01-01 22:09:34
1 00:00:34:1a:03:e8 17091236 2021-01-01 22:11:04
2 00:00:78:01:0d:11 15001391 2021-01-02 00:14:54
3 00:00:78:01:0d:11 17091211 2021-01-02 00:16:17
4 00:00:af:82:56:93 17091172 2021-01-02 11:26:39
5 00:00:af:82:56:93 17091258 2021-01-02 11:30:59
6 00:01:09:d2:09:e0 17091174 2021-01-01 23:34:32
8 00:01:d2:7c:4e:32 17091211 2021-01-01 14:27:58
10 00:02:75:86:4e:34 17091236 2021-01-01 05:50:47
11 00:02:75:86:4e:34 17091211 2021-01-01 05:50:54
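For the second output, just drop all duplicates. A mac_address that appears only once yields identical "first" and "last" rows, so keep=False removes both of them: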
print(x.drop_duplicates(keep=False))
Prints:
mac_address router date
0 00:00:34:1a:03:e8 15001391 2021-01-01 22:09:34
1 00:00:34:1a:03:e8 17091236 2021-01-01 22:11:04
2 00:00:78:01:0d:11 15001391 2021-01-02 00:14:54
3 00:00:78:01:0d:11 17091211 2021-01-02 00:16:17
4 00:00:af:82:56:93 17091172 2021-01-02 11:26:39
5 00:00:af:82:56:93 17091258 2021-01-02 11:30:59
10 00:02:75:86:4e:34 17091236 2021-01-01 05:50:47
11 00:02:75:86:4e:34 17091211 2021-01-01 05:50:54
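To try this on a small frame, here is a minimal self-contained sketch. It is not the agg/stack route shown above but a swapped-in variant that uses `groupby(...).cumcount()` and `transform("size")`, so the original index and column order are kept. The sample rows are copied from the question; the names `pos`, `size`, `is_edge`, `first_last`, and `multi_only` are illustrative only.

```python
import pandas as pd

# A few rows from the question, already sorted by mac_address, then date.
df = pd.DataFrame(
    {
        "router": [15001391, 17091211, 17091236, 17091174, 17091236, 17091211],
        "mac_address": [
            "00:00:34:1a:03:e8", "00:00:34:1a:03:e8", "00:00:34:1a:03:e8",
            "00:01:09:d2:09:e0",
            "00:02:75:86:4e:34", "00:02:75:86:4e:34",
        ],
        "date": pd.to_datetime([
            "2021-01-01 22:09:34", "2021-01-01 22:10:54", "2021-01-01 22:11:04",
            "2021-01-01 23:34:32",
            "2021-01-01 05:50:47", "2021-01-01 05:50:54",
        ]),
    },
    index=[589455, 590067, 590136, 619384, 109858, 110281],
)

groups = df.groupby("mac_address")
pos = groups.cumcount()                     # 0, 1, 2, ... within each MAC
size = groups["router"].transform("size")   # rows per MAC, aligned to df

# A row is an "edge" if it is the first or the last row of its group.
is_edge = (pos == 0) | (pos == size - 1)

first_last = df[is_edge]                    # first output
multi_only = df[is_edge & (size > 1)]       # second output: skip single-row MACs

print(first_last)
print(multi_only)
```

Because the mask is computed per group, this variant only needs the rows to be ordered by date within each mac_address; the groups themselves do not have to be contiguous.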
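Since your data is already sorted by MAC address and date, you don't need groupby. A row is the first of its group when its mac_address differs from the previous row's, and the last when it differs from the next row's:
df1 = df.loc[(df['mac_address'].ne(df['mac_address'].shift())) |
             (df['mac_address'].ne(df['mac_address'].shift(-1)))]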
First output:
>>> df1
router mac_address date
589455 15001391 00:00:34:1a:03:e8 2021-01-01 22:09:34
590136 17091236 00:00:34:1a:03:e8 2021-01-01 22:11:04
635434 15001391 00:00:78:01:0d:11 2021-01-02 00:14:54
636479 17091211 00:00:78:01:0d:11 2021-01-02 00:16:17
949873 17091172 00:00:af:82:56:93 2021-01-02 11:26:39
950703 17091258 00:00:af:82:56:93 2021-01-02 11:30:59
619384 17091174 00:01:09:d2:09:e0 2021-01-01 23:34:32
365351 17091211 00:01:d2:7c:4e:32 2021-01-01 14:27:58
109858 17091236 00:02:75:86:4e:34 2021-01-01 05:50:47
110281 17091211 00:02:75:86:4e:34 2021-01-01 05:50:54
Second output:
>>> df1.loc[df1.duplicated('mac_address', keep=False)]
router mac_address date
589455 15001391 00:00:34:1a:03:e8 2021-01-01 22:09:34
590136 17091236 00:00:34:1a:03:e8 2021-01-01 22:11:04
635434 15001391 00:00:78:01:0d:11 2021-01-02 00:14:54
636479 17091211 00:00:78:01:0d:11 2021-01-02 00:16:17
949873 17091172 00:00:af:82:56:93 2021-01-02 11:26:39
950703 17091258 00:00:af:82:56:93 2021-01-02 11:30:59
109858 17091236 00:02:75:86:4e:34 2021-01-01 05:50:47
110281 17091211 00:02:75:86:4e:34 2021-01-01 05:50:54
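A self-contained sketch of the same shift comparison, run on a few rows from the question. The last step, which reduces date to a time-of-day string as in the asker's second desired output, is an addition that is not part of the answer above; it assumes date is a datetime column, and `only_time` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "router": [15001391, 17091211, 17091172, 17091251, 17091258, 17091211],
        "mac_address": [
            "00:00:78:01:0d:11", "00:00:78:01:0d:11",
            "00:00:af:82:56:93", "00:00:af:82:56:93", "00:00:af:82:56:93",
            "00:01:d2:7c:4e:32",
        ],
        "date": pd.to_datetime([
            "2021-01-02 00:14:54", "2021-01-02 00:16:17",
            "2021-01-02 11:26:39", "2021-01-02 11:27:59", "2021-01-02 11:30:59",
            "2021-01-01 14:27:58",
        ]),
    },
    index=[635434, 636479, 949873, 950699, 950703, 365351],
)

# The shift comparison relies on each MAC forming one contiguous run.
df = df.sort_values(by=["mac_address", "date"])

mac = df["mac_address"]
# Keep rows whose MAC differs from the previous row (start of a run)
# or from the next row (end of a run).
df1 = df.loc[mac.ne(mac.shift()) | mac.ne(mac.shift(-1))]

# Second output: keep only MACs that still have two rows (first and last).
df2 = df1.loc[df1.duplicated("mac_address", keep=False)]

# Assumption: show only the time of day, as in the question's second output.
only_time = df2.assign(date=df2["date"].dt.strftime("%H:%M:%S"))
print(only_time)
```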