pandas groupby on multiple columns, keeping the most frequent value in another column

Question

I have a table that looks like this:

	lon	lat	output
4050	-47.812224	-19.043365	1890.283215
5149	-47.812224	-19.043365	1890.283215
7316	-47.812224	-19.043365	1890.283215
8406	-47.812224	-19.043365	1890.283215
511	-47.812014	-19.007094	1813.785728
1555	-47.812014	-19.007094	1813.785728
3764	-47.812014	-19.007094	1821.363582
4846	-47.812014	-19.007094	1813.785728
29	-47.811177	-19.008053	1763.091936
1114	-47.811177	-19.008053	1763.091936
3262	-47.811177	-19.008053	1763.091936
4357	-47.811177	-19.008053	1763.091936
1436	-47.774424	-19.008700	2172.781911
2557	-47.774424	-19.008700	2174.394848
4725	-47.774424	-19.008700	2172.781911
5840	-47.774424	-19.008700	2172.781911
5211	-47.774166	-19.043847	2897.092502
6313	-47.774166	-19.043847	2897.092502
8460	-47.774166	-19.043847	2897.092502
9543	-47.774166	-19.043847	2897.092502
1679	-47.773958	-19.007574	2179.670924
2770	-47.773958	-19.007574	2179.670924
4998	-47.773958	-19.007574	2179.670924
6088	-47.773958	-19.007574	2179.670924
1937	-47.773121	-19.008533	2236.769862
3004	-47.773121	-19.008533	2236.769862
5231	-47.773121	-19.008533	2236.769862
6332	-47.773121	-19.008533	2236.769862

I want to remove the duplicates by grouping on lon and lat, while keeping the most frequently repeated value in the output column.

For example:

lon	lat	output
-47.812224	-19.043365	1890.283215
-47.812014	-19.007094	1813.785728
-47.811177	-19.008053	1763.091936
-47.774424	-19.008700	2172.781911
-47.774166	-19.043847	2897.092502
-47.773958	-19.007574	2179.670924
-47.773121	-19.008533	2236.769862

Can anyone tell me how to do this?

Answer 1

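You can combine .groupby with Series.mode:

# Per (lon, lat) group, take the most frequent output value (the first mode on ties)
result = df.groupby(["lon", "lat"])["output"].apply(lambda s: s.mode()[0])
print(result.reset_index())

Prints: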

         lon        lat       output
0 -47.812224 -19.043365  1890.283215
1 -47.812014 -19.007094  1813.785728
2 -47.811177 -19.008053  1763.091936
3 -47.774424 -19.008700  2172.781911
4 -47.774166 -19.043847  2897.092502
5 -47.773958 -19.007574  2179.670924
6 -47.773121 -19.008533  2236.769862
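
For a self-contained check of the approach above, here is a minimal sketch on a six-row subset of the question's data. The names `df` and `result` are illustrative, and `.agg` with `.iloc[0]` is used here in place of `.apply` with `[0]`. `Series.mode` returns every tied value in ascending order, so taking the first element picks the smallest value when a group has a tie.

```python
import pandas as pd

# Subset of the question's data: the first group contains one dissenting value
df = pd.DataFrame(
    {
        "lon": [-47.812014] * 4 + [-47.811177] * 2,
        "lat": [-19.007094] * 4 + [-19.008053] * 2,
        "output": [1813.785728, 1813.785728, 1821.363582, 1813.785728,
                   1763.091936, 1763.091936],
    },
    index=[511, 1555, 3764, 4846, 29, 1114],
)

# Most frequent output per (lon, lat); iloc[0] takes the first mode on ties
result = (
    df.groupby(["lon", "lat"])["output"]
    .agg(lambda s: s.mode().iloc[0])
    .reset_index()
)
print(result)
```

The value 1821.363582, which occurs only once in its group, is dropped in favour of 1813.785728.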

Answer 2

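Instead of Andrej's approach of passing a lambda to .apply, which is evaluated once per group, we can pass the function to the groupby aggregation method.

While .apply does solve the problem, it tends to become slow on large datasets because the Python function is not vectorized.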

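Also note that reset_index(inplace=True) returns None, so chaining it onto the expression discards the result, and inplace generally brings no speed benefit. Call reset_index() without inplace and keep the DataFrame it returns. Timed in an IPython/Jupyter cell:

%%timeit
df.groupby(['lat', 'lon']).agg(pd.Series.mode).reset_index()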
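
If avoiding a per-group Python function is the goal, one possible alternative (not from either answer, just a sketch) is to count each (lon, lat, output) combination with `size()`, sort by the count, and keep the first row per (lon, lat) pair. The names `counts`, `most_common`, and the helper column `n` are assumptions made for illustration; which value wins on a tie is not pinned down in this sketch.

```python
import pandas as pd

# Subset of the question's data: the first group contains one dissenting value
df = pd.DataFrame(
    {
        "lon": [-47.774424] * 4 + [-47.773958] * 2,
        "lat": [-19.008700] * 4 + [-19.007574] * 2,
        "output": [2172.781911, 2174.394848, 2172.781911, 2172.781911,
                   2179.670924, 2179.670924],
    }
)

# Count how often each (lon, lat, output) combination occurs
counts = df.groupby(["lon", "lat", "output"]).size().reset_index(name="n")

# Put the most frequent output first within each (lon, lat) pair,
# then keep only that first row per pair
most_common = (
    counts.sort_values(["lon", "lat", "n"], ascending=[True, True, False])
    .drop_duplicates(subset=["lon", "lat"])
    .drop(columns="n")
    .reset_index(drop=True)
)
print(most_common)
```

Whether this is actually faster than the `.apply` or `.agg` versions depends on the data, so it is worth timing all of them on the real dataset.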

Hope this helps!

Tags: duplicates, pandas, pandas-groupby