How to keep a specific duplicate from among duplicates?
I have a .csv file.
time,open,high,low,close,Extremum,Fib 1,Fib 2,Fib 3,l100,LS3,SS3,Volume,Volume MA
It has many rows, for example:
2022-04-08T02:00:00+02:00,43.431,43.44,43.431,43.44,44.669,43.58332033414956,43.28818411430672,43.11250779297169,42.91223678664976,,,78.07,
And there are duplicates, e.g. a group of 4 of them that differ in the 'Extremum' column, like this:
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,41.589,42.64812186602502,42.93603848979882,43.10741743252131,43.30278942722496,,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
2022-04-07 17:10:25,41.622,41.625,41.622,41.625,43.6,42.38191401399852,42.05078384304666,41.85368255081341,41.6289870776675,41.007714285714286,,6.99,571.0029999999954
It is sorted by 'time' along axis=0 (that is column A in the spreadsheet, column 0 in the calculation):
csvData.sort_values(by=["time"],axis=0,ascending=True,inplace=True,na_position='first')
One time, 17:10:25, is duplicated 4 times. How do I throw away the one that doesn't match?
Here we have: 41.589, 43.6, 43.6, 43.6.
The 41.589 row is wrong and needs to go, and I only need to keep 1 of the remaining 3 dupes. drop_duplicates can do the reduction, but it doesn't give me all 4 dupes to work with; it only has 3 modes: keep='first', keep='last' or keep=False, and there is no keep=True. I need all 4 dupes returned, so I can check which 1 of the 4 is bad, before reducing the group to a single row, in this case the correct 43.6. Does anyone know how to achieve this?
I have seen some ideas on Stack Overflow, but I don't understand them well enough to apply them to my case, so I'm asking for help.
You can use duplicated twice with two different modes: keep=False and another mode of your choice. Then compute a boolean mask from the two for slicing.
Assuming this example dataset:
date col other
0 a a 0
1 a a 1
2 a X 2 # unique
3 a a 3
4 b Y 4 # unique
5 b b 5
6 b b 6
7 b b 7
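For reference, here is a minimal sketch to construct that example frame (the values are copied straight from the table above):

import pandas as pd

# example frame matching the table above
df = pd.DataFrame({
    'date':  ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'col':   ['a', 'a', 'X', 'a', 'Y', 'b', 'b', 'b'],
    'other': range(8),
})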
You can use:
m1 = df.duplicated(subset=['date','col'])              # keep='first' by default: True for all but the first of each group
m2 = df.duplicated(subset=['date','col'], keep=False)  # True for every member of a duplicated group
df2 = df[m1!=m2]                                        # m1!=m2 is True only for the first row of each duplicated group
Output:
date col other
0 a a 0
5 b b 5
Intermediates:
date col other m1 m2 m1!=m2
0 a a 0 False True True
1 a a 1 True True False
2 a X 2 False False False
3 a a 3 True True False
4 b Y 4 False False False
5 b b 5 False True True
6 b b 6 True True False
7 b b 7 True True False
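Applied to your CSV, this would look roughly like the sketch below. It assumes two rows count as duplicates only when every column matches (so the 3 identical 43.6 rows form a group while the odd 41.589 row does not), and the file name data.csv is just a placeholder:

import pandas as pd

csvData = pd.read_csv('data.csv')   # placeholder file name
csvData.sort_values(by=["time"], axis=0, ascending=True, inplace=True, na_position='first')

m1 = csvData.duplicated()            # keep='first' by default, compared on all columns
m2 = csvData.duplicated(keep=False)  # flags every member of a duplicated group
result = csvData[m1 != m2]           # keeps only the first row of each duplicated group

Note that, as in the example above, this keeps only rows that actually belong to a duplicated group; rows without an exact duplicate elsewhere in the file are dropped as well, so check whether that is acceptable for your data.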