How can I delete the remaining duplicate rows while keeping the first and last row of each group, based on Column A?
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [12, 12, 12, 15, 16, 141, 141, 141, 141],
    'Column B': ['Apple', 'Apple', 'Apple', 'Red', 'Blue', 'Yellow', 'Yellow', 'Yellow', 'Yellow'],
    'Column C': [100, 50, np.nan, 23, np.nan, 199, np.nan, 1, np.nan],
})
Or, as a table:
    | Column A | Column B | Column C
----|----------|----------|---------
0   | 12       | Apple    | 100
1   | 12       | Apple    | 50
2   | 12       | Apple    | NaN
3   | 15       | Red      | 23
4   | 16       | Blue     | NaN
5   | 141      | Yellow   | 199
6   | 141      | Yellow   | NaN
7   | 141      | Yellow   | 1
8   | 141      | Yellow   | NaN
The expected result is:
    | Column A | Column B | Column C
----|----------|----------|---------
0   | 12       | Apple    | 100
2   | 12       | Apple    | NaN
3   | 15       | Red      | 23
4   | 16       | Blue     | NaN
5   | 141      | Yellow   | 199
8   | 141      | Yellow   | NaN
df.drop_duplicates(subset=['Column A'], keep='first')
Then do the same for the last row:
df.drop_duplicates(subset=['Column A'], keep='last')
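Each of the two calls above keeps only one row per value of Column A, so neither gives the desired result alone. Below is a minimal sketch of one way to combine them, assuming the DataFrame from the question. It relies on `DataFrame.duplicated`, which returns the boolean mask that `drop_duplicates` acts on: a row is a "middle" duplicate only when it is flagged under both `keep='first'` and `keep='last'`. The names `not_first`, `not_last` and `result` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [12, 12, 12, 15, 16, 141, 141, 141, 141],
    'Column B': ['Apple', 'Apple', 'Apple', 'Red', 'Blue',
                 'Yellow', 'Yellow', 'Yellow', 'Yellow'],
    'Column C': [100, 50, np.nan, 23, np.nan, 199, np.nan, 1, np.nan],
})

# True for every row that is NOT the first occurrence of its 'Column A' value
not_first = df.duplicated('Column A', keep='first')
# True for every row that is NOT the last occurrence of its 'Column A' value
not_last = df.duplicated('Column A', keep='last')

# Drop only the rows that are neither first nor last within their group
result = df[~(not_first & not_last)]
print(result)
```

Because this only filters rows, the original index and row order are preserved, and a value that occurs once is kept once.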
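Here is one possible way to get what you want:
result = (
    pd.concat([
        df.drop_duplicates('Column A', keep='first'),
        df.drop_duplicates('Column A', keep='last'),
    ])
    .reset_index()             # expose the original row labels as an 'index' column
    .drop_duplicates('index')  # single-row groups appear in both halves; keep one copy
    .sort_values('index')      # restore the original row order
    .set_index('index')
    .rename_axis(None)
)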
Result:
   Column A Column B  Column C
0        12    Apple     100.0
2        12    Apple       NaN
3        15      Red      23.0
4        16     Blue       NaN
5       141   Yellow     199.0
8       141   Yellow       NaN
One option is to use groupby together with nth:
df.groupby('Column A', sort=False, as_index=False).nth([0, -1])
   Column A Column B  Column C
0        12    Apple     100.0
2        12    Apple       NaN
3        15      Red      23.0
4        16     Blue       NaN
5       141   Yellow     199.0
8       141   Yellow       NaN
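To make explicit which rows a first-and-last selection picks, the same idea can be sketched with `cumcount`, which numbers the rows inside each group. This is an alternative formulation for illustration, not a description of how `nth` works internally; it assumes the DataFrame from the question, and the names `from_start`, `from_end` and `result` are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Column A': [12, 12, 12, 15, 16, 141, 141, 141, 141],
    'Column B': ['Apple', 'Apple', 'Apple', 'Red', 'Blue',
                 'Yellow', 'Yellow', 'Yellow', 'Yellow'],
    'Column C': [100, 50, np.nan, 23, np.nan, 199, np.nan, 1, np.nan],
})

groups = df.groupby('Column A', sort=False)

# Position of each row counted from the start of its group (0 = first row)
from_start = groups.cumcount()
# Position of each row counted from the end of its group (0 = last row)
from_end = groups.cumcount(ascending=False)

# Keep a row if it is the first or the last row of its group
result = df[(from_start == 0) | (from_end == 0)]
print(result)
```

A group with a single row has position 0 from both ends, so it is kept exactly once.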