Pandas drop duplicates subset with timestamp
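I'm trying to drop duplicates by subset, but whatever I do the result is always the same: nothing changes. Help me understand what I'm doing wrong. Code: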
import pandas as pd
test_df = pd.DataFrame(
{
'city': ['Cincinnati', 'San Francisco', 'Chicago', 'Chicago', 'Chicago', 'Chigaco'],
'timestamp': ['2014-03-01 00:01:00', '2014-05-01 09:11:00', '2014-01-01 15:22:00', '2014-01-01 15:59:00', '2014-01-01 23:01:00', '2014-01-01 23:01:00']
}
)
test_df = test_df.astype({'timestamp':'datetime64[ns]'})
test_df = test_df.sort_values(by=['city', 'timestamp'], ascending=False)
test_df = test_df.drop_duplicates(subset=['city', 'timestamp'], keep="first")
print(test_df)
# What I get:
#             city           timestamp
# 1  San Francisco 2014-05-01 09:11:00
# 0     Cincinnati 2014-03-01 00:01:00
# 5        Chigaco 2014-01-01 23:01:00
# 4        Chicago 2014-01-01 23:01:00
# 3        Chicago 2014-01-01 15:59:00
# 2        Chicago 2014-01-01 15:22:00
# Expected result:
#             city           timestamp
# 1  San Francisco 2014-05-01 09:11:00
# 0     Cincinnati 2014-03-01 00:01:00
# 5        Chigaco 2014-01-01 23:01:00
# 3        Chicago 2014-01-01 15:59:00
# 2        Chicago 2014-01-01 15:22:00
import pandas as pd
test_df = pd.DataFrame(
{
'city': ['Cincinnati', 'San Francisco', 'Chicago', 'Chicago', 'Chicago', 'Chicago'],
'timestamp': ['2014-03-01 00:01:00', '2014-05-01 09:11:00', '2014-01-01 15:22:00', '2014-01-01 15:59:00', '2014-01-01 23:01:00', '2014-01-01 23:01:00']
}
)
test_df = test_df.astype({'timestamp':'datetime64[ns]'})
test_df = test_df.sort_values(by=['city', 'timestamp'], ascending=False)
test_df = test_df.drop_duplicates(subset=['city', 'timestamp'], keep="first")
print(test_df)
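Your data contains a typo: one row says 'Chigaco' instead of 'Chicago', so that row is not an exact duplicate of any other and nothing is dropped. The code above uses the corrected spelling.
Here is the result: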
            city           timestamp
1  San Francisco 2014-05-01 09:11:00
0     Cincinnati 2014-03-01 00:01:00
4        Chicago 2014-01-01 23:01:00
3        Chicago 2014-01-01 15:59:00
2        Chicago 2014-01-01 15:22:00
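To make the diagnosis concrete, here is a minimal sketch that first checks whether any (city, timestamp) pair actually repeats in the question's data, then corrects the spelling and deduplicates. The names `mask` and `fixed` are illustrative, and treating 'Chigaco' as a misspelling of 'Chicago' is an assumption taken from the answer above.

```python
import pandas as pd

# The question's data, including the misspelled 'Chigaco' in the last row.
test_df = pd.DataFrame(
    {
        'city': ['Cincinnati', 'San Francisco', 'Chicago', 'Chicago', 'Chicago', 'Chigaco'],
        'timestamp': ['2014-03-01 00:01:00', '2014-05-01 09:11:00', '2014-01-01 15:22:00',
                      '2014-01-01 15:59:00', '2014-01-01 23:01:00', '2014-01-01 23:01:00'],
    }
)
test_df['timestamp'] = pd.to_datetime(test_df['timestamp'])

# keep=False flags every member of a duplicated group.
# With the misspelling, no (city, timestamp) pair repeats, so nothing is flagged.
mask = test_df.duplicated(subset=['city', 'timestamp'], keep=False)
print(mask.any())

# Assumption: 'Chigaco' is a typo for 'Chicago'. Fix it, then deduplicate.
fixed = test_df.assign(city=test_df['city'].replace({'Chigaco': 'Chicago'}))
fixed = fixed.drop_duplicates(subset=['city', 'timestamp'], keep='first')
print(fixed)
```

Because `drop_duplicates` compares values exactly, checking `duplicated(...).any()` before dropping is a quick way to tell whether the subset contains any repeats at all.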
This works just as well as some of the other answers:
import pandas as pd
test_df = pd.DataFrame(
{
'city': ['Cincinnati', 'San Francisco', 'Chicago', 'Chicago', 'Chicago', 'Chicago'],
'timestamp': ['2014-03-01 00:01:00', '2014-05-01 09:11:00', '2014-01-01 15:22:00', '2014-01-01 15:59:00', '2014-01-01 23:01:00', '2014-01-01 23:01:00']
}
)
test_df = test_df.astype({'timestamp':'datetime64[ns]'})
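# Number the rows within each (city, timestamp) group, starting at 1
test_df['Check'] = test_df.sort_values(['city', 'timestamp'], ascending=[True, True]).groupby(['city', 'timestamp']).cumcount() + 1
# Keep only the first row of each group; assign the result back so the filter takes effect
test_df = test_df.loc[test_df['Check'] < 2]
test_df = test_df[['city', 'timestamp']]
test_df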
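The `cumcount` approach numbers the rows inside each (city, timestamp) group and keeps the first one. As a rough sketch of why that matches `drop_duplicates`, the snippet below builds the same filter two ways on a small frame and compares them. The frame `df` and the names `via_cumcount` and `via_duplicated` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Chicago', 'Chicago', 'Chicago'],
    'timestamp': pd.to_datetime(
        ['2014-01-01 15:59:00', '2014-01-01 23:01:00', '2014-01-01 23:01:00']
    ),
})

# cumcount() numbers rows within each group from 0, so the first row of a group gets 0.
first_in_group = df.groupby(['city', 'timestamp']).cumcount() == 0
via_cumcount = df[first_in_group]

# duplicated(keep='first') marks every repeat after the first occurrence as True.
via_duplicated = df[~df.duplicated(subset=['city', 'timestamp'], keep='first')]

# Both filters select the same rows.
print(via_cumcount.equals(via_duplicated))
```

The `duplicated` form avoids the temporary 'Check' column, which may be preferable when the row number itself is not needed.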