pandas: how to drop duplicated rows based on conditions

I have a DataFrame that you can reproduce with this code:

import numpy as np
import pandas as pd
from io import StringIO
dfs = """
    contract  RB RateCompany gs  IssueDate  ValIssueDate   ToDate1  FromDate1
84  GA16      77           T  G   19940701    19480101.0  197702.0   190001.0
85  GA16      77           T  G   19940701    19980101.0  999999.0   197703.0

"""
# Raw string so that \s is not treated as an invalid escape sequence
df = pd.read_csv(StringIO(dfs.strip()), sep=r'\s+',
                  dtype={"RB": int}
                  )
df

Output:

   contract  RB RateCompany gs  IssueDate  ValIssueDate   ToDate1  FromDate1
84     GA16  77           T  G   19940701    19480101.0  197702.0   190001.0
85     GA16  77           T  G   19940701    19980101.0  999999.0   197703.0

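In this DataFrame, each combination of contract and RB should be unique, i.e. only one row should remain per group. The row to keep is the one that satisfies: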

FromDate1<=df.IssueDate<=ToDate1

So I tried:

df = df[((df.duplicated(subset=["contract", "RB"], keep=False)) &
                 (df['IssueDate'] <= df['ToDate1']) &
                 (df['IssueDate'] >= df['FromDate1']))]

But the output is empty:

contract    RB  RateCompany gs  IssueDate   ValIssueDate    ToDate1 FromDate1

The expected output is:

   contract  RB RateCompany gs  IssueDate  ValIssueDate   ToDate1  FromDate1
85     GA16  77           T  G   19940701    19980101.0  999999.0   197703.0

Can anyone help?

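Divide your IssueDate by 100. This is probably a date-unit mismatch: IssueDate appears to be in YYYYMMDD form, while FromDate1 and ToDate1 appear to be in YYYYMM form: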

>>> df.loc[df['IssueDate'].div(100).between(df['FromDate1'], df['ToDate1'])]
   contract  RB RateCompany gs  IssueDate  ValIssueDate   ToDate1  FromDate1
85     GA16  77           T  G   19940701    19980101.0  999999.0   197703.0

Check:

>>> df.loc[85, ['IssueDate', 'FromDate1', 'ToDate1']].astype(int)

IssueDate    19940701
FromDate1      197703
ToDate1        999999
Name: 85, dtype: int64
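If the goal is to drop only the rows that are both duplicated and outside the date range, the check above can be combined with the duplicated() mask from the question. The following is a minimal sketch under that assumption, not necessarily what the asker intended; the names issue_month, in_range, is_dup and result are mine, and I use integer division (// 100) instead of .div(100) so the month stays an integer:

```python
import pandas as pd
from io import StringIO

# Same sample data as in the question.
dfs = """
    contract  RB RateCompany gs  IssueDate  ValIssueDate   ToDate1  FromDate1
84  GA16      77           T  G   19940701    19480101.0  197702.0   190001.0
85  GA16      77           T  G   19940701    19980101.0  999999.0   197703.0
"""
df = pd.read_csv(StringIO(dfs.strip()), sep=r'\s+')

# Assumption: IssueDate is YYYYMMDD, so dropping the last two digits gives YYYYMM.
issue_month = df['IssueDate'] // 100

# True where FromDate1 <= issue month <= ToDate1 (both bounds inclusive).
in_range = issue_month.between(df['FromDate1'], df['ToDate1'])

# True for every row whose (contract, RB) pair occurs more than once.
is_dup = df.duplicated(subset=['contract', 'RB'], keep=False)

# Keep rows that are not duplicated at all, plus duplicated rows inside the range.
result = df[~is_dup | in_range]
print(result)
```

With the sample data only row 85 remains. The ~is_dup term is there so that a (contract, RB) pair appearing only once would be kept even if its dates fall outside the range; drop that term if such rows should be filtered too.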

Let's look at your attempt first:

df = df[((df.duplicated(subset=["contract", "RB"], keep=False)) &
                 (df['IssueDate'] <= df['ToDate1']) &
                 (df['IssueDate'] >= df['FromDate1']))]

As written, this code gives you an empty result, probably because the columns you compare differ in data type or format.
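One way to see which part of the filter removes everything is to evaluate each condition on its own. This is only a diagnostic sketch: the DataFrame is rebuilt by hand from the relevant columns of the question's sample, and the names checks, le_to and ge_from are mine:

```python
import pandas as pd

# Relevant columns of the question's sample, rebuilt by hand.
df = pd.DataFrame(
    {
        "contract": ["GA16", "GA16"],
        "RB": [77, 77],
        "IssueDate": [19940701, 19940701],
        "ToDate1": [197702.0, 999999.0],
        "FromDate1": [190001.0, 197703.0],
    },
    index=[84, 85],
)

# One boolean column per condition of the original filter.
checks = pd.DataFrame(
    {
        "is_dup": df.duplicated(subset=["contract", "RB"], keep=False),
        "le_to": df["IssueDate"] <= df["ToDate1"],
        "ge_from": df["IssueDate"] >= df["FromDate1"],
    }
)
print(checks)
```

Here is_dup and ge_from are True for both rows, but le_to is False for both: the eight-digit IssueDate (19940701) is larger than any six-digit ToDate1, even 999999, so the combined mask can never be True.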

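First, use df.dtypes to make sure the columns you want to compare have the same data type. If the values you are comparing are dates, compare them in the same format. For example, a value in "YYYYMM" format should only be compared with another value in "YYYYMM" format.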
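The advice above could be sketched as follows. I am assuming that IssueDate is YYYYMMDD and that FromDate1 and ToDate1 are YYYYMM; the data is rebuilt by hand from the question, and issue_dt, issue_yyyymm and mask are names I made up. Since 999999 looks like an open-ended sentinel rather than a real month, the bounds are left as numbers and only IssueDate is converted:

```python
import pandas as pd

# Relevant columns of the question's sample, rebuilt by hand.
df = pd.DataFrame(
    {
        "IssueDate": [19940701, 19940701],
        "ToDate1": [197702.0, 999999.0],
        "FromDate1": [190001.0, 197703.0],
    },
    index=[84, 85],
)

# Inspect the dtypes first: IssueDate is int64, the two bounds are float64.
print(df.dtypes)

# Parse IssueDate as a real date (assumed format: YYYYMMDD) ...
issue_dt = pd.to_datetime(df["IssueDate"].astype(str), format="%Y%m%d")

# ... and express it as a YYYYMM number, the same form as the bounds.
issue_yyyymm = issue_dt.dt.year * 100 + issue_dt.dt.month

# Now both sides of the comparison use the same format.
mask = issue_yyyymm.between(df["FromDate1"], df["ToDate1"])
print(df[mask])
```

Parsing through pd.to_datetime has the side benefit of raising an error on a malformed IssueDate instead of silently comparing nonsense, which plain division by 100 would not do.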