pandas: how to drop duplicated rows based on conditions
I have a df, which you can create with this code:
import numpy as np
import pandas as pd
from io import StringIO
dfs = """
contract RB RateCompany gs IssueDate ValIssueDate ToDate1 FromDate1
84 GA16 77 T G 19940701 19480101.0 197702.0 190001.0
85 GA16 77 T G 19940701 19980101.0 999999.0 197703.0
"""
df = pd.read_csv(StringIO(dfs.strip()), sep=r'\s+',
dtype={"RB": int}
)
df
Output:
contract RB RateCompany gs IssueDate ValIssueDate ToDate1 FromDate1
84 GA16 77 T G 19940701 19480101.0 197702.0 190001.0
85 GA16 77 T G 19940701 19980101.0 999999.0 197703.0
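In this df, each group of contract and RB should be unique, meaning only one row should remain per group. The condition for the row to keep is:
FromDate1<=df.IssueDate<=ToDate1
So I tried: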
df = df[((df.duplicated(subset=["contract", "RB"], keep=False)) &
(df['IssueDate'] <= df['ToDate1']) &
(df['IssueDate'] >= df['FromDate1']))]
But the output is empty:
contract RB RateCompany gs IssueDate ValIssueDate ToDate1 FromDate1
The expected output should be:
contract RB RateCompany gs IssueDate ValIssueDate ToDate1 FromDate1
85 GA16 77 T G 19940701 19980101.0 999999.0 197703.0
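Can anyone help?
Divide your IssueDate by 100. This looks like a date unit mismatch: IssueDate appears to be in YYYYMMDD form, while FromDate1 and ToDate1 appear to be in YYYYMM form: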
>>> df.loc[df['IssueDate'].div(100).between(df['FromDate1'], df['ToDate1'])]
contract RB RateCompany gs IssueDate ValIssueDate ToDate1 FromDate1
85 GA16 77 T G 19940701 19980101.0 999999.0 197703.0
Check the values:
>>> df.loc[85, ['IssueDate', 'FromDate1', 'ToDate1']].astype(int)
IssueDate 19940701
FromDate1 197703
ToDate1 999999
Name: 85, dtype: int64
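The idea above can be combined with the duplicate check from the question. The following is only a sketch: it assumes IssueDate is YYYYMMDD and FromDate1/ToDate1 are YYYYMM (inferred from the sample values), and the names issue_month, in_range, is_dup and result are made up for illustration. It also assumes that rows which are not duplicated at all should be kept regardless of the date condition.

```python
import pandas as pd
from io import StringIO

data = """
contract RB RateCompany gs IssueDate ValIssueDate ToDate1 FromDate1
84 GA16 77 T G 19940701 19480101.0 197702.0 190001.0
85 GA16 77 T G 19940701 19980101.0 999999.0 197703.0
"""
df = pd.read_csv(StringIO(data.strip()), sep=r"\s+", dtype={"RB": int})

# Assumption: IssueDate is YYYYMMDD, so integer division by 100 yields YYYYMM.
issue_month = df["IssueDate"] // 100

# True where the issue month falls inside [FromDate1, ToDate1] (inclusive).
in_range = issue_month.between(df["FromDate1"], df["ToDate1"])

# True for every row whose (contract, RB) pair occurs more than once.
is_dup = df.duplicated(subset=["contract", "RB"], keep=False)

# Keep non-duplicated rows as they are; among duplicates keep only rows in range.
result = df[~is_dup | in_range]
print(result)
```

Using `//` instead of `.div(100)` keeps the month as an integer, which makes the intermediate values easier to inspect; both give the same filter on this sample.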
Let's first discuss your attempt:
df = df[((df.duplicated(subset=["contract", "RB"], keep=False)) &
(df['IssueDate'] <= df['ToDate1']) &
(df['IssueDate'] >= df['FromDate1']))]
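This code returns an empty result. One possible cause is that the columns have different data types.
First use df.dtypes to make sure the columns you are comparing have the same data type. If the data you are comparing are dates, compare them in the same format. For example, a value in "YYYYMM" format should be compared with another date in "YYYYMM" format. Here IssueDate (19940701) has eight digits while FromDate1 and ToDate1 have six, so IssueDate is always greater than ToDate1 and no row passes the filter.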
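As a rough illustration of that advice, the sketch below inspects the dtypes and then brings the three date columns to one common YYYYMM integer form before comparing. The helper frame yyyymm and its column names are hypothetical, and truncating IssueDate to its first six digits assumes it is stored as YYYYMMDD.

```python
import pandas as pd

# Only the three date columns from the question, rebuilt by hand for brevity.
df = pd.DataFrame(
    {
        "IssueDate": [19940701, 19940701],
        "FromDate1": [190001.0, 197703.0],
        "ToDate1": [197702.0, 999999.0],
    },
    index=[84, 85],
)

# IssueDate is int64, while FromDate1 and ToDate1 are float64.
print(df.dtypes)

# Bring all three columns to the same YYYYMM integer form.
yyyymm = pd.DataFrame(
    {
        "Issue": df["IssueDate"].astype(str).str[:6].astype(int),
        "From": df["FromDate1"].astype(int),
        "To": df["ToDate1"].astype(int),
    }
)

# Compare like with like: From <= Issue <= To.
mask = (yyyymm["From"] <= yyyymm["Issue"]) & (yyyymm["Issue"] <= yyyymm["To"])
filtered = df[mask]
print(filtered)
```

Converting to real datetimes is not attempted here because ToDate1 contains the sentinel 999999, which is not a valid year and month.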