R data frame filtering: using unique or duplicated based on a time factor

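I am trying to filter a data frame of loan data. Each monthly report repeats a loan if it is still outstanding, or drops the loan once it has been paid, so I cannot simply take the latest monthly report. I want to filter for the unique maturity dates of loans by lender, remove the duplicates, and keep only the data from the latest reporting date. Here is a sample of the data: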

df <- data.frame(Reporting.date=c("6/30/2020","6/30/2020","6/30/2020","8/31/2021","8/31/2021"
                                  ,"8/31/2021","6/30/2020","7/31/2021","5/31/2020","12/31/2020")
                 , Lender.name=c("Lender1","Lender1","Lender1","Lender1","Lender1","Lender1"
                                 ,"Lender1","Lender1","Lender2","Lender2")
                 , Date.of.maturity=c("6/20/2025","6/20/2025","6/20/2025","6/20/2025","6/20/2025"
                                      ,"6/20/2025","6/30/2022","6/30/2022","5/15/2024","5/15/2024")
                 , Loan.amount=c(13129474,14643881,44935677,13129474,14643881,44935677
                                 ,150000,150000,2750000,2750000))

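As you can see from the sample data, Lender1 has 2 different maturity dates. The first maturity date has 3 loans that are repeated across 2 reporting dates, and the second maturity date has 1 loan that is repeated. I want to remove the duplicates and keep the latest reported data. I would like to end up with a data frame that looks like this: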

Reporting.date Lender.name Date.of.maturity Loan.amount
8/31/2021 Lender1 6/20/2025 13129474
8/31/2021 Lender1 6/20/2025 14643881
8/31/2021 Lender1 6/20/2025 44935677
7/31/2021 Lender1 6/30/2022 150000
12/31/2020 Lender2 5/15/2024 2750000

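You need to convert Reporting.date to Date format first, either in mutate (as I did here) or directly in filter. Left as character strings, the dates would be compared alphabetically rather than chronologically.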

library(tidyverse)

df %>%
  mutate(Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  filter(Reporting.date == max(Reporting.date)) %>%
  ungroup()
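The "directly in filter" variant mentioned above could look roughly like the following sketch. It rebuilds the sample data from the question so that it runs on its own; parse_mdy is a hypothetical helper name introduced only for this illustration. Because the date is parsed only for the comparison, Reporting.date stays a character column in the result.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c(rep("Lender1", 8), rep("Lender2", 2)),
  Date.of.maturity = c(rep("6/20/2025", 6), rep("6/30/2022", 2), rep("5/15/2024", 2)),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)

# Hypothetical helper: parse month/day/year strings into Date values
parse_mdy <- function(x) as.Date(x, format = "%m/%d/%Y")

latest <- df %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  # Convert only inside the comparison; the stored column is left untouched
  filter(parse_mdy(Reporting.date) == max(parse_mdy(Reporting.date))) %>%
  ungroup()

latest
```

If the Date column is needed later anyway, converting once in mutate, as in the answer above, avoids parsing the strings twice.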

We can also use arrange:

library(dplyr)
library(lubridate)
df %>%
  arrange(Lender.name, Date.of.maturity, Loan.amount, 
         desc(mdy(Reporting.date))) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  slice_head(n = 1) %>%
  ungroup

-output

# A tibble: 5 x 4
  Reporting.date Lender.name Date.of.maturity Loan.amount
  <chr>          <chr>       <chr>                  <dbl>
1 8/31/2021      Lender1     6/20/2025           13129474
2 8/31/2021      Lender1     6/20/2025           14643881
3 8/31/2021      Lender1     6/20/2025           44935677
4 7/31/2021      Lender1     6/30/2022             150000
5 12/31/2020     Lender2     5/15/2024            2750000
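A possible alternative to arrange plus slice_head, not taken from the answer above, is dplyr's slice_max (available from dplyr 1.0.0), which picks the rows with the largest value per group without sorting first. The sketch below rebuilds the question's sample data so that it is self-contained. Unlike slice_head(n = 1), slice_max keeps tied rows by default, and here Reporting.date ends up as a Date column.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c(rep("Lender1", 8), rep("Lender2", 2)),
  Date.of.maturity = c(rep("6/20/2025", 6), rep("6/30/2022", 2), rep("5/15/2024", 2)),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)

latest_max <- df %>%
  # Parse the strings so that the maximum is chronological, not alphabetical
  mutate(Reporting.date = as.Date(Reporting.date, format = "%m/%d/%Y")) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  # Keep the row(s) with the most recent reporting date in each group
  slice_max(Reporting.date, n = 1) %>%
  ungroup()

latest_max
```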

A base R option using subset, transform, and ave:

subset(transform(df, Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')), 
       Reporting.date == ave(Reporting.date, Lender.name, Date.of.maturity, FUN = max))

#   Reporting.date Lender.name Date.of.maturity Loan.amount
#4      2021-08-31     Lender1        6/20/2025    13129474
#5      2021-08-31     Lender1        6/20/2025    14643881
#6      2021-08-31     Lender1        6/20/2025    44935677
#8      2021-07-31     Lender1        6/30/2022      150000
#10     2020-12-31     Lender2        5/15/2024     2750000
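Note that the ave call above groups by lender and maturity date only, while the dplyr answers also group by Loan.amount. On the sample data both give the same rows, but they would differ if one loan in a lender/maturity group were missing from that group's latest report: the two-key version would drop it, the three-key version would keep its last reported row. Which behaviour is wanted depends on the use case. A sketch of a base R version that matches the dplyr grouping, assuming a single pasted key (loan_key is a name introduced here for illustration), might look like this:

```r
# Sample data copied from the question
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c(rep("Lender1", 8), rep("Lender2", 2)),
  Date.of.maturity = c(rep("6/20/2025", 6), rep("6/30/2022", 2), rep("5/15/2024", 2)),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)

# Parse the reporting date once so that it compares chronologically
df$Reporting.date <- as.Date(df$Reporting.date, format = "%m/%d/%Y")

# One key per (lender, maturity, amount) combination; a single pasted key
# means ave() never sees an empty combination of grouping levels
loan_key <- paste(df$Lender.name, df$Date.of.maturity, df$Loan.amount, sep = "|")

# Latest reporting date of each loan, repeated for every row of that loan
latest_date <- ave(df$Reporting.date, loan_key, FUN = max)

# Keep only the rows that carry their loan's latest reporting date
result <- df[df$Reporting.date == latest_date, ]
result
```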