R Dataframe filtering: Using unique or duplicate function based on time factor
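I am trying to filter a data frame of loan data. Each monthly report repeats a loan while it is still outstanding and drops it once it has been paid off, so I can't simply use the latest monthly report. I would like to filter on the unique maturity dates of the loans by lender, remove the duplicates, and keep only the rows from the latest reporting date. Here is a sample of the data: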
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c("Lender1", "Lender1", "Lender1", "Lender1", "Lender1", "Lender1",
                  "Lender1", "Lender1", "Lender2", "Lender2"),
  Date.of.maturity = c("6/20/2025", "6/20/2025", "6/20/2025", "6/20/2025", "6/20/2025",
                       "6/20/2025", "6/30/2022", "6/30/2022", "5/15/2024", "5/15/2024"),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)
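As you can see from the sample data, Lender1 has 2 different maturity dates. The first maturity date has 3 loans that are repeated across 2 reporting dates, and the second maturity date has 1 loan that is repeated. I would like to remove the duplicates and keep the most recently reported data. I am hoping to end up with a data frame that looks like this: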
Reporting.date | Lender.name | Date.of.maturity | Loan.amount |
---|---|---|---|
8/31/2021 | Lender1 | 6/20/2025 | 13129474 |
8/31/2021 | Lender1 | 6/20/2025 | 14643881 |
8/31/2021 | Lender1 | 6/20/2025 | 44935677 |
7/31/2021 | Lender1 | 6/30/2022 | 150000 |
12/31/2020 | Lender2 | 5/15/2024 | 2750000 |
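You need to convert Reporting.date to a date format, either in mutate (as I did) or directly in filter. Otherwise max() compares the values as character strings rather than as dates.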
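library(tidyverse)
df %>%
  # Parse the month/day/year strings so that max() compares real dates
  mutate(Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')) %>%
  # One group per loan: same lender, maturity date and amount
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  filter(Reporting.date == max(Reporting.date)) %>%
  ungroup()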
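To see why the conversion matters, here is a minimal base R sketch using the two Lender2 reporting dates from the sample data. The variable names `dates_chr` and `dates` are only illustrative.

```r
# Reporting dates for Lender2 in the sample data, stored as character strings
dates_chr <- c("5/31/2020", "12/31/2020")

# Character comparison is lexical: "5" sorts after "1", so the earlier date wins
max(dates_chr)
# "5/31/2020"

# After conversion, max() compares calendar dates
dates <- as.Date(dates_chr, format = "%m/%d/%Y")
max(dates)
# "2020-12-31"
```

Without the conversion, the filter would keep the 5/31/2020 row for Lender2 instead of the 12/31/2020 row.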
We can also use arrange and then keep the first row of each group:
library(dplyr)
library(lubridate)
df %>%
  arrange(Lender.name, Date.of.maturity, Loan.amount,
          desc(mdy(Reporting.date))) %>%
  group_by(Lender.name, Date.of.maturity, Loan.amount) %>%
  slice_head(n = 1) %>%
  ungroup()
-output
# A tibble: 5 x 4
  Reporting.date Lender.name Date.of.maturity Loan.amount
  <chr>          <chr>       <chr>                  <dbl>
1 8/31/2021      Lender1     6/20/2025           13129474
2 8/31/2021      Lender1     6/20/2025           14643881
3 8/31/2021      Lender1     6/20/2025           44935677
4 7/31/2021      Lender1     6/30/2022             150000
5 12/31/2020     Lender2     5/15/2024            2750000
A base R option using subset, transform and ave -
subset(transform(df, Reporting.date = as.Date(Reporting.date, format = '%m/%d/%Y')),
       Reporting.date == ave(Reporting.date, Lender.name, Date.of.maturity, FUN = max))
#   Reporting.date Lender.name Date.of.maturity Loan.amount
#4      2021-08-31     Lender1        6/20/2025    13129474
#5      2021-08-31     Lender1        6/20/2025    14643881
#6      2021-08-31     Lender1        6/20/2025    44935677
#8      2021-07-31     Lender1        6/30/2022      150000
#10     2020-12-31     Lender2        5/15/2024     2750000
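Since the question title asks about unique or duplicated, another base R possibility is to sort the rows newest report first and then drop repeats with duplicated(). This is only a sketch on the sample data. It assumes a loan is identified by lender, maturity date and amount, as in the dplyr answers, and the names `report_date`, `sorted`, `key` and `latest` are made up for illustration.

```r
df <- data.frame(
  Reporting.date = c("6/30/2020", "6/30/2020", "6/30/2020", "8/31/2021", "8/31/2021",
                     "8/31/2021", "6/30/2020", "7/31/2021", "5/31/2020", "12/31/2020"),
  Lender.name = c("Lender1", "Lender1", "Lender1", "Lender1", "Lender1", "Lender1",
                  "Lender1", "Lender1", "Lender2", "Lender2"),
  Date.of.maturity = c("6/20/2025", "6/20/2025", "6/20/2025", "6/20/2025", "6/20/2025",
                       "6/20/2025", "6/30/2022", "6/30/2022", "5/15/2024", "5/15/2024"),
  Loan.amount = c(13129474, 14643881, 44935677, 13129474, 14643881, 44935677,
                  150000, 150000, 2750000, 2750000)
)

# Parse the reporting dates so the ordering is chronological, not alphabetical
report_date <- as.Date(df$Reporting.date, format = "%m/%d/%Y")

# Put the most recent reports first
sorted <- df[order(report_date, decreasing = TRUE), ]

# Columns assumed to identify a single loan
key <- c("Lender.name", "Date.of.maturity", "Loan.amount")

# duplicated() flags every occurrence after the first one,
# so negating it keeps the newest row for each loan
latest <- sorted[!duplicated(sorted[key]), ]
latest
```

On the sample data this keeps original rows 4, 5, 6, 8 and 10, and Reporting.date stays a character column because the parsed dates are used only for ordering.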