Removing duplicates from one column based on conditions of another in R
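I have a dataset in which I am trying to count the number of retained customers by order period, but it contains many duplicates: the same customer ordered in multiple periods, so each order was added as a new entry (observation). Unfortunately, many of these entries share the same ID/person number, so I was wondering whether there is some kind of regex or filter I could use to check the retained column and then remove the duplicated ID/person number when the value in retained is the same.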
df <- tibble::tribble(
  ~PERSONUM, ~ID, ~ORDER_PERIOD, ~retained,
  10001685, 10109887, "201750", "Y",
  10001685, 10109887, "201850", "Y",
  10001685, 10109887, "201950", "Y",
  10005733, 10162571, "201550", "Y",
  10005787, 10112896, "201550", "Y",
  10005795, 10112901, "201550", "Y",
  10005795, 10112901, "201650", "Y",
  10005795, 10112901, "201750", "Y",
  10020043, 10156305, "202050", "Y",
  10020165, 10122910, "201750", "Y",
  10020165, 10122910, "201850", "Y",
  10020649, 10123585, "201550", "N",
  10028842, 10128545, "201750", "Y",
  52300090, 10147580, "201850", "N",
  52300740, 10149860, "201650", "N",
  52300749, 10135925, "201750", "Y",
  52300749, 10135925, "201850", "Y",
  52300917, 10140173, "201650", "Y",
  52300917, 10140173, "201750", "Y",
  52300917, 10140173, "201850", "Y"
)
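I was thinking of using df %>% filter(ID == ID), but obviously ID will always equal itself. I know there is the duplicated function, and I considered using something like
df_cleaned <- df[!duplicated(df), ]
but I need the code to apply some kind of condition that looks at the retained column first.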
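You can use the distinct function from dplyr:
df_cleaned <- df %>% distinct(PERSONUM, retained, .keep_all = TRUE)
The code above keeps the first record for each distinct combination of PERSONUM and retained; .keep_all = TRUE retains the other columns of that row.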
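For comparison, roughly the same de-duplication can be sketched in base R by applying duplicated() to only the key columns, which is close to what the question attempted. The data frame df_small below is a hypothetical four-row subset of the question's data, used only for illustration.

```r
# Hypothetical subset of the question's data, for illustration only
df_small <- data.frame(
  PERSONUM     = c(10001685, 10001685, 10005733, 10020649),
  ID           = c(10109887, 10109887, 10162571, 10123585),
  ORDER_PERIOD = c("201750", "201850", "201550", "201550"),
  retained     = c("Y", "Y", "Y", "N"),
  stringsAsFactors = FALSE
)

# duplicated() flags rows whose (PERSONUM, retained) pair has already appeared,
# so negating it keeps the first row for each pair
key_cols <- c("PERSONUM", "retained")
df_small_cleaned <- df_small[!duplicated(df_small[, key_cols]), ]
```

Restricting duplicated() to the key columns is what makes the check ignore ORDER_PERIOD; calling it on the whole data frame only drops rows that match in every column.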
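Maybe the code below does what the question asks for.
It groups by ID and retained, then keeps only the first row of each group, eliminating the duplicates.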
library(dplyr)

# `orders` holds the question's data (named `df` in the question)
orders %>%
  group_by(ID, retained) %>%
  filter(row_number() == first(row_number()))
Note
The distinct() answer is simpler, and its result is identical to the result of the code above once the grouped output is ungrouped:
orders %>%
  group_by(ID, retained) %>%
  filter(row_number() == first(row_number())) %>%
  ungroup() -> df1

orders %>% distinct(ID, retained, .keep_all = TRUE) -> df2

identical(df1, df2)
#[1] TRUE
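Since the stated goal was to count retained customers, one possible follow-up is to count the de-duplicated rows per value of retained. This is only a sketch on a hypothetical four-row subset (orders_small), not something either answer proposed, and it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical subset of the question's data, for illustration only
orders_small <- tibble::tribble(
  ~PERSONUM, ~ID,      ~ORDER_PERIOD, ~retained,
  10001685,  10109887, "201750",      "Y",
  10001685,  10109887, "201850",      "Y",
  10005733,  10162571, "201550",      "Y",
  10020649,  10123585, "201550",      "N"
)

# Keep one row per (ID, retained) pair, then count customers per retained value
retained_counts <- orders_small %>%
  distinct(ID, retained) %>%
  count(retained)
```

Here distinct() is called without .keep_all, so only the key columns survive, which is enough for counting.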