Using dplyr to categorize objects in two columns of a data frame

Question

Hello, I have a sample data frame that looks like this:

   Policy_Holder_ID Insured_ID
   <chr>            <chr>     
 1 ID27343          ID215664  
 2 ID27310          ID27310   
 3 ID27343          ID205729  
 4 ID27343          ID205728  
 5 ID27348          ID205734  
 6 ID27348          ID205735  
 7 ID27315          ID205719  
 8 ID27315          ID27315   
 9 ID27345          ID205731  
10 ID27345          ID205733  
11 ID27345          ID27345   
12 ID2731           ID2731    
13 ID27310          ID205714  
14 ID27310          ID205715

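Apologies that it is not in dput form. I tried using that function but did not get the right result.

What I want is to split this data frame into 3 different categories, as listed below:

Group 1: Policy holders who insure only themselves. In other words, Policy_Holder_ID and Insured_ID are the same (example: ID2731).
Group 2: Policy holders who buy insurance only for others. In other words, they are listed in Policy_Holder_ID but not in Insured_ID, and they have one or more Insured_IDs (example: ID27343).
Group 3: Policy holders who buy insurance for both themselves and others (example: ID27310).

So the output should look like this: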

   Policy_Holder_ID Insured_ID    group
   <chr>            <chr>     
 1 ID27343          ID215664         2
 2 ID27310          ID27310          3
 3 ID27343          ID205729         2
 4 ID27343          ID205728         2
 5 ID27348          ID205734         2
 6 ID27348          ID205735         2
 7 ID27315          ID205719         3
 8 ID27315          ID27315          3  
 9 ID27345          ID205731         3
10 ID27345          ID205733         3
11 ID27345          ID27345          3  
12 ID2731           ID2731           1  
13 ID27310          ID205714         3 
14 ID27310          ID205715         3

I would appreciate a time-efficient solution rather than a for loop over the data. My original data has more than 400,000 rows, so a for loop is not practical.

Answer 1

After grouping by 'Policy_Holder_ID', we can use case_when. Following the description: if all elements of 'Insured_ID' match 'Policy_Holder_ID', return 1; if none of them match (use all again, this time with !=), return 2; otherwise the default is to return 3.

library(dplyr)

df1 %>%
  group_by(Policy_Holder_ID) %>%
  mutate(group = case_when(
    all(Insured_ID == Policy_Holder_ID) ~ 1,  # insures only themselves
    all(Insured_ID != Policy_Holder_ID) ~ 2,  # insures only others
    TRUE ~ 3                                  # insures themselves and others
  )) %>%
  ungroup()

Output

# A tibble: 14 x 3
#   Policy_Holder_ID Insured_ID group
#   <chr>            <chr>      <dbl>
# 1 ID27343          ID215664       2
# 2 ID27310          ID27310        3
# 3 ID27343          ID205729       2
# 4 ID27343          ID205728       2
# 5 ID27348          ID205734       2
# 6 ID27348          ID205735       2
# 7 ID27315          ID205719       3
# 8 ID27315          ID27315        3
# 9 ID27345          ID205731       3
#10 ID27345          ID205733       3
#11 ID27345          ID27345        3
#12 ID2731           ID2731         1
#13 ID27310          ID205714       3
#14 ID27310          ID205715       3

Data

df1 <- structure(list(Policy_Holder_ID = c("ID27343", "ID27310", "ID27343", 
"ID27343", "ID27348", "ID27348", "ID27315", "ID27315", "ID27345", 
"ID27345", "ID27345", "ID2731", "ID27310", "ID27310"), Insured_ID = c("ID215664", 
"ID27310", "ID205729", "ID205728", "ID205734", "ID205735", "ID205719", 
"ID27315", "ID205731", "ID205733", "ID27345", "ID2731", "ID205714", 
"ID205715")), class = "data.frame", row.names = c("1", "2", "3", 
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14"))

Answer 2

library(dplyr)

# Split into rows where the holder insures themselves and rows where they insure someone else
df1 <- df %>% filter(Policy_Holder_ID == Insured_ID) %>% mutate(group = 1)
df2 <- df %>% filter(Policy_Holder_ID != Insured_ID) %>% mutate(group = 2)
# Policy holders that appear in both subsets insure themselves and others
common_Policy_Holder_ID <- intersect(df1$Policy_Holder_ID, df2$Policy_Holder_ID)
# Set group to 3 for those holders (note: bind_rows changes the original row order)
df <- bind_rows(df1, df2) %>%
  mutate(group = if_else(Policy_Holder_ID %in% common_Policy_Holder_ID, 3, group))

Answer 3

A data.table option using nested ifelse:

library(data.table)

setDT(df)[
  ,
  group := ifelse(
    all(Policy_Holder_ID == Insured_ID),
    1,
    ifelse(
      !unique(Policy_Holder_ID) %in% Insured_ID,
      2,
      3
    )
  ),
  Policy_Holder_ID
]

gives

> df
    Policy_Holder_ID Insured_ID group
 1:          ID27343   ID215664     2
 2:          ID27310    ID27310     3
 3:          ID27343   ID205729     2
 4:          ID27343   ID205728     2
 5:          ID27348   ID205734     2
 6:          ID27348   ID205735     2
 7:          ID27315   ID205719     3
 8:          ID27315    ID27315     3
 9:          ID27345   ID205731     3
10:          ID27345   ID205733     3
11:          ID27345    ID27345     3
12:           ID2731     ID2731     1
13:          ID27310   ID205714     3
14:          ID27310   ID205715     3

Data

> dput(df)
structure(list(Policy_Holder_ID = c("ID27343", "ID27310", "ID27343",
"ID27343", "ID27348", "ID27348", "ID27315", "ID27315", "ID27345",
"ID27345", "ID27345", "ID2731", "ID27310", "ID27310"), Insured_ID = c("ID215664",
"ID27310", "ID205729", "ID205728", "ID205734", "ID205735", "ID205719",
"ID27315", "ID205731", "ID205733", "ID27345", "ID2731", "ID205714",
"ID205715")), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14"))
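
The nested ifelse can also be written with data.table's fcase, which reads closer to the three-way rule in the question. This is a sketch rather than the answer's own method; it assumes data.table 1.13.0 or later (where fcase is available), and it works on a copy named `dt` (a name chosen here for illustration) so the original `df` is not modified in place.

```r
library(data.table)

df <- data.frame(
  Policy_Holder_ID = c("ID27343", "ID27310", "ID27343", "ID27343", "ID27348",
                       "ID27348", "ID27315", "ID27315", "ID27345", "ID27345",
                       "ID27345", "ID2731", "ID27310", "ID27310"),
  Insured_ID = c("ID215664", "ID27310", "ID205729", "ID205728", "ID205734",
                 "ID205735", "ID205719", "ID27315", "ID205731", "ID205733",
                 "ID27345", "ID2731", "ID205714", "ID205715")
)

# as.data.table() copies, unlike setDT(), which converts df by reference
dt <- as.data.table(df)

# One value per policy holder, recycled over that holder's rows
dt[, group := fcase(
  all(Policy_Holder_ID == Insured_ID),  1,  # insures only themselves
  !any(Policy_Holder_ID == Insured_ID), 2,  # insures only others
  default = 3                               # insures themselves and others
), by = Policy_Holder_ID]
```

The conditions are evaluated in order, so a holder whose rows all match falls into group 1 before the other branches are considered.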

Answer 4

@Roozbeh

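Have you tried splitting this file into several files to speed up the computation on each one?

If the other answers do not work, you can try this:

Try the bigreadr package, following these steps:

Save your data frame to a .csv file;
Split it with the split_file function, like this:

split_file('your_file.csv', every_nlines = 'how many lines for each file', prefix_out = 'output_name', repeat_header = TRUE)
Read each file in a for loop; and
Use rbind to combine all the files back into one.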
