Add rows to a dataset depending on a set of conditions
I have the following dataset:
individual number treatment
1 1 AAAA
1 2 BBBB
1 3 CCCC
1 4 EEEE
1 5 XXXX
1 7 WWWW
2 2 EEEE
2 3 AAAA
2 5 RRRR
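An individual can receive up to 7 treatments, but some individuals receive only up to 5 (such as individual_id=2 in the example above). For each individual, I need to add new rows up to the maximum number of treatments that individual received (e.g. 7 for individual_id=1, 5 for individual_id=2), with treatment = NA. I would like something like this: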
individual_id number treatment
1 1 AAAA
1 2 BBBB
1 3 CCCC
1 4 EEEE
1 5 XXXX
1 6 NA
1 7 WWWW
2 1 NA
2 2 EEEE
2 3 AAAA
2 4 NA
2 5 RRRR
Here is a reproducible example of my actual dataset:
structure(list(individual_id = c(21L, 21L, 21L, 21L, 21L, 21L,
22L, 22L, 22L, 22L, 22L, 22L, 23L, 23L, 23L, 23L, 23L, 23L, 24L,
24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 24L, 25L, 25L,
25L, 25L, 25L, 25L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L,
26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L, 26L,
26L, 26L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 27L,
27L), number = c(2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 7, 7, 7,
7, 7, 7, 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4), treatment = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 4L, 4L, 4L, 4L, 4L,
4L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 4L,
4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Adalimumab", "Etanercept",
"Infliximab", "Rituximab"), class = "factor")), row.names = c(NA,
-72L), class = "data.frame")
We can use the tidyverse for this:
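library(tidyverse)

# df is the example data frame from the question (columns: individual, number, treatment)
df %>%
  group_by(individual) %>%
  # fill in every missing number between each individual's min and max;
  # the grouping column is not passed to complete(), which recent tidyr versions reject
  complete(number = seq(min(number), max(number), 1))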
# # A tibble: 11 x 3
# # Groups: individual [2]
# individual number treatment
# <int> <dbl> <fct>
# 1 1 1 AAAA
# 2 1 2 BBBB
# 3 1 3 CCCC
# 4 1 4 EEEE
# 5 1 5 XXXX
# 6 1 6 NA
# 7 1 7 WWWW
# 8 2 2 EEEE
# 9 2 3 AAAA
# 10 2 4 NA
# 11 2 5 RRRR
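Note: for this specific problem, per the comments below, number = seq(min(number), max(number), 1) should be number = seq(1, max(number), 1), because 1 is always the first number whether or not it appears in the records. I have left the line above as it is, though, since it seems to be the more general solution.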
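The variant described in the note can be sketched as follows. This is a minimal example, assuming the toy data from the question is rebuilt by hand as `df`; the name `filled` and the trailing `ungroup()` are illustrative additions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Toy data from the question, rebuilt by hand (assumption: treatment as character)
df <- data.frame(
  individual = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  number     = c(1, 2, 3, 4, 5, 7, 2, 3, 5),
  treatment  = c("AAAA", "BBBB", "CCCC", "EEEE", "XXXX", "WWWW",
                 "EEEE", "AAAA", "RRRR")
)

# Start every individual's sequence at 1 instead of at min(number),
# so a missing first treatment also gets an NA row
filled <- df %>%
  group_by(individual) %>%
  complete(number = seq(1, max(number), 1)) %>%
  ungroup()

print(filled)
```

With the sequence starting at 1, individual 2 also gets a row for number 1 with treatment NA, which is what the question asks for.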
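Consider building a helper data frame of all possible pairings of individual and treatment number, then running a left-join merge against the original dataset.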
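Below, by splits the data by individual, and for each subset expand.grid builds a data frame of all pairwise combinations of individual and number. Finally, do.call binds the list of per-group data frames into one data frame: fill_df.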
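# For each individual, list every number from 1 up to that individual's maximum
fill_df <- do.call(rbind, by(df, df$individual, function(sub)
  expand.grid(individual = unique(sub$individual),
              number = 1:max(sub$number))
))

# Left join: numbers with no matching record get treatment = NA
final_df <- merge(fill_df, df, all.x = TRUE)
final_df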
# individual number treatment
# 1 1 1 AAAA
# 2 1 2 BBBB
# 3 1 3 CCCC
# 4 1 4 EEEE
# 5 1 5 XXXX
# 6 1 6 <NA>
# 7 1 7 WWWW
# 8 2 1 <NA>
# 9 2 2 EEEE
# 10 2 3 AAAA
# 11 2 4 <NA>
# 12 2 5 RRRR
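The actual dataset in the question names its key column individual_id and repeats each (individual_id, number) pair several times. The sketch below applies the same helper-frame-plus-merge idea to data of that shape. It assumes a reduced, hand-built excerpt (`df2`, two copies of each pair instead of six); the names `df2`, `fill_df2` and `final_df2` are hypothetical.

```r
# Reduced excerpt shaped like the real data (assumption: 2 copies per pair, not 6)
df2 <- data.frame(
  individual_id = c(21L, 21L, 24L, 24L, 24L, 24L),
  number        = c(2, 2, 1, 1, 3, 3),
  treatment     = c("Infliximab", "Infliximab", "Adalimumab", "Adalimumab",
                    "Infliximab", "Infliximab")
)

# One row per individual_id and per number from 1 to that individual's maximum
fill_df2 <- do.call(rbind, lapply(split(df2, df2$individual_id), function(sub)
  data.frame(individual_id = sub$individual_id[1],
             number = seq_len(max(sub$number)))
))

# Left join on both keys: repeated records are kept as they are,
# and missing numbers appear once with treatment = NA
final_df2 <- merge(fill_df2, df2, by = c("individual_id", "number"), all.x = TRUE)
print(final_df2)
```

The repeated records pass through the merge unchanged, so only the missing numbers (1 for individual 21, 2 for individual 24) add rows.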