How to know if the data is censored in a survival analysis using R
I have a dataset that looks like this (a nonsensical example):
id <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
year <- c(1990, 1991, 1992, 1989, 1990, 1991, 1992, 1993, 1989, 1990, 1992, 1993)
event <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
df <- data.frame(id, year, event)  # data.frame rather than cbind(), which would return a matrix
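Suppose all three ids are supposed to be observed continuously from 1989 until death. However, as you can see, id 1 is left-censored (no information from the start), id 2 is right-censored (no information from the start or the end), and id 3 has a gap in its observations (information from the start and the end, but with a gap in between). This is easy to see in a small table, but it becomes difficult with a large dataset.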
Edit: Is there a way to group by id and create a summary table with information about the completeness of the data, for example:
id  left-censored  right-censored  gaps in obs.
 1              1               0             0
 2              0               1             0
 3              0               0             1
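You can group your data.frame (I use a tibble) by id (I use dplyr) and then create new variables that indicate whether each id's first observed year is 1989, whether the person died during the observation period, and whether the number of rows per id equals the time span (max_year - min_year + 1). In this case I treat ID 2 as not left-censored, because her first observed year is 1989, which you defined as the start year.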
library(tibble)
library(dplyr)
id <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
year <- c(1990, 1991, 1992, 1989, 1990, 1991, 1992, 1993, 1989, 1990, 1992, 1993)
deceased <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
df <- tibble(id, year, deceased)
start_year <- 1989
df %>%
  group_by(id) %>%
  mutate(left_censored = min(year) > start_year,      # left-censored if the first observed year is after start_year
         right_censored = max(deceased) == 0,         # right-censored if no death was recorded during observation
         has_gaps = n() < max(year) - min(year) + 1)  # has gaps if there are fewer rows than years spanned
Result:
# A tibble: 12 x 6
# Groups: id [3]
id year deceased left_censored right_censored has_gaps
<dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
1 1 1990 0 TRUE FALSE FALSE
2 1 1991 0 TRUE FALSE FALSE
3 1 1992 1 TRUE FALSE FALSE
4 2 1989 0 FALSE TRUE FALSE
5 2 1990 0 FALSE TRUE FALSE
6 2 1991 0 FALSE TRUE FALSE
7 2 1992 0 FALSE TRUE FALSE
8 2 1993 0 FALSE TRUE FALSE
9 3 1989 0 FALSE FALSE TRUE
10 3 1990 0 FALSE FALSE TRUE
11 3 1992 0 FALSE FALSE TRUE
12 3 1993 1 FALSE FALSE TRUE
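The has_gaps flag only tells you that a gap exists, not where it is. As a minimal base R sketch, assuming one row per id and year (the name missing_years is only illustrative and not part of the answer above), you could list the years missing between each id's first and last observation:

```r
id <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
year <- c(1990, 1991, 1992, 1989, 1990, 1991, 1992, 1993, 1989, 1990, 1992, 1993)

# For each id, compare the observed years with the full range
# from that id's first to last observed year.
missing_years <- lapply(split(year, id), function(y) {
  setdiff(seq(min(y), max(y)), y)
})

print(missing_years)
```

For the example data, only id 3 has a missing year (1991). The entries for ids 1 and 2 are empty vectors.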
Edit: If you want an overview, you can add:
df %>%
  group_by(id) %>%
  mutate(left_censored = min(year) > start_year,      # left-censored if the first observed year is after start_year
         right_censored = max(deceased) == 0,         # right-censored if no death was recorded during observation
         has_gaps = n() < max(year) - min(year) + 1) %>%  # has gaps if there are fewer rows than years spanned
  dplyr::distinct(id, left_censored, right_censored, has_gaps) %>%  # one row per id
  ungroup() %>%
  summarise(left_censored = sum(left_censored), right_censored = sum(right_censored), has_gaps = sum(has_gaps))
which gives:
# A tibble: 1 x 3
left_censored right_censored has_gaps
<int> <int> <int>
1 1 1 1
As I mentioned before: ID 2 is not counted as left-censored here, because her start year is 1989.
Edit 2: If you remove ungroup(), you get the overview you asked for:
df %>%
  group_by(id) %>%
  mutate(left_censored = min(year) > start_year,      # left-censored if the first observed year is after start_year
         right_censored = max(deceased) == 0,         # right-censored if no death was recorded during observation
         has_gaps = n() < max(year) - min(year) + 1) %>%  # has gaps if there are fewer rows than years spanned
  dplyr::distinct(id, left_censored, right_censored, has_gaps) %>%  # one row per id, still grouped by id
  summarise(left_censored = sum(left_censored), right_censored = sum(right_censored), has_gaps = sum(has_gaps))
which gives:
id left_censored right_censored has_gaps
<dbl> <int> <int> <int>
1 1 1 0 0
2 2 0 1 0
3 3 0 0 1
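The same per-id overview can probably be produced in one step, since summarise() already collapses each group to a single row. The following is only a sketch under a few assumptions of mine: it uses a plain data.frame, the name overview is illustrative, and n_distinct(year) stands in for n() so that a repeated year for the same id would not hide a gap. That substitution is not part of the answer above.

```r
library(dplyr)

id <- c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3)
year <- c(1990, 1991, 1992, 1989, 1990, 1991, 1992, 1993, 1989, 1990, 1992, 1993)
deceased <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
df <- data.frame(id, year, deceased)
start_year <- 1989

# One row per id; the logical flags are converted to 0/1 integers
overview <- df %>%
  group_by(id) %>%
  summarise(
    left_censored  = as.integer(min(year) > start_year),
    right_censored = as.integer(max(deceased) == 0),
    has_gaps       = as.integer(n_distinct(year) < max(year) - min(year) + 1)
  )

print(overview)
```

This skips the mutate() and distinct() steps, which are only needed if you also want the flags attached to every original row.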