Select the optimal number based on conditions

Question

This is my minimal dataset:

df=structure(list(ID = c(3942504L, 3199413L, 1864266L, 4037617L, 
2030477L, 1342330L, 5434070L, 3200378L, 4810153L, 4886225L), 
    MI_TIME = c(1101L, 396L, 1140L, 417L, 642L, 1226L, 1189L, 
    484L, 766L, 527L), MI_Status = c(0L, 0L, 1L, 0L, 0L, 0L, 
    0L, 0L, 1L, 0L), Stroke_status = c(1L, 0L, 1L, 0L, 0L, 0L, 
    0L, 1L, 1L, 0L), Stroke_time = c(1101L, 396L, 1140L, 417L, 
    642L, 1226L, 1189L, 484L, 766L, 527L), Arrhythmia_status = c(NA, 
    NA, TRUE, NA, NA, TRUE, NA, NA, TRUE, NA), Arrythmia_time = c(1101L, 
    356L, 1122L, 7L, 644L, 126L, 118L, 84L, 76L, 5237L)), row.names = c(NA, 
10L), class = "data.frame")

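As you can see, I mainly have two types of variables: "_status" and "_time".

I am preparing my dataset for survival analysis, and "time" is the time to event in days.

The problem comes when I try to create a variable called "any cardiovascular outcome" (df$CV), which I defined as follows: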

df$CV = NA
df$CV <- with(df, ifelse(MI_Status=='1' | Stroke_status=='1' | Arrhythmia_status== 'TRUE'  ,'1', '0'))              
df$CV = as.factor(df$CV)

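The problem I have is selecting the best time to event. I now have a new variable called df$CV, but three different "_time" variables. So I would like to create a new column called df$CV_time where the time is that of the event that occurred first. There is a catch, though, so let me give an example:

If we have a subject with MI_status==1, Arrythmia_status==NA, stroke_status==1 and MI_time==200, Arrythmia_time==100, stroke_time==220 --> the correct time for df$CV would be 200, since that is the time of the earliest event.

However, in the case of MI_status==0, Arrythmia_status==NA, stroke_status==0 and MI_time==200, Arrythmia_time==100, stroke_time==220 --> the correct time for df$CV would be 220, since the most recent follow-up was at 220 days.

How can I select the optimal number for df$CV based on these conditions?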

Answer 1

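Here is one possible approach using the tidyverse.

First, you may want to make sure your column names are consistent in spelling and case (rename is used here).

Then you can explicitly define the arrhythmia outcome as TRUE or FALSE (instead of using NA).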
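To illustrate why the NA values matter, here is a minimal sketch in base R. It assumes, as this answer does, that a missing arrhythmia status means no arrhythmia was recorded; the vector name `arrhythmia` is only illustrative. In the question's ifelse call, a subject with no MI, no stroke and NA arrhythmia ends up with an NA outcome rather than '0', because a logical OR with NA stays NA unless another operand is TRUE.

```r
# Logical OR with a missing value is missing unless another operand is TRUE
FALSE | FALSE | NA   # NA
TRUE | NA            # TRUE

# Recode missing status as FALSE (assumption: NA means "no event recorded")
arrhythmia <- c(NA, TRUE, NA)
arrhythmia[is.na(arrhythmia)] <- FALSE
arrhythmia           # FALSE TRUE FALSE
```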

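You can use pivot_longer to put your data into long format and then group_by your ID. You can include the specific columns related to MI, stroke and arrhythmia here (those for which both "time" and "status" columns are available). Note that in your actual dataset (where you used glimpse) it is unclear which columns you want for arrhythmia: there is a pif column name, but no specific time or status column.

Your cardiovascular outcome would be TRUE if MI or stroke status is 1, or if arrhythmia status is TRUE.

If there is a cardiovascular outcome, the time to event is the min time; otherwise use the censoring time at the most recent follow-up, i.e. the max time.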
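As a minimal sketch of this rule on the two examples from the question: it assumes "min time" means the earliest time among the events that actually occurred, which is how the question's first example reads. `cv_time` is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: `status` is a logical vector (TRUE = event occurred),
# `time` holds the matching times in days.
cv_time <- function(status, time) {
  occurred <- status %in% TRUE          # NA is treated as "no event"
  if (any(occurred)) min(time[occurred]) else max(time)
}

# Order of elements: MI, arrhythmia, stroke
cv_time(c(TRUE, NA, TRUE), c(200, 100, 220))    # 200: earliest event that occurred
cv_time(c(FALSE, NA, FALSE), c(200, 100, 220))  # 220: latest follow-up, no event
```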

Let me know if this gives you the desired output.

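library(tidyverse)

df %>%
  # Make column names consistent in spelling and case
  rename(MI_time = MI_TIME, MI_status = MI_Status, Arrhythmia_time = Arrythmia_time) %>%
  # Treat a missing arrhythmia status as "no event"
  replace_na(list(Arrhythmia_status = FALSE)) %>%
  # Long format: one row per ID and event, with `status` and `time` columns
  pivot_longer(cols = c(starts_with("MI_"), starts_with("Stroke_"), starts_with("Arrhythmia_")), 
               names_to = c("event", ".value"), 
               names_sep = "_") %>%
  group_by(ID) %>%
  summarise(
    # TRUE if MI or stroke status is 1, or arrhythmia status is TRUE
    any_cv_outcome = any(status[event %in% c("MI", "Stroke")] == 1 | status[event == "Arrhythmia"]),
    # min time if any outcome occurred, otherwise max (censoring) time
    cv_time_to_event = ifelse(any_cv_outcome, min(time), max(time))
  )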

Output

        ID any_cv_outcome cv_time_to_event
     <int> <lgl>                     <int>
 1 1342330 TRUE                        126
 2 1864266 TRUE                       1122
 3 2030477 FALSE                       644
 4 3199413 FALSE                       396
 5 3200378 TRUE                         84
 6 3942504 TRUE                       1101
 7 4037617 FALSE                       417
 8 4810153 TRUE                         76
 9 4886225 FALSE                      5237
10 5434070 FALSE                      1189
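
One thing worth checking in this output: min(time) is taken over all three time columns, not only over the events that occurred. For ID 3200378 the only event is the stroke at 484 days, yet the result is 84, which is the arrhythmia time. The following is a sketch in base R that restricts the minimum to events that occurred. It assumes that an NA arrhythmia status means no event, and it uses the original column names from the question; `first_event` and `last_followup` are illustrative names.

```r
df <- data.frame(
  ID = c(3942504L, 3199413L, 1864266L, 4037617L, 2030477L,
         1342330L, 5434070L, 3200378L, 4810153L, 4886225L),
  MI_TIME = c(1101L, 396L, 1140L, 417L, 642L, 1226L, 1189L, 484L, 766L, 527L),
  MI_Status = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L),
  Stroke_status = c(1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L),
  Stroke_time = c(1101L, 396L, 1140L, 417L, 642L, 1226L, 1189L, 484L, 766L, 527L),
  Arrhythmia_status = c(NA, NA, TRUE, NA, NA, TRUE, NA, NA, TRUE, NA),
  Arrythmia_time = c(1101L, 356L, 1122L, 7L, 644L, 126L, 118L, 84L, 76L, 5237L)
)

# TRUE where the event occurred; %in% turns NA into FALSE
mi     <- df$MI_Status %in% 1L
stroke <- df$Stroke_status %in% 1L
arr    <- df$Arrhythmia_status %in% TRUE

# Earliest time among events that occurred (NA when no event occurred)
first_event <- pmin(
  ifelse(mi, df$MI_TIME, NA),
  ifelse(stroke, df$Stroke_time, NA),
  ifelse(arr, df$Arrythmia_time, NA),
  na.rm = TRUE
)

# Latest follow-up across all three time columns, used as censoring time
last_followup <- pmax(df$MI_TIME, df$Stroke_time, df$Arrythmia_time)

df$CV <- as.integer(mi | stroke | arr)
df$CV_time <- ifelse(df$CV == 1L, first_event, last_followup)
df[, c("ID", "CV", "CV_time")]
```

With this variant, ID 3200378 gets a CV_time of 484; the other nine rows match the output above.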
