Seed germination data: converting time-to-event data from short form to long form for survival analysis

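I am using survival analysis to evaluate seedling emergence rates, and I would like to automate converting the short-form data I collect into long-form data for analysis in R.

Here is an example of the format the data are collected in, along with the date conversion: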

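library(tidyverse)  # provides tribble(), the dplyr verbs, and the tidyr reshaping functions used below

prac.dat <- tribble(
  ~ID, ~ImbibtionStartDate, ~Survey1date, ~Survey1totalcounts, ~Survey2date, ~Survey2totalcounts, ~Survey3date, ~Survey3totalcounts, ~Total_sown_seeds,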
  #--/--------------------/-------------/--------------------/-------------/------------------/---------------/------------------/-----------------/
  "ID1", "3/22/2022 14:20","3/24/2022 16:45", 0, "3/25/2022 16:00", 8, "3/26/2022 13:00", 21, 25,
  "ID2", "3/22/2022 14:20","3/24/2022 16:45", 1, "3/25/2022 16:00", 4, "3/26/2022 13:00", 11, 25,
)

prac.dat <- prac.dat %>% 
  mutate(ImbibtionStartDate=as.POSIXct(ImbibtionStartDate, format="%m/%d/%Y %H:%M"),
         Survey1date=as.POSIXct(Survey1date, format="%m/%d/%Y %H:%M"),
         Survey2date=as.POSIXct(Survey2date, format="%m/%d/%Y %H:%M"),
         Survey3date=as.POSIXct(Survey3date, format="%m/%d/%Y %H:%M"))

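In this dataset, "ID" identifies the pot the seeds were sown in, "ImbibtionStartDate" is the date and time the seeds in the soil were first watered, "Survey1date" [and the other survey date columns] is the date and time a survey was carried out to count the total number of emerged seedlings, "Survey1totalcounts" [and the other survey count columns] gives the cumulative number of seedlings that had emerged in that pot as of that survey date, and "Total_sown_seeds" is the total number of seeds sown in the pot.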

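My goal is a dataset that 1) has one row for each seed in each pot (the pot is identified by the "ID" column), 2) indicates whether the seed emerged ("1") or did not emerge ("0") during the study period, and 3) gives the time at which each seed emerged (estimated as the difference between the date and time of the survey at which the seedling was first detected and the imbibition start date and time).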

I would like the final output to look like this:

desired.output <- tribble(
  ~ID, ~Emg_Poa, ~time_to_emg,
  #Unique Id for each Seed/
  #whether that seed emerged ("1") or not ("0") by the final survey date/
  #days it took for that seed to emerge from imbibition start to survey date/
  "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07, "ID1",1, 3.07,
  "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94,
  "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",1, 3.94, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA, "ID1",0, NA,
  "ID2",1, 2.10, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.07, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94,
  "ID2",1, 3.94, "ID2",1, 3.94, "ID2",1, 3.94, "ID2",0, NA, "ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,
  "ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA,"ID2",0, NA, "ID2",0, NA, "ID2",0, NA, "ID2",0, NA
  )

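So far I have been doing these conversions by hand from one Excel file to another, but to minimize errors and save time, I am curious whether anyone would be willing to suggest a way to automate this process in R. This task is beyond my current abilities with data frame generation in R. Thank you for your time, consideration, and input.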

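Getting the desired output from prac.dat is a little tricky, but definitely possible. First, let's convert prac.dat to "long" format and calculate some useful columns: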

prac.long <- prac.dat %>% 
  pivot_longer(
    matches('counts|Survey.*date'),
    names_to = c('survey_num', '.value'),
    names_pattern = 'Survey(\\d)(.*)'  # backslash doubled: "\d" alone is an invalid escape in an R string
  ) %>% 
  rename(survey_date = date, count = totalcounts) %>% 
  group_by(ID) %>% 
  mutate(
    across(c(ImbibtionStartDate, survey_date), ~as.POSIXct(., format="%m/%d/%Y %H:%M")),
    not_emerged = Total_sown_seeds - max(count),
    time_to_emerge = survey_date - ImbibtionStartDate,
    emerged_at_survey = count - lag(count),
    emerged_at_survey = ifelse(is.na(emerged_at_survey), count[1], emerged_at_survey)
  ) 

  ID    ImbibtionStartDate  Total_sown_seeds survey_num survey_date         count not_emerged
  <chr> <dttm>                         <dbl> <chr>      <dttm>              <dbl>       <dbl>
1 ID1   2022-03-22 14:20:00               25 1          2022-03-24 16:45:00     0           4
2 ID1   2022-03-22 14:20:00               25 2          2022-03-25 16:00:00     8           4
3 ID1   2022-03-22 14:20:00               25 3          2022-03-26 13:00:00    21           4
4 ID2   2022-03-22 14:20:00               25 1          2022-03-24 16:45:00     1          14
5 ID2   2022-03-22 14:20:00               25 2          2022-03-25 16:00:00     4          14
6 ID2   2022-03-22 14:20:00               25 3          2022-03-26 13:00:00    11          14
# … with 2 more variables: time_to_emerge <drtn>, emerged_at_survey <dbl>
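
The reshaping step above can be hard to follow on the full data, so here is a minimal sketch of the same idea on a toy one-pot table. The pot name "P1" and the names `toy`, `toy_long`, `start`, `days` and `new_seedlings` are made up for illustration and are not part of the original data. It differs from the code above in two small ways, both optional: `lag(..., default = 0)` replaces the two-step `lag()` plus `ifelse()`, and `difftime(..., units = "days")` fixes the unit, because plain subtraction of two date-times lets R pick the unit automatically (it could come back in hours if every gap were under a day).

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data (hypothetical pot "P1"): cumulative counts at two surveys
toy <- tribble(
  ~ID,  ~start,             ~Survey1date,       ~Survey1totalcounts, ~Survey2date,       ~Survey2totalcounts,
  "P1", "2022-03-22 14:20", "2022-03-24 16:45", 1,                   "2022-03-25 16:00", 4
)

toy_long <- toy %>%
  pivot_longer(
    matches("^Survey\\d"),
    # The digit becomes the survey_num column; the rest of the name
    # ("date" / "totalcounts") becomes an output column via ".value"
    names_to = c("survey_num", ".value"),
    names_pattern = "Survey(\\d)(.*)"
  ) %>%
  group_by(ID) %>%
  mutate(
    start = as.POSIXct(start, tz = "UTC"),
    date  = as.POSIXct(date, tz = "UTC"),
    # Elapsed time, always expressed in days
    days = as.numeric(difftime(date, start, units = "days")),
    # Cumulative counts -> seedlings newly seen at each survey
    new_seedlings = totalcounts - lag(totalcounts, default = 0)
  ) %>%
  ungroup()

toy_long
```

With this toy input, `new_seedlings` is 1 for the first survey and 3 for the second, since the cumulative count rose from 1 to 4.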

We also need the number of seeds that did not emerge:

prac.unemerged <- select(prac.long, ID, not_emerged) %>% 
  distinct() %>% 
  mutate(time_to_emerge = NA) %>% 
  rename(count = not_emerged)

  ID    count time_to_emerge
  <chr> <dbl> <lgl>         
1 ID1       4 NA            
2 ID2      14 NA  

Finally, we combine the counts of emerged seeds and their emergence times with prac.unemerged, and use uncount to expand to your desired output:

result <- select(prac.long, ID, time_to_emerge, count = emerged_at_survey) %>% 
  bind_rows(prac.unemerged) %>% 
  uncount(weights = count) %>% 
  mutate(Emg_poa = as.numeric(!is.na(time_to_emerge))) %>% 
  arrange(ID, time_to_emerge)

   ID    time_to_emerge Emg_poa
   <chr> <drtn>           <dbl>
 1 ID1   3.069444 days        1
 2 ID1   3.069444 days        1
 3 ID1   3.069444 days        1
 4 ID1   3.069444 days        1
 5 ID1   3.069444 days        1
 6 ID1   3.069444 days        1
 7 ID1   3.069444 days        1
 8 ID1   3.069444 days        1
 9 ID1   3.944444 days        1
10 ID1   3.944444 days        1
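
If the `uncount` step is unfamiliar, this small sketch shows what it does in isolation. The table `counts` and its values are invented for illustration (they mirror the shape of the combined table above, not its full contents). Each row is repeated as many times as its `count` says, and the `count` column itself is dropped by default.

```r
library(tibble)
library(tidyr)

# Hypothetical per-group counts: 1 seed at 2.10 days, 3 at 3.07 days, 2 never emerged
counts <- tribble(
  ~ID,   ~time_to_emerge, ~count,
  "ID2", 2.10,            1,
  "ID2", 3.07,            3,
  "ID2", NA,              2
)

# One row per seed: each row is duplicated `count` times
seeds <- uncount(counts, weights = count)

# Sanity check: the expanded table has as many rows as the counts add up to
nrow(seeds) == sum(counts$count)
```

The same kind of check on the real data, comparing the number of rows per ID in `result` against `Total_sown_seeds`, is a cheap way to confirm that no seeds were lost or duplicated along the way.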