Turning multiple rows into a string by merging on ID in R

Question

Table 1 (there are several hundred IDs):

 participant_id hpo_term              year_of_birth  affected_relative   genome
    123         kidney failure          2000               Y                38
    123         hand tremor             2000               Y                38
    123         kidney transplant       2000               Y                38
    432         hypertension            1980               N                37
    432         exotropia               1980               N                37
    432         scissor gait            1980               N                37

I have two lookup tables (each table has hundreds of values):

Kidney lookup:

kidney failure
kidney transplant
hypertension

Non-kidney lookup (each lookup has hundreds of values):

hand tremor
exotropia
scissor gait

Desired result:

participant_id kidney_hpo_term                   non_kidney_hpo_term    year_of_birth affected_relative   genome
123            kidney failure;kidney transplant  hand tremor            2000              Y                 38
432            hypertension                      exotropia;scissor gait 1980              N                 37

Initially I tried:

library(dplyr); library(tidyr)
pt.data %>% 
   mutate(kidney = hpo_term %in% kidney.hpo) %>%
   pivot_wider(names_from = kidney, values_from = hpo_term,
               values_fn = function(x)paste(x,collapse = ";"), values_fill = NA) %>%
   setNames(c("participant_id","Kidney","Non.kidney"))

with kidney.hpo <- read.delim("kidney_hpo_terms.txt", header = F)

But I get "Error in values_fn[[value]] : object of type 'closure' is not subsettable".

I'm not sure what I'm doing wrong; any help is much appreciated.

Answer 1

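A few things to say about your data.

First, your table1 carries redundant information: year_of_birth, affected_relative, and genome are the same on every row for a given participant.

This is better stored in a separate table, which I named table1_short.

For your problem, you only need to check whether a term is in a vector, which is done with %in%.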
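As a minimal illustration of that check, the sketch below uses a made-up vector named terms and a stand-in data frame named lookup_df (both hypothetical, not from the question). It also shows one thing to watch for: read.delim() returns a data frame, so the lookup column has to be pulled out as a vector before it is used with %in%.

```r
# Lookup vector and a few example terms (illustrative values only)
renal_lookup <- c("kidney failure", "kidney transplant", "hypertension")
terms <- c("kidney failure", "hand tremor", "kidney transplant")

# %in% returns one TRUE/FALSE per element of the left-hand vector
is_renal <- terms %in% renal_lookup
is_renal
#> [1]  TRUE FALSE  TRUE

# Keep only the matching terms and collapse them into a single string
collapsed <- paste(terms[is_renal], collapse = ";")
collapsed
#> [1] "kidney failure;kidney transplant"

# lookup_df stands in for the result of read.delim(..., header = FALSE),
# which names its single column V1. Test against the column, not the data frame.
lookup_df <- data.frame(V1 = renal_lookup, stringsAsFactors = FALSE)
in_file_lookup <- terms %in% lookup_df$V1
```

If the lookup is read from a file as in the question, passing kidney.hpo$V1 (or kidney.hpo[[1]]) rather than kidney.hpo itself is probably what is intended.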

The code can be written as follows:

library(tidyverse)
table1=read.table(header=T, text="
participant_id hpo_term              year_of_birth  affected_relative   genome
123         'kidney failure'          2000               Y                38
123         'hand tremor'             2000               Y                38
123         'kidney transplant'       2000               Y                38
432         hypertension              1980               N                37
432         exotropia                 1980               N                37
432         'scissor gait'            1980               N                37")

table1_short = table1 %>% select(-hpo_term) %>% group_by(participant_id) %>% slice(1)
table1_long = table1 %>% select(1:2)

renal_lookup = c("kidney failure", "kidney transplant", "hypertension")
nonrenal_lookup = c("hand tremor", "exotropia", "scissor gait")


table1_long %>% 
  group_by(participant_id) %>% 
  summarise(
    kidney_hpo_term = hpo_term[hpo_term %in% renal_lookup] %>% paste(collapse=";"),
    non_kidney_hpo_term = hpo_term[hpo_term %in% nonrenal_lookup] %>% paste(collapse=";")
  ) %>% 
  left_join(table1_short, by="participant_id")
#> # A tibble: 2 x 6
#>   participant_id kidney_hpo_term                  non_kidney_hpo_term    year_of_birth affected_relative genome
#>            <int> <chr>                            <chr>                          <int> <chr>              <int>
#> 1            123 kidney failure;kidney transplant hand tremor                     2000 Y                     38
#> 2            432 hypertension                     exotropia;scissor gait          1980 N                     37

Created on 2021-05-12 by the reprex package (v2.0.0)
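On the error itself: the message mentions values_fn[[value]], which suggests the installed tidyr was indexing values_fn like a list, and that fails when a bare function is passed. A possible workaround, sketched below under that assumption, is to pass a named list keyed by the values_from column. The sketch also labels the groups with the final column names up front, so no setNames() call is needed, and it assumes kidney.hpo is a plain character vector rather than the data frame returned by read.delim().

```r
library(dplyr)
library(tidyr)

# Sample data from the question
pt.data <- data.frame(
  participant_id    = c(123, 123, 123, 432, 432, 432),
  hpo_term          = c("kidney failure", "hand tremor", "kidney transplant",
                        "hypertension", "exotropia", "scissor gait"),
  year_of_birth     = c(2000, 2000, 2000, 1980, 1980, 1980),
  affected_relative = c("Y", "Y", "Y", "N", "N", "N"),
  genome            = c(38, 38, 38, 37, 37, 37),
  stringsAsFactors  = FALSE
)

# Assumed to be a character vector (e.g. the V1 column of the file that was read)
kidney.hpo <- c("kidney failure", "kidney transplant", "hypertension")

result <- pt.data %>%
  # Label each row with the name of the column it should end up in
  mutate(group = if_else(hpo_term %in% kidney.hpo,
                         "kidney_hpo_term", "non_kidney_hpo_term")) %>%
  # values_fn as a named list, keyed by the values_from column
  pivot_wider(names_from = group, values_from = hpo_term,
              values_fn = list(hpo_term = function(x) paste(x, collapse = ";")))

result
```

The remaining columns (year_of_birth, affected_relative, genome) are constant within a participant, so they act as identifier columns and are carried through unchanged.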

Answer 2

This can be done with dcast from data.table, as follows:

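library(data.table)
# dtt is assumed to be the question's table as a data.table (e.g. via as.data.table()),
# and kidney_hpo a character vector of kidney terms.
dtt[, group := paste0(
    ifelse(hpo_term %in% kidney_hpo, 'kidney', 'non_kidney'), '_hpo_term')]
# Every column other than group and hpo_term stays on the left-hand side (...)
dcast(dtt, ... ~ group, value.var = 'hpo_term',
    fun.aggregate = paste, collapse = ';')
#    participant_id year_of_birth affected_relative genome                  kidney_hpo_term
# 1:            123          2000                 Y     38 kidney failure;kidney transplant
# 2:            432          1980                 N     37                     hypertension
#       non_kidney_hpo_term
# 1:            hand tremor
# 2: exotropia;scissor gait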
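
To try the dcast approach end to end, here is a self-contained sketch that builds the inputs from the question's sample rows. The construction of dtt and kidney_hpo is an assumption about how the data would be loaded; only the group assignment and the dcast call come from the answer.

```r
library(data.table)

# Sample rows from the question, built directly as a data.table
dtt <- data.table(
  participant_id    = c(123L, 123L, 123L, 432L, 432L, 432L),
  hpo_term          = c("kidney failure", "hand tremor", "kidney transplant",
                        "hypertension", "exotropia", "scissor gait"),
  year_of_birth     = c(2000L, 2000L, 2000L, 1980L, 1980L, 1980L),
  affected_relative = c("Y", "Y", "Y", "N", "N", "N"),
  genome            = c(38L, 38L, 38L, 37L, 37L, 37L)
)

# Kidney lookup as a plain character vector
kidney_hpo <- c("kidney failure", "kidney transplant", "hypertension")

# Tag each row with the name of its target column
dtt[, group := paste0(
  ifelse(hpo_term %in% kidney_hpo, "kidney", "non_kidney"), "_hpo_term")]

# One row per participant; terms in each group are pasted together with ";"
wide <- dcast(dtt, ... ~ group, value.var = "hpo_term",
              fun.aggregate = paste, collapse = ";")
wide
```

Note that := modifies dtt in place, so the group column remains on dtt after this runs.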
