Turning multiple rows into a string by merging on ID in R
Table 1 (several hundred IDs):
participant_id hpo_term year_of_birth affected_relative genome
123 kidney failure 2000 Y 38
123 hand tremor 2000 Y 38
123 kidney transplant 2000 Y 38
432 hypertension 1980 N 37
432 exotropia 1980 N 37
432 scissor gait 1980 N 37
I have two lookup tables (each with hundreds of values):
Kidney lookup:
kidney failure
kidney transplant
hypertension
Non-kidney lookup:
hand tremor
exotropia
scissor gait
Desired result:
participant_id kidney_hpo_term non_kidney_hpo_term year_of_birth affected_relative genome
123 kidney failure;kidney transplant hand tremor 2000 Y 38
432 hypertension exotropia;scissor gait 1980 N 37
Initially I tried:
library(dplyr); library(tidyr)
pt.data %>%
mutate(kidney = hpo_term %in% kidney.hpo) %>%
pivot_wider(names_from = kidney, values_from = hpo_term,
values_fn = function(x)paste(x,collapse = ";"), values_fill = NA) %>%
setNames(c("participant_id","Kidney","Non.kidney"))
with kidney.hpo <- read.delim("kidney_hpo_terms.txt", header = F)
But I get "Error in values_fn[[value]] : object of type 'closure' is not subsettable".
I'm not sure what I'm doing wrong; any help would be much appreciated.
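There are a few things to say about your data.
First, your table1 repeats information: year_of_birth, affected_relative and genome are the same on every row for a given participant.
This is better stored in a separate table, which I have named table1_short.
For your actual problem, you only need to check whether a term is in a vector, which is done with %in%.
The code can be written as follows: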
library(tidyverse)
table1=read.table(header=T, text="
participant_id hpo_term year_of_birth affected_relative genome
123 'kidney failure' 2000 Y 38
123 'hand tremor' 2000 Y 38
123 'kidney transplant' 2000 Y 38
432 hypertension 1980 N 37
432 exotropia 1980 N 37
432 'scissor gait' 1980 N 37")
table1_short = table1 %>% select(-hpo_term) %>% group_by(participant_id) %>% slice(1)
table1_long = table1 %>% select(1:2)
renal_lookup = c("kidney failure", "kidney transplant", "hypertension")
nonrenal_lookup = c("hand tremor", "exotropia", "scissor gait")
table1_long %>%
group_by(participant_id) %>%
summarise(
kidney_hpo_term = hpo_term[hpo_term %in% renal_lookup] %>% paste(collapse=";"),
non_kidney_hpo_term = hpo_term[hpo_term %in% nonrenal_lookup] %>% paste(collapse=";")
) %>%
left_join(table1_short, by="participant_id")
#> # A tibble: 2 x 6
#> participant_id kidney_hpo_term non_kidney_hpo_term year_of_birth affected_relative genome
#> <int> <chr> <chr> <int> <chr> <int>
#> 1 123 kidney failure;kidney transplant hand tremor 2000 Y 38
#> 2 432 hypertension exotropia;scissor gait 1980 N 37
Created on 2021-05-12 by the reprex package (v2.0.0)
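As for the error in the original pivot_wider attempt, here is a hedged sketch of how that approach might be repaired. It rests on two assumptions that the post does not confirm. First, the message about values_fn[[value]] suggests an older tidyr release in which values_fn had to be a named list rather than a bare function; a named list is accepted by newer releases too. Second, read.delim returns a data frame, so hpo_term %in% kidney.hpo would not match individual terms; the column (named V1 when header = F) has to be extracted. The pt.data and kidney.hpo objects below are stand-ins rebuilt from the sample rows in the question, not the asker's real files.

```r
library(dplyr)
library(tidyr)

# Stand-in for the asker's data, rebuilt from the sample rows in the question
pt.data <- data.frame(
  participant_id    = c(123, 123, 123, 432, 432, 432),
  hpo_term          = c("kidney failure", "hand tremor", "kidney transplant",
                        "hypertension", "exotropia", "scissor gait"),
  year_of_birth     = c(2000, 2000, 2000, 1980, 1980, 1980),
  affected_relative = c("Y", "Y", "Y", "N", "N", "N"),
  genome            = c(38, 38, 38, 37, 37, 37)
)

# Stand-in for read.delim("kidney_hpo_terms.txt", header = F):
# a one-column data frame whose column is named V1
kidney.hpo <- data.frame(V1 = c("kidney failure", "kidney transplant", "hypertension"))

wide <- pt.data %>%
  # Compare against the column, not the whole data frame
  mutate(group = ifelse(hpo_term %in% kidney.hpo$V1,
                        "kidney_hpo_term", "non_kidney_hpo_term")) %>%
  # Named list: one aggregation function per values_from column
  pivot_wider(names_from = group, values_from = hpo_term,
              values_fn = list(hpo_term = function(x) paste(x, collapse = ";")))

print(wide)
```

Naming the groups directly in mutate makes the setNames step unnecessary, and the per-participant columns (year_of_birth, affected_relative, genome) are carried along as identifier columns.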
This can be done with dcast from data.table, as follows:
dtt[, group := paste0(
ifelse(hpo_term %in% kidney_hpo, 'kidney', 'non_kidney'), '_hpo_term')]
dcast(dtt, ... ~ group, value.var = 'hpo_term',
fun.aggregate = paste, collapse = ';')
# participant_id year_of_birth affected_relative genome kidney_hpo_term
# 1: 123 2000 Y 38 kidney failure;kidney transplant
# 2: 432 1980 N 37 hypertension
# non_kidney_hpo_term
# 1: hand tremor
# 2: exotropia;scissor gait
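The snippet above uses dtt and kidney_hpo without defining them. A minimal self-contained sketch follows, assuming dtt is a data.table built from the sample rows in the question and kidney_hpo is a plain character vector of the kidney terms; both definitions are illustrative and not taken from the answer.

```r
library(data.table)

# Assumed setup: sample rows from the question as a data.table
dtt <- data.table(
  participant_id    = c(123, 123, 123, 432, 432, 432),
  hpo_term          = c("kidney failure", "hand tremor", "kidney transplant",
                        "hypertension", "exotropia", "scissor gait"),
  year_of_birth     = c(2000, 2000, 2000, 1980, 1980, 1980),
  affected_relative = c("Y", "Y", "Y", "N", "N", "N"),
  genome            = c(38, 38, 38, 37, 37, 37)
)

# Assumed setup: kidney lookup as a character vector
kidney_hpo <- c("kidney failure", "kidney transplant", "hypertension")

# Label each row with the name of its target column
dtt[, group := paste0(
  ifelse(hpo_term %in% kidney_hpo, "kidney", "non_kidney"), "_hpo_term")]

# One row per participant; terms within each group are pasted with ";"
res <- dcast(dtt, ... ~ group, value.var = "hpo_term",
             fun.aggregate = paste, collapse = ";")

print(res)
```

Here every term that is not in kidney_hpo falls into the non-kidney column, so a term missing from both lookup tables would also end up there; filtering first would be needed if that is not wanted.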