R: merge two datasets, spreading multiple matches into separate columns
I have two datasets:
PeopleList<-structure(list(MRN = c("53634", "65708", "64320", "40458", "03935",
"67473", "20281", "52479", "10261", "40945", "40630", "92295",
"43505", "80719", "39492", "44720", "70691", "21351", "03457",
"02182"), DOB = c("9/13/1953", "4/5/1948", "4/18/1944", "9/6/1953",
"1/14/1957", "8/25/1952", "6/4/1967", "7/22/1988", "6/22/1947",
"5/10/1957", "1/12/1968", "4/3/1979", "8/26/1961", "5/25/1965",
"8/21/1955", "9/17/1936", "9/13/1965", "3/23/1942", "5/16/1992",
"3/6/1969"), Gender = c("Female", "Female", "Male", "Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Female", "Female", "Female", "Female", "Female", "Female",
"Female", "Female"), `Smoking Status` = c("Never Smoker", "Former Smoker",
"Never Smoker", "Never Smoker", "Former Smoker", "Former Smoker",
"Never Smoker", "Never Smoker", "Former Smoker", "Never Smoker",
"Never Smoker", "Former Smoker", "Never Smoker", "Former Smoker",
"Former Smoker", "Former Smoker", "Never Smoker", "Never Smoker",
"Never Smoker", "Never Smoker")), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Complications<-structure(list(MRN = c("03412", "25052", "64320", "64320", "64320",
"47595", "47595", "45175", "45337", "93708", "03348", "12964",
"12964", "46272", "46272", "46272", "46272", "71331", "57923",
"57923"), `ENCOUNTER DIAGNOSES` = c("Rupture of implant of right breast, subsequent encounter [T85.43XD]; Rupture of implant of right breast, subsequent encounter [T85.43XD]; Rupture of implant of right breast, subsequent encounter [T85.43XD]",
"Breast asymmetry [N64.89]; Rupture of implant of left breast, sequela [T85.43XS]; Rupture of implant of left breast, sequela [T85.43XS]; Rupture of implant of left breast, sequela [T85.43XS]",
"Extrusion of breast implant, subsequent encounter [T85.49XD]; Extrusion of breast implant, subsequent encounter [T85.49XD]; Extrusion of breast implant, subsequent encounter [T85.49XD]",
"Extrusion of breast implant, subsequent encounter [T85.49XD]; Extrusion of breast implant, subsequent encounter [T85.49XD]; Extrusion of breast implant, subsequent encounter [T85.49XD]",
"Breast asymmetry [N64.89]", "Fat necrosis (segmental) of breast [N64.1]",
"Fat necrosis (segmental) of breast [N64.1]", "Hematoma of breast [N64.89]",
"Acquired breast deformity [N64.89]", "Capsular contracture of breast implant, sequela [T85.44XS]; Capsular contracture of breast implant, sequela [T85.44XS]; Capsular contracture of breast implant, sequela [T85.44XS]",
"Infected sebaceous cyst [L72.3, L08.9]", "Pain due to any device, implant or graft, subsequent encounter [T85.848D]",
"Pain due to any device, implant or graft, sequela [T85.848S]",
"Breast asymmetry [N64.89]", "Breast asymmetry [N64.89]", "Breast asymmetry [N64.89]",
"Breast asymmetry [N64.89]", "Acquired breast deformity [N64.89]",
"Hematoma of breast [N64.89]", "Hematoma of breast [N64.89]")), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
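`Complications` is a data frame with thousands of people, most of whom I don't necessarily care about. `PeopleList` is the 500 or so people I do care about. What I want to do is merge the information from `Complications` into `PeopleList` by MRN, keeping only the MRNs from `PeopleList`.
That part is easy; I can do:
PeopleList <- PeopleList %>% left_join(Complications, by = "MRN")
The problem is that I only want to merge the unique `ENCOUNTER DIAGNOSES`, and if an MRN has multiple matches, I want them spread across multiple columns rather than rows (there should be no more than 5-6 new columns at most).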
How about this?
library(dplyr)
library(tidyr)

PeopleList %>% left_join(
  Complications %>%                # reshape to one row per MRN
    distinct() %>%                 # drop exact duplicate rows
    group_by(MRN) %>%
    mutate(
      rank = paste("Diagnosis", row_number(), sep = "_")  # Diagnosis_1, Diagnosis_2, ... within each MRN
    ) %>%
    ungroup() %>%
    # long to wide; pivot_wider() supersedes spread() in tidyr >= 1.0.0
    pivot_wider(names_from = rank, values_from = `ENCOUNTER DIAGNOSES`),
  by = "MRN"                       # join key
)
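Note that some `ENCOUNTER DIAGNOSES` cells in the sample data repeat the same diagnosis several times, separated by "; ". Dropping duplicate rows does not collapse those repeats. If the goal is one column per distinct diagnosis, a possible variation is to split each cell first. The sketch below is only an illustration under that assumption: `comp` is a small hypothetical stand-in for `Complications`, and it assumes "; " never appears inside a single diagnosis name.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for `Complications`, using the same column names
comp <- tibble(
  MRN = c("64320", "64320", "64320", "12964"),
  `ENCOUNTER DIAGNOSES` = c(
    "Extrusion of breast implant, subsequent encounter [T85.49XD]; Extrusion of breast implant, subsequent encounter [T85.49XD]",
    "Extrusion of breast implant, subsequent encounter [T85.49XD]",
    "Breast asymmetry [N64.89]",
    "Pain due to any device, implant or graft, sequela [T85.848S]"
  )
)

wide <- comp %>%
  separate_rows(`ENCOUNTER DIAGNOSES`, sep = ";\\s*") %>%  # one diagnosis per row
  distinct() %>%                                           # unique MRN/diagnosis pairs
  group_by(MRN) %>%
  mutate(rank = paste("Diagnosis", row_number(), sep = "_")) %>%
  ungroup() %>%
  pivot_wider(names_from = rank, values_from = `ENCOUNTER DIAGNOSES`)

wide
```

The resulting `wide` table can then be passed to `left_join(PeopleList, wide, by = "MRN")` in the same way as above; MRNs with fewer diagnoses get `NA` in the unused columns.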