根据条件连接行中的值
concatenating values from rows in based on criteria
我有一个数据框 df
(请参阅下面的代码),其中包含将近 100,000 行,显示了我的程序的联系人列表。该列表有一列显示联系人关联的程序 program
和组织 O_ID
,还有一列显示联系人在程序中的角色。只要联系人在多个程序中或在程序中具有多个角色,就会为该联系人创建另一行,其中程序和联系人角色字段值发生变化。
First Last C_ID OrgName O_ID Program Role
John Smith 10045 Acme 901 Buildings Primary
John Smith 10045 Acme 901 Buildings Communications
John Smith 10045 Acme 901 Homes Primary
Teddy Bush 10046 Acme 901 Buildings Primary
Teddy Bush 10046 Acme 901 Buildings Signatory
Jess Clinton 10050 Consult 904 Homes Signatory
Jess Clinton 10050 Consult 904 Homes Primary
Jess Clinton 10050 Consult 904 Homes Communications
出于演示目的,我尽量减少行数。具体来说,如果联系人在同一个组织和同一个程序中,我只希望联系人出现在一行中(目前是几行)并将联系人角色组合成一个字符串。
我试过这段代码,它部分有效:ddply(df,.(df$C_ID, df$Program, df$O_ID), paste, sep=",")
结果如下:
df$C_ID df$Program df$O_ID V1 V2
1 10045 Buildings 901 c("John", "John") c("Smith", "Smith")
2 10045 Homes 901 John Smith
3 10046 Buildings 901 c("Teddy", "Teddy") c("Bush", "Bush")
4 10050 Homes 904 c("Jess", "Jess", "Jess") c("Clinton", "Clinton", "Clinton")
V3 V4 V5 V6
1 c(10045, 10045) c("Acme", "Acme") c(901, 901) c("Buildings", "Buildings")
2 10045 Acme 901 Homes
3 c(10046, 10046) c("Acme", "Acme") c(901, 901) c("Buildings", "Buildings")
4 c(10050, 10050, 10050) c("Consult", "Consult", "Consult") c(904, 904, 904) c("Homes", "Homes", "Homes")
V7
1 c("Primary", "Communications")
2 Primary
3 c("Primary", "Signatory")
4 c("Signatory", "Primary", "Communications")
问题是
1) 重新排列了列(注意我的实际数据集中有更多的列)并且列名消失了
2) 唯一具有更改值的列应该在 Role
列中。但是,即使合并的值相同,结果也会合并大多数列的值。例如,在结果列 V1
(名字列)中,returns c("John", "John")
。它应该只是读作 "John"。唯一应该具有不同值的列是列 V7
c("Primary", "Communications")
df<-structure(list(First = c("John", "John", "John", "Teddy", "Teddy",
"Jess", "Jess", "Jess"), Last = c("Smith", "Smith", "Smith",
"Bush", "Bush", "Clinton", "Clinton", "Clinton"), C_ID = c(10045L,
10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L), OrgName = c("Acme",
"Acme", "Acme", "Acme", "Acme", "Consult", "Consult", "Consult"
), O_ID = c(901L, 901L, 901L, 901L, 901L, 904L, 904L, 904L),
Program = c("Buildings", "Buildings", "Homes", "Buildings",
"Buildings", "Homes", "Homes", "Homes"), Role = c("Primary",
"Communications", "Primary", "Primary", "Signatory", "Signatory",
"Primary", "Communications")), .Names = c("First", "Last",
"C_ID", "OrgName", "O_ID", "Program", "Role"), class = "data.frame", row.names = c(NA,
-8L))
您在 paste
中需要的是 collapse = ", "
,而不是 sep
。使用 collapse
从所有输入中创建一个字符串。为此,我对所有标识列(名称、组织、程序等)进行分组,然后折叠 summarise
.
中的角色
library(tidyverse)
df %>%
group_by(First, Last, C_ID, OrgName, O_ID, Program) %>%
summarise(roles_mult = paste(Role, collapse = ", "))
#> # A tibble: 4 x 7
#> # Groups: First, Last, C_ID, OrgName, O_ID [?]
#> First Last C_ID OrgName O_ID Program roles_mult
#> <chr> <chr> <int> <chr> <int> <chr> <chr>
#> 1 Jess Clinton 10050 Consult 904 Homes Signatory, Primary, Communi…
#> 2 John Smith 10045 Acme 901 Buildings Primary, Communications
#> 3 John Smith 10045 Acme 901 Homes Primary
#> 4 Teddy Bush 10046 Acme 901 Buildings Primary, Signatory
您也可以使用 dplyr
来完成。
> df %>% distinct(First, Last, .keep_all=T)
First Last C_ID OrgName O_ID Program Role
1 John Smith 10045 Acme 901 Buildings Primary
2 Teddy Bush 10046 Acme 901 Buildings Primary
3 Jess Clinton 10050 Consult 904 Homes Signatory
我有一个数据框 df
(请参阅下面的代码),其中包含将近 100,000 行,显示了我的程序的联系人列表。该列表有一列显示联系人关联的程序 program
和组织 O_ID
,还有一列显示联系人在程序中的角色。只要联系人在多个程序中或在程序中具有多个角色,就会为该联系人创建另一行,其中程序和联系人角色字段值发生变化。
First Last C_ID OrgName O_ID Program Role
John Smith 10045 Acme 901 Buildings Primary
John Smith 10045 Acme 901 Buildings Communications
John Smith 10045 Acme 901 Homes Primary
Teddy Bush 10046 Acme 901 Buildings Primary
Teddy Bush 10046 Acme 901 Buildings Signatory
Jess Clinton 10050 Consult 904 Homes Signatory
Jess Clinton 10050 Consult 904 Homes Primary
Jess Clinton 10050 Consult 904 Homes Communications
出于演示目的,我尽量减少行数。具体来说,如果联系人在同一个组织和同一个程序中,我只希望联系人出现在一行中(目前是几行)并将联系人角色组合成一个字符串。
我试过这段代码,它部分有效:ddply(df,.(df$C_ID, df$Program, df$O_ID), paste, sep=",")
结果如下:
df$C_ID df$Program df$O_ID V1 V2
1 10045 Buildings 901 c("John", "John") c("Smith", "Smith")
2 10045 Homes 901 John Smith
3 10046 Buildings 901 c("Teddy", "Teddy") c("Bush", "Bush")
4 10050 Homes 904 c("Jess", "Jess", "Jess") c("Clinton", "Clinton", "Clinton")
V3 V4 V5 V6
1 c(10045, 10045) c("Acme", "Acme") c(901, 901) c("Buildings", "Buildings")
2 10045 Acme 901 Homes
3 c(10046, 10046) c("Acme", "Acme") c(901, 901) c("Buildings", "Buildings")
4 c(10050, 10050, 10050) c("Consult", "Consult", "Consult") c(904, 904, 904) c("Homes", "Homes", "Homes")
V7
1 c("Primary", "Communications")
2 Primary
3 c("Primary", "Signatory")
4 c("Signatory", "Primary", "Communications")
问题是
1) 重新排列了列(注意我的实际数据集中有更多的列)并且列名消失了
2) 唯一具有更改值的列应该在 Role
列中。但是,即使合并的值相同,结果也会合并大多数列的值。例如,在结果列 V1
(名字列)中,returns c("John", "John")
。它应该只是读作 "John"。唯一应该具有不同值的列是列 V7
c("Primary", "Communications")
df<-structure(list(First = c("John", "John", "John", "Teddy", "Teddy",
"Jess", "Jess", "Jess"), Last = c("Smith", "Smith", "Smith",
"Bush", "Bush", "Clinton", "Clinton", "Clinton"), C_ID = c(10045L,
10045L, 10045L, 10046L, 10046L, 10050L, 10050L, 10050L), OrgName = c("Acme",
"Acme", "Acme", "Acme", "Acme", "Consult", "Consult", "Consult"
), O_ID = c(901L, 901L, 901L, 901L, 901L, 904L, 904L, 904L),
Program = c("Buildings", "Buildings", "Homes", "Buildings",
"Buildings", "Homes", "Homes", "Homes"), Role = c("Primary",
"Communications", "Primary", "Primary", "Signatory", "Signatory",
"Primary", "Communications")), .Names = c("First", "Last",
"C_ID", "OrgName", "O_ID", "Program", "Role"), class = "data.frame", row.names = c(NA,
-8L))
您在 paste
中需要的是 collapse = ", "
,而不是 sep
。使用 collapse
从所有输入中创建一个字符串。为此,我对所有标识列(名称、组织、程序等)进行分组,然后折叠 summarise
.
library(tidyverse)
df %>%
group_by(First, Last, C_ID, OrgName, O_ID, Program) %>%
summarise(roles_mult = paste(Role, collapse = ", "))
#> # A tibble: 4 x 7
#> # Groups: First, Last, C_ID, OrgName, O_ID [?]
#> First Last C_ID OrgName O_ID Program roles_mult
#> <chr> <chr> <int> <chr> <int> <chr> <chr>
#> 1 Jess Clinton 10050 Consult 904 Homes Signatory, Primary, Communi…
#> 2 John Smith 10045 Acme 901 Buildings Primary, Communications
#> 3 John Smith 10045 Acme 901 Homes Primary
#> 4 Teddy Bush 10046 Acme 901 Buildings Primary, Signatory
您也可以使用 dplyr
来完成。
> df %>% distinct(First, Last, .keep_all=T)
First Last C_ID OrgName O_ID Program Role
1 John Smith 10045 Acme 901 Buildings Primary
2 Teddy Bush 10046 Acme 901 Buildings Primary
3 Jess Clinton 10050 Consult 904 Homes Signatory