Creating columns from unorganized data based on substrings
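I have run into the following problem with my thesis data. After the first column, "id", my data frame contains string cells in no consistent horizontal order. I want to rearrange the strings within each row so that all strings starting with the same first 4 characters end up in the same column.
Since the number of relevant categories is limited (fewer than 20), I could do this manually, starting with "Arra", then "Comm", and so on. I tried this with grepl, but could not get it to return the original cell strings; I only got TRUE/FALSE. Any help is much appreciated!
My data currently looks like this (NA cells are left blank):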
id  col2            col3          col4         col5
3   Commitment 100  Lead Mgmt 15  Arranger 50
8   Arrangement 20  Front-end 80
16  Lead mgmt 40    Commitmnt 20
17
20  Arranger 50
It should look like this:
id  Arra            Comm            Fron          Lead
3   Arranger 50     Commitment 100                Lead Mgmt 15
8   Arrangement 20                  Front-end 80
16                  Commitmnt 20                  Lead mgmt 40
17
20  Arranger 50
Here is one possible approach:
library(data.table)
dcast(melt(as.data.table(mydf), "id", na.rm = TRUE)[value != ""][
, ind := substr(value, 1, 4)], id ~ ind, value.var = "value", fill = "")
#    id           Arra           Comm         Fron         Lead
# 1:  3    Arranger 50 Commitment 100              Lead Mgmt 15
# 2:  8 Arrangement 20                Front-end 80
# 3: 16                  Commitmnt 20              Lead mgmt 40
# 4: 20    Arranger 50
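If the one-liner is hard to read, the same chain can be unrolled into named steps. This is only a sketch of the same idea: the intermediate names `long` and `wide` are my own, and `mydf` is rebuilt here with `col5` as a character NA column (an assumption, so that all melted columns share one type).

```r
library(data.table)

# Sample data; col5 is typed as character NA here (assumption).
mydf <- data.frame(
  id   = c(3L, 8L, 16L, 17L, 20L),
  col2 = c("Commitment 100", "Arrangement 20", "Lead mgmt 40", "", "Arranger 50"),
  col3 = c("Lead Mgmt 15", "Front-end 80", "Commitmnt 20", "", ""),
  col4 = c("Arranger 50", "", "", "", ""),
  col5 = NA_character_,
  stringsAsFactors = FALSE
)

# Step 1: wide -> long, one row per (id, cell); NA cells are dropped.
long <- melt(as.data.table(mydf), id.vars = "id", na.rm = TRUE)

# Step 2: drop the empty-string cells as well.
long <- long[value != ""]

# Step 3: the first 4 characters become the target column name.
long[, ind := substr(value, 1, 4)]

# Step 4: long -> wide, one column per prefix, "" where nothing matched.
wide <- dcast(long, id ~ ind, value.var = "value", fill = "")
print(wide)
```

Note that id 17 has no non-empty cells, so it disappears in step 2 and is absent from the result.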
And, using similar logic, in the "tidyverse":
library(tidyverse)
mydf[is.na(mydf)] <- ""
mydf %>%
  gather(var, val, starts_with("col")) %>%
  filter(val != "") %>%
  mutate(ind = substr(val, 1, 4)) %>%
  select(-var) %>%
  spread(ind, val)
#   id           Arra           Comm         Fron         Lead
# 1  3    Arranger 50 Commitment 100         <NA> Lead Mgmt 15
# 2  8 Arrangement 20           <NA> Front-end 80         <NA>
# 3 16           <NA>   Commitmnt 20         <NA> Lead mgmt 40
# 4 20    Arranger 50           <NA>         <NA>         <NA>
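In current tidyr, `gather()` and `spread()` are superseded by `pivot_longer()` and `pivot_wider()`. A rough equivalent of the pipeline above might look like the following; the result name `wide` is my own, and `mydf` is rebuilt with `col5` as a character NA column (an assumption, to keep the column types uniform). `names_sort = TRUE` is used to get the alphabetical column order that `spread()` produced.

```r
library(dplyr)
library(tidyr)

# Sample data; col5 is typed as character NA here (assumption).
mydf <- data.frame(
  id   = c(3L, 8L, 16L, 17L, 20L),
  col2 = c("Commitment 100", "Arrangement 20", "Lead mgmt 40", "", "Arranger 50"),
  col3 = c("Lead Mgmt 15", "Front-end 80", "Commitmnt 20", "", ""),
  col4 = c("Arranger 50", "", "", "", ""),
  col5 = NA_character_,
  stringsAsFactors = FALSE
)

wide <- mydf %>%
  # wide -> long: one row per (id, cell)
  pivot_longer(starts_with("col"), names_to = "var", values_to = "val") %>%
  # drop NA and empty cells
  filter(!is.na(val), val != "") %>%
  # first 4 characters become the target column name
  mutate(ind = substr(val, 1, 4)) %>%
  select(-var) %>%
  # long -> wide: one column per prefix, NA where nothing matched
  pivot_wider(names_from = ind, values_from = val, names_sort = TRUE)

print(wide)
```

As with `spread()`, unmatched cells come back as NA rather than "", and id 17 is dropped because it has no non-empty cells.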
Sample data:
mydf <- structure(list(id = c(3L, 8L, 16L, 17L, 20L), col2 = c("Commitment 100",
"Arrangement 20", "Lead mgmt 40", "", "Arranger 50"), col3 = c("Lead Mgmt 15",
"Front-end 80", "Commitmnt 20", "", ""), col4 = c("Arranger 50",
"", "", "", ""), col5 = c(NA, NA, NA, NA, NA)), .Names = c("id",
"col2", "col3", "col4", "col5"), row.names = c(NA, 5L), class = "data.frame")
If your original data contains duplicated stubs, for example if "col5" in row 1 holds another "Commitment" value:
mydf$col5[1] <- "Commitment 99"
you could try something like this:
dcast(melt(as.data.table(mydf), "id", na.rm = TRUE)[value != ""][
, ind := substr(value, 1, 4)],
id ~ ind + rowid(id, ind), value.var = "value", fill = "")
#    id         Arra_1         Comm_1        Comm_2       Fron_1       Lead_1
# 1:  3    Arranger 50 Commitment 100 Commitment 99              Lead Mgmt 15
# 2:  8 Arrangement 20                              Front-end 80
# 3: 16                  Commitmnt 20                            Lead mgmt 40
# 4: 20    Arranger 50
Or this, which keeps only the first value per stub:
dcast(melt(as.data.table(mydf), "id", na.rm = TRUE)[value != ""][
, ind := substr(value, 1, 4)],
id ~ ind, value.var = "value", fun.aggregate = function(x) x[1], fill = "")
#    id           Arra           Comm         Fron         Lead
# 1:  3    Arranger 50 Commitment 100              Lead Mgmt 15
# 2:  8 Arrangement 20                Front-end 80
# 3: 16                  Commitmnt 20              Lead mgmt 40
# 4: 20    Arranger 50
Which one to use depends on the output you want.
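As for the `grepl` attempt in the question: `grepl` only tests for a match, which is why it returns TRUE/FALSE. The logical vector can still be used to index the row and pull out the original string. Below is a base R sketch of the manual route described in the question. The `stubs` vector and the `pick` helper are hypothetical names of my own, `col5` is typed as character NA (an assumption), and only the first match per row and prefix is kept.

```r
# Sample data; col5 is typed as character NA here (assumption).
mydf <- data.frame(
  id   = c(3L, 8L, 16L, 17L, 20L),
  col2 = c("Commitment 100", "Arrangement 20", "Lead mgmt 40", "", "Arranger 50"),
  col3 = c("Lead Mgmt 15", "Front-end 80", "Commitmnt 20", "", ""),
  col4 = c("Arranger 50", "", "", "", ""),
  col5 = NA_character_,
  stringsAsFactors = FALSE
)

# Hand-picked prefixes (hypothetical; extend to your own categories).
stubs <- c("Arra", "Comm", "Fron", "Lead")

# Character matrix of the unorganized cells, with NA treated as empty.
cells <- as.matrix(mydf[-1])
cells[is.na(cells)] <- ""

# Return the first cell in `row` that starts with `stub`, or "" if none does.
pick <- function(row, stub) {
  hit <- row[grepl(paste0("^", stub), row)]  # logical index -> original strings
  if (length(hit) > 0) unname(hit[1]) else ""
}

# Build one output column per prefix.
out <- data.frame(id = mydf$id, stringsAsFactors = FALSE)
for (s in stubs) {
  out[[s]] <- vapply(seq_len(nrow(cells)),
                     function(i) pick(cells[i, ], s),
                     character(1))
}
print(out)
```

Unlike the reshaping approaches, this keeps id 17 as an all-empty row, which matches the desired output shown in the question.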