Split dataframe on filtered character and make multiple new columns
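I have a data preprocessing problem that comes up often in my work. I usually have two files on which I ultimately want to perform a large matching operation.
It is usually a two-step process: the first step is to build a "cleaned" data frame from the first file, and the second is to match it (vlookup-style) against the second file, which is a larger data frame. I need help with the first step.
I have created a simple example below to work on.
My simplified data frame: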
c1 <- 1:15
c2 <- c("Valuelabels", "V1", "1", "2", "Valuelabels", "V2", "1", "2", "3", "Valuelabels", "V3", "1", "2", "3", "4")
c3 <- c("", "", "Male", "Female", "", "", "Married", "Single", "Other", "", "", "SingleWithChildren", "SingleWithoutChildren","MarriedWithChildren", "PartneredWithChildren")
df <- data.frame(row.names =c1,c2,c3)
df
            c2                    c3
1  Valuelabels
2           V1
3            1                  Male
4            2                Female
5  Valuelabels
6           V2
7            1               Married
8            2                Single
9            3                 Other
10 Valuelabels
11          V3
12           1    SingleWithChildren
13           2 SingleWithoutChildren
14           3   MarriedWithChildren
15           4 PartneredWithChildren
Now I would like to split the data frame on the "Valuelabels" string in the first column, and end up with a new data frame that looks like this:
   V1 V1_match V2 V2_match V3              V3_match
1:  1     Male  1  Married  1    SingleWithChildren
2:  2   Female  2   Single  2 SingleWithoutChildren
3: NA           3    Other  3   MarriedWithChildren
4: NA          NA           4 PartneredWithChildren
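In the end I want a data frame with V1 as a column name and the labels matching those values in a new column named V1_match, as in my example, and likewise for V2 and V3.
This data frame will conclude my first step, before I match it against the larger data frame.
Thank you very much for your help.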
Here is a possible data.table solution:
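library(data.table) # v 1.9.5
# Label every row with the variable name found in the second row of its block
setDT(df)[, indx := c2[2L], by = cumsum(c2 == "Valuelabels")]
# Keep only the all-digit code rows, then number them within each variable
df2 <- df[!grepl("\\D", c2)][, indx2 := seq_len(.N), by = indx]
# Cast to wide format: one pair of columns (c2, c3) per variable
dcast(df2, indx2 ~ indx, value.var = c("c2", "c3"))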
#    indx2 V1_c2 V2_c2 V3_c2  V1_c3   V2_c3                 V3_c3
# 1:     1     1     1     1   Male Married    SingleWithChildren
# 2:     2     2     2     2 Female  Single SingleWithoutChildren
# 3:     3    NA     3     3     NA   Other   MarriedWithChildren
# 4:     4    NA    NA     4     NA      NA PartneredWithChildren
You need to install data.table v 1.9.5 or later to run this, using:
library(devtools)
install_github("Rdatatable/data.table", build_vignettes = FALSE)
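To see why grouping by `cumsum(c2 == "Valuelabels")` works, here is a minimal base R sketch of the intermediate values. It rebuilds the `c2` column under hypothetical names (`lab`, `grp`, `var_name`, `is_code`) that are not part of the original code, so it does not touch the objects above.

```r
# Same values as the c2 column in the question
lab <- c("Valuelabels", "V1", "1", "2",
         "Valuelabels", "V2", "1", "2", "3",
         "Valuelabels", "V3", "1", "2", "3", "4")

# Every "Valuelabels" row opens a new block, so the running count of
# such rows serves as a block id
grp <- cumsum(lab == "Valuelabels")
grp
# [1] 1 1 1 1 2 2 2 2 2 3 3 3 3 3 3

# match(grp, grp) is the position of the first row of each block;
# the row right after it holds the variable name (V1, V2, V3)
var_name <- lab[match(grp, grp) + 1L]

# Rows with no non-digit character are the actual value codes
is_code <- !grepl("\\D", lab)

# Codes grouped by the variable they belong to
split(lab[is_code], var_name[is_code])
```

The data.table answer performs the same three steps, but stores the variable name in the `indx` column and filters the rows inside `[ ]`.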
Another approach, using base R:
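# Split into blocks at each "Valuelabels" row and drop each block's two header rows
lst <- lapply(split(df, cumsum(df$c2 == "Valuelabels")), tail, -2)
# Full outer join of the blocks on the code column c2
Reduce(function(u, v) merge(u, v, by = "c2", all = TRUE), lst)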
#  c2   c3.x    c3.y                    c3
#1  1   Male Married    SingleWithChildren
#2  2 Female  Single SingleWithoutChildren
#3  3   <NA>   Other   MarriedWithChildren
#4  4   <NA>    <NA> PartneredWithChildren
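Neither result above uses the column names asked for in the question (V1, V1_match, ...). The following is a rough base R sketch of one way to get that layout, not part of either answer. It rebuilds the example data so that it runs on its own, and `pad_block` and `n_max` are hypothetical names introduced only for this illustration.

```r
# Rebuild the example data; stringsAsFactors = FALSE keeps both columns character
c2 <- c("Valuelabels", "V1", "1", "2",
        "Valuelabels", "V2", "1", "2", "3",
        "Valuelabels", "V3", "1", "2", "3", "4")
c3 <- c("", "", "Male", "Female",
        "", "", "Married", "Single", "Other",
        "", "", "SingleWithChildren", "SingleWithoutChildren",
        "MarriedWithChildren", "PartneredWithChildren")
df <- data.frame(c2, c3, stringsAsFactors = FALSE)

# One block per "Valuelabels" row
blocks <- split(df, cumsum(df$c2 == "Valuelabels"))

# Number of code rows in the longest block (each block has two header rows)
n_max <- max(sapply(blocks, nrow)) - 2L

# Hypothetical helper: drop the two header rows, pad with NA up to n_max
# rows, and name the two columns after the block's variable
pad_block <- function(b) {
  var_name <- b$c2[2L]
  idx <- seq_len(n_max) + 2L   # positions 3, 4, ...; indexing past the end yields NA
  out <- data.frame(as.integer(b$c2[idx]), b$c3[idx], stringsAsFactors = FALSE)
  names(out) <- c(var_name, paste0(var_name, "_match"))
  out
}

# unname() stops cbind from prefixing the column names with the block ids
result <- do.call(cbind, unname(lapply(blocks, pad_block)))
result
```

Unlike the `merge` approach, this pads the shorter blocks by position rather than joining on the code value, so it assumes the codes in every block are listed in the same order.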