Separate string into new columns of unknown number
I have a dataset that looks like this:
library(tibble)
data = tibble(emp = c(1:4),
              idstring = c("PER20384|PER49576|PER10837|PER92641",
                           "PER20384|PER49576|PER03875|PER72534",
                           "PER20384|PER98642|PER17134",
                           "PER20384|PER98623|PER17134|PER01836|PER1234"))
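I want to split idstring on "|" into separate columns. However, I need the rightmost element (e.g. "PER92641") to always end up in the column labelled "Level_1", while the column holding the leftmost element varies with the number of elements in the row.
I tried some basic steps, such as:
library(stringr)
data_split = str_split(data$idstring, "\\|", simplify = TRUE)
colnames(data_split) = paste0("Level_", ncol(data_split):1)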
But I get incorrect output like this:
     Level_5    Level_4    Level_3    Level_2    Level_1
[1,] "PER20384" "PER49576" "PER10837" "PER92641" ""
[2,] "PER20384" "PER49576" "PER03875" "PER72534" ""
[3,] "PER20384" "PER98642" "PER17134" ""         ""
[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234"
It should look like this:
     Level_5    Level_4    Level_3    Level_2    Level_1
[1,] NA         "PER20384" "PER49576" "PER10837" "PER92641"
[2,] NA         "PER20384" "PER49576" "PER03875" "PER72534"
[3,] NA         NA         "PER20384" "PER98642" "PER17134"
[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234"
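Note that, ideally, I would also like NA in place of the blanks where applicable.
I feel like I could somehow reverse the order of each row, replace the blanks with NA, and then add the column names, but I'm hoping there is a more elegant solution.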
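This can be done by applying order to the NA values. We split 'idstring' at | into a list, then take the max of the lengths of the list elements ('mx'). We use that with length<- to pad each element with NA (by default it pads at the end, not the beginning), then order each vector on its NA elements so the NAs move to the front, and finally rbind the list elements.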
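# split on the literal "|" (fixed = TRUE avoids regex interpretation)
lst1 <- strsplit(data$idstring, "|", fixed = TRUE)
# length of the longest split
mx <- max(lengths(lst1))
out <- do.call(rbind, lapply(lst1, function(x) {
  length(x) <- mx        # pads with NA at the end
  x[order(!is.na(x))]    # moves the NAs to the front, keeps the rest in order
}))
colnames(out) <- paste0("Level_", ncol(out):1)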
-output
#     Level_5    Level_4    Level_3    Level_2    Level_1
#[1,] NA         "PER20384" "PER49576" "PER10837" "PER92641"
#[2,] NA         "PER20384" "PER49576" "PER03875" "PER72534"
#[3,] NA         NA         "PER20384" "PER98642" "PER17134"
#[4,] "PER20384" "PER98623" "PER17134" "PER01836" "PER1234"
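To see why the padding-then-ordering step works, it may help to look at a single row in isolation. The following is a minimal sketch using the third idstring from the question; the names `x`, `idx` and `padded` are only illustrative.

```r
# One row from the question, split on the literal "|" character
x <- strsplit("PER20384|PER98642|PER17134", "|", fixed = TRUE)[[1]]

# `length<-` extends the vector to 5 elements; the new slots are NA and sit at the end
length(x) <- 5
x
# "PER20384" "PER98642" "PER17134" NA NA

# !is.na(x) is FALSE for the NA slots and TRUE otherwise. FALSE sorts before TRUE,
# and order() leaves tied elements in their original order.
idx <- order(!is.na(x))
idx
# 4 5 1 2 3

padded <- x[idx]
padded
# NA NA "PER20384" "PER98642" "PER17134"
```

Because ties keep their original order, the non-NA IDs stay in the sequence they had in the string; only the NA slots are moved.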
Another option is to read the columns with read.table and then modify the row values by moving the NA elements to the front
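# fill = TRUE pads short rows; na.strings = "" turns the padded blanks into NA
d1 <- read.table(text = data$idstring, sep = "|", header = FALSE,
                 fill = TRUE, na.strings = c(""), col.names = paste0('Level_', 5:1))
# for each row, put the NA elements first, then the non-NA elements
d1[] <- t(apply(d1, 1, function(x) c(x[is.na(x)], x[!is.na(x)])))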
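If the split columns need to sit next to the original `emp` column, the splitting can be wrapped in a small helper. This is only a sketch under a few assumptions: `split_right_aligned` is a hypothetical name, not a function from any package; it pads on the left with `rep(NA_character_, ...)` rather than using the `order` trick; and a plain data.frame stands in for the tibble so the example has no dependencies.

```r
# Hypothetical helper: split each string on `sep` and right-align the pieces,
# so the last piece is always in Level_1 and missing leading slots are NA.
split_right_aligned <- function(x, sep = "|", prefix = "Level_") {
  pieces <- strsplit(x, sep, fixed = TRUE)
  mx <- max(lengths(pieces))
  out <- do.call(rbind, lapply(pieces, function(p) {
    c(rep(NA_character_, mx - length(p)), p)  # pad on the left with NA
  }))
  colnames(out) <- paste0(prefix, mx:1)
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Same values as the question's data, as a base data.frame
data <- data.frame(
  emp = 1:4,
  idstring = c("PER20384|PER49576|PER10837|PER92641",
               "PER20384|PER49576|PER03875|PER72534",
               "PER20384|PER98642|PER17134",
               "PER20384|PER98623|PER17134|PER01836|PER1234"),
  stringsAsFactors = FALSE
)

# Bind the employee id to the right-aligned level columns
result <- cbind(data["emp"], split_right_aligned(data$idstring))
result$Level_1
# "PER92641" "PER72534" "PER17134" "PER1234"
```

Padding on the left directly avoids the reordering step, at the cost of being a little less compact than the `length<-` plus `order` version.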