Split character column value into 4 new value columns using gsub and drop values of original column
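I have a column containing array values like this:
[[["0.10", "35"], ["0.2", "36"]], [["5.1", "2"], ["90.2", "2"]]]
I need the last two pairs (in this example: [["5.1", "2"], ["90.2", "2"]]) in 4 separate columns, but only their values:
5.1
2
90.2
and 2
(each in its own column)
I know I can achieve this with tidyr, as described here: split character data into numbers and letters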
df %>%
separate(mycol,
into = c("text", "num"),
sep = "(?<=[A-Za-z])(?=[0-9])"
)
But so far every attempt has failed. I cannot access only the last 2 (or 4) items.
I would be grateful for any ideas. Thanks.
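We can group by row (rowwise), parse each 'mycol' element with fromJSON into a nested list (simplifyVector = FALSE), take its second element, unlist that to a vector, and use as.data.frame.list to convert the vector to a data.frame with 4 columns, wrapped in a list. Then we ungroup, unnest the list column with unnest_wider (from tidyr), and finally convert the column types based on their values with type.convert.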
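library(dplyr)
library(jsonlite)
library(tidyr)
d %>%
  rowwise() %>%
  # parse the JSON string, keep the second (last) group, flatten it to 4 values
  mutate(newcol = list(setNames(
    as.data.frame.list(unlist(fromJSON(mycol, simplifyVector = FALSE)[[2]])),
    paste0("X", 1:4)))) %>%
  ungroup() %>%
  # spread the one-row data.frame in each cell into columns X1..X4
  unnest_wider(c(newcol)) %>%
  # convert the character values to dbl/int where possible
  type.convert(as.is = TRUE)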
Output
# A tibble: 3 x 5
# mycol X1 X2 X3 X4
# <chr> <dbl> <int> <dbl> <int>
#1 "[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]" 5.1 2 90.2 2
#2 "[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]" 5.1 2 90.2 2
#3 "[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]" 5.1 2 90.2 2
Data
d <- structure(list(mycol = c("[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]",
"[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]",
"[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]"
)), class = "data.frame", row.names = c(NA, -3L))
Here is a base R solution based on a regular expression, using @akrun's data:
d1 <- sapply(strsplit(d$mycol, ","), function(x) gsub("(?!\\.)\\D", "", x, perl = TRUE))
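We first split the strings in d$mycol at the commas and pass the pieces to gsub, which removes every non-digit character (\D) that is not a period (.). This gives a character matrix d1 with one column per input row. We then transpose it with t to turn columns into rows and select the data of interest (rows 5 to 8, the last four values):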
d2 <- as.data.frame(t(d1[5:8,]))
d2
V1 V2 V3 V4
1 5.1 2 90.2 2
2 5.1 2 90.2 2
3 5.1 2 90.2 2
If you want to keep the result together with the original data, use cbind and change the column names as needed:
d3 <- cbind(d, d2)
names(d3) <- c("mycol", "x1", "x2", "x3", "x4")
Result:
d3
mycol x1 x2 x3 x4
1 [[["0.10", "35"], ["0.2", "36"]], [["5.1", "2"], ["90.2", "2"]]] 5.1 2 90.2 2
2 [[["0.10", "35"], ["0.2", "36"]], [["5.1", "2"], ["90.2", "2"]]] 5.1 2 90.2 2
3 [[["0.10", "35"], ["0.2", "36"]], [["5.1", "2"], ["90.2", "2"]]] 5.1 2 90.2 2
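The d1[5:8, ] step above relies on every string holding exactly eight values. If the number of leading pairs can vary, a variant along the following lines might be more robust. This is only a sketch, not part of either answer: it assumes each string contains at least four plain decimal numbers, and the names nums, last4 and res are made up for illustration. It extracts every number with regmatches/gregexpr and keeps the last four per row, converting them to numeric along the way.

```r
# Sample data, same strings as in the answers above
d <- data.frame(
  mycol = rep("[[[\"0.10\", \"35\"], [\"0.2\", \"36\"]], [[\"5.1\", \"2\"], [\"90.2\", \"2\"]]]", 3),
  stringsAsFactors = FALSE
)

# Extract every number (digits with an optional decimal part) from each string;
# the result is a list with one character vector per row
nums <- regmatches(d$mycol, gregexpr("[0-9]+(\\.[0-9]+)?", d$mycol))

# Keep only the last four numbers of each row and convert them to numeric;
# vapply stops with an error if a row has fewer than four numbers (assumption)
last4 <- t(vapply(nums, function(x) as.numeric(tail(x, 4)), numeric(4)))

# One row per input string, columns x1..x4 (hypothetical names)
res <- setNames(as.data.frame(last4), paste0("x", 1:4))
res
```

As in the answer above, cbind(d, res) would put the new columns next to the original one.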