Mutate columns such that basenames line up together
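Suppose I have a vector of file paths that I have split on "/" and placed into a data frame. The paths vary in length, but in the end I want all of the basenames to line up in the same column. Below is an example of what I mean, along with the desired output.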
library(tidyverse)
dat <- tibble(
V1 = rep("run1", 5),
V2 = rep("ox", 5),
V3 = c("performance.csv", "analysis", "analysis", "performance.csv", "analysis"),
V4 = c("", "rod1", "rod2", "rod3", "performance.csv"),
V5 = c("", "performance.csv", "performance.csv", "performance.csv", "")
)
dat
#> # A tibble: 5 x 5
#> V1 V2 V3 V4 V5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 run1 ox performance.csv "" ""
#> 2 run1 ox analysis rod1 performance.csv
#> 3 run1 ox analysis rod2 performance.csv
#> 4 run1 ox performance.csv rod3 performance.csv
#> 5 run1 ox analysis performance.csv ""
output <- tibble(
V1 = rep("run1", 5),
V2 = rep("ox", 5),
V3 = c("", "analysis", "analysis", "", "analysis"),
V4 = c("", "rod1", "rod2", "rod3", ""),
V5 = c("performance.csv", "performance.csv", "performance.csv", "performance.csv", "performance.csv")
)
output
#> # A tibble: 5 x 5
#> V1 V2 V3 V4 V5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 run1 ox "" "" performance.csv
#> 2 run1 ox analysis rod1 performance.csv
#> 3 run1 ox analysis rod2 performance.csv
#> 4 run1 ox "" rod3 performance.csv
#> 5 run1 ox analysis "" performance.csv
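My idea was to resort to a for loop that checks whether each column contains the basename and, if so, replaces it with "" and moves it to the last column. I had trouble working out that logic, and I suspect there is a better way that makes use of the tidyverse.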
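For reference, the loop idea described above could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: `move_basename_last` and its `pattern` argument are hypothetical names, and a basename is assumed to be any entry ending in `.csv`.

```r
# Example data from the question, as a plain data frame
dat <- data.frame(
  V1 = rep("run1", 5),
  V2 = rep("ox", 5),
  V3 = c("performance.csv", "analysis", "analysis", "performance.csv", "analysis"),
  V4 = c("", "rod1", "rod2", "rod3", "performance.csv"),
  V5 = c("", "performance.csv", "performance.csv", "performance.csv", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for every row, blank each basename and
# write the last one found into the final column.
move_basename_last <- function(df, pattern = "\\.csv$") {
  last <- ncol(df)
  for (r in seq_len(nrow(df))) {
    row <- unlist(df[r, ], use.names = FALSE)
    hits <- grep(pattern, row)           # positions that look like a basename
    if (length(hits) == 0) next          # nothing to move in this row
    base <- row[hits[length(hits)]]      # keep the last match
    for (h in hits) df[r, h] <- ""       # blank every original position
    df[r, last] <- base                  # place the basename in the last column
  }
  df
}

res <- move_basename_last(dat)
res
```

It works row by row, so it is likely to be slow on large data, which is one reason to prefer a vectorised approach.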
Here is one tidyverse way:
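# gather()/spread() still work but are superseded by pivot_longer()/pivot_wider()
dat %>%
  rownames_to_column("id") %>%
  gather(key, variable, -id) %>%
  group_by(id) %>%
  mutate(
    variable = case_when(
      # last column: take the last ".csv" entry found in this row
      key == "V5" ~ tail(grep(".csv", x = variable, value = TRUE, fixed = TRUE), 1),
      # any other column holding a ".csv" entry is blanked
      key != "V5" & grepl(".csv", x = variable, fixed = TRUE) ~ "",
      TRUE ~ variable
    )
  ) %>%
  ungroup() %>%
  spread(key, variable)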
# A tibble: 5 x 6
id V1 V2 V3 V4 V5
<chr> <chr> <chr> <chr> <chr> <chr>
1 1 run1 ox "" "" performance.csv
2 2 run1 ox analysis rod1 performance.csv
3 3 run1 ox analysis rod2 performance.csv
4 4 run1 ox "" rod3 performance.csv
5 5 run1 ox analysis "" performance.csv
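Create a function, rearrange, that rearranges a single row: it puts the basename at the end and, if the basename was not already at the end, blanks out its original position. We assume that any entry containing a dot is a basename. Then apply rearrange to each row.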
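rearrange <- function(x) {
  i <- grep(".", x, fixed = TRUE)[1]  # first entry containing a literal dot
  x[length(x)] <- x[i]                # copy it into the last position
  if (i < length(x)) x[i] <- ""       # blank the original position unless it was already last
  x
}
as_tibble(t(apply(dat, 1, rearrange)))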
giving:
# A tibble: 5 x 5
V1 V2 V3 V4 V5
<chr> <chr> <chr> <chr> <chr>
1 run1 ox "" "" performance.csv
2 run1 ox analysis rod1 performance.csv
3 run1 ox analysis rod2 performance.csv
4 run1 ox "" rod3 performance.csv
5 run1 ox analysis "" performance.csv
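The `fixed = TRUE` argument in rearrange matters because an unescaped `.` is a regular-expression wildcard. The short base R sketch below illustrates the difference. It also shows a hypothetical variant, `rearrange_safe` (not part of the answer), that returns a row unchanged when no entry contains a dot. In that case the original function would reach `if (NA < length(x))` and stop with an error.

```r
x <- c("run1", "ox", "analysis", "", "performance.csv")

regex_hits   <- grep(".", x)                # "." as a regex matches any single character
fixed_hits   <- grep(".", x, fixed = TRUE)  # literal dot only
escaped_hits <- grep("\\.", x)              # escaped dot, equivalent to the fixed match here

# Hypothetical variant of rearrange() that tolerates rows without a basename
rearrange_safe <- function(x) {
  i <- grep(".", x, fixed = TRUE)
  if (length(i) == 0) return(x)   # no dotted entry: leave the row as it is
  i <- i[1]
  x[length(x)] <- x[i]
  if (i < length(x)) x[i] <- ""
  x
}

moved     <- rearrange_safe(c("run1", "ox", "performance.csv", "", ""))
untouched <- rearrange_safe(c("run1", "ox", "analysis", "rod1", ""))
```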
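A base R option using max.col. For each row, get the column index within the subset of the dataset (columns 3 to 5) where the element contains a ".", cbind it with the row index (seq_len(nrow(dat))), extract the elements of the dataset at those indices, and assign them to 'V5'. Then set columns 3 and 4 to blank ("") wherever the logical matrix (do.call(cbind, ...)) is TRUE.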
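dat <- as.data.frame(dat)
# one logical vector per column: does the entry contain a literal dot?
lst1 <- lapply(dat[3:5], grepl, pattern = '\\.')
# (row, column) index pairs pointing at the first dotted entry in each row
ij <- cbind(seq_len(nrow(dat)), max.col(do.call(cbind, lst1), 'first'))
dat$V5 <- dat[3:5][ij]
# blank the dotted entries in columns 3 and 4
dat[3:4][do.call(cbind, lst1[1:2])] <- ""
dat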
# V1 V2 V3 V4 V5
#1 run1 ox performance.csv
#2 run1 ox analysis rod1 performance.csv
#3 run1 ox analysis rod2 performance.csv
#4 run1 ox rod3 performance.csv
#5 run1 ox analysis performance.csv
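To make the two building blocks of this answer easier to follow, here is a small standalone sketch of `max.col` with `ties.method = "first"` followed by indexing with a two-column (row, column) matrix. The objects `has_dot` and `vals` are written out by hand for illustration and mirror columns V3 to V5 of the example data.

```r
# Logical matrix: TRUE where the entry contains a dot
has_dot <- cbind(
  V3 = c(TRUE, FALSE, FALSE, TRUE, FALSE),
  V4 = c(FALSE, FALSE, FALSE, FALSE, TRUE),
  V5 = c(FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Column of the first TRUE in each row (TRUE counts as 1, FALSE as 0)
first_col <- max.col(has_dot, ties.method = "first")

# Character matrix with the corresponding values
vals <- cbind(
  V3 = c("performance.csv", "analysis", "analysis", "performance.csv", "analysis"),
  V4 = c("", "rod1", "rod2", "rod3", "performance.csv"),
  V5 = c("", "performance.csv", "performance.csv", "performance.csv", "")
)

# A two-column matrix of (row, column) pairs picks one element per row
ij <- cbind(seq_len(nrow(vals)), first_col)
picked <- vals[ij]
```

One caveat: for a row with no TRUE at all, every value ties at zero, so `ties.method = "first"` returns column 1. Rows without a basename would therefore need separate handling.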
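Or use the tidyverse with coalesce. Here we select columns 'V3' to 'V5', loop over the columns (map), replace the elements that are not .csv with NA, coalesce them into a single column, bind that column to the original dataset without its 'V5' column, and replace the entries of columns 3 and 4 that contain a "." with blank ("").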
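library(tidyverse)
dat %>%
  select(V3:V5) %>%
  # keep only the .csv entries, everything else becomes NA
  map_df(~ replace(.x, str_detect(.x, "\\.csv", negate = TRUE), NA)) %>%
  # first non-NA value across V3:V5 for each row
  transmute(V5 = coalesce(!!! .)) %>%
  bind_cols(dat %>%
              select(-V5), .) %>%
  # blank the dotted entries left behind in columns 3 and 4
  mutate_at(vars(3:4), list(~ replace(., str_detect(., "\\."), '')))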
# A tibble: 5 x 5
# V1 V2 V3 V4 V5
# <chr> <chr> <chr> <chr> <chr>
#1 run1 ox "" "" performance.csv
#2 run1 ox analysis rod1 performance.csv
#3 run1 ox analysis rod2 performance.csv
#4 run1 ox "" rod3 performance.csv
#5 run1 ox analysis "" performance.csv
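The coalescing step takes, for each row, the first non-NA value across several vectors. A rough base R equivalent can be sketched as below. `coalesce_base` is a hypothetical helper written for illustration, not a function from dplyr, and the three input vectors are typed by hand to match what the `replace()` step would produce for the example data.

```r
# Hypothetical base R stand-in for dplyr::coalesce():
# walk through the vectors left to right, filling NAs from the next vector.
coalesce_base <- function(...) {
  Reduce(function(a, b) ifelse(is.na(a), b, a), list(...))
}

# Columns V3:V5 after every non-.csv entry has been replaced by NA
v3 <- c("performance.csv", NA, NA, "performance.csv", NA)
v4 <- c(NA, NA, NA, NA, "performance.csv")
v5 <- c(NA, "performance.csv", "performance.csv", "performance.csv", NA)

basenames <- coalesce_base(v3, v4, v5)
basenames
```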