Extract all the unique values for a given substring in the column names

Question

I have a data frame whose column names are structured as follows:

Barcelona.Standard.2012.True
Berlin.One.2013.True
London.One.2014.True
Barcelona.Standard.2015.True
Berlin.One.2016.True

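As you can see, each column name specifies the City, the Type of bank account, the Year it was opened, and whether it is Active.

I would like to extract all the distinct possibilities for each category into a list. For example, for the first category, the city, we would get: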

Barcelona Berlin London

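What I want to do is split on the symbol . and then get all the unique values at a given position across all columns.

I could do it with a loop, but if possible I would like to do it with sapply.

Something like:

lapply(strsplit(colnames(dat), split = "\\.")[[]][1])

where [[]] would stand for all the columns.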

This is just a toy example; the real dataset has thousands of columns.

Answer 1

With sapply (transpose is taken here to be purrr::transpose):

sapply(purrr::transpose(strsplit(col, "\\.")), function(x) unlist(unique(x), recursive = FALSE))

Or use data.table::transpose instead of purrr::transpose to make it even simpler:

sapply(data.table::transpose(strsplit(col, "\\.")), unique)

Finally, use setNames to set the names:

sapply(purrr::transpose(strsplit(col, "\\.")), function(x) unlist(unique(x), recursive = FALSE)) |>
  setNames(c("City", "Type", "Year", "Active"))

Output:

$City
[1] "Barcelona" "Berlin"    "London"   

$Type
[1] "Standard" "One"     

$Year
[1] "2012" "2013" "2014" "2015" "2016"

$Active
[1] "True"

Data

col <- c("Barcelona.Standard.2012.True",
  "Berlin.One.2013.True",
  "London.One.2014.True",
  "Barcelona.Standard.2015.True",
  "Berlin.One.2016.True")
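
For comparison, the same idea can be sketched in base R alone, without a transpose helper: split each name on the literal dot, then collect the unique values found at each position. This is only an illustration under the assumption that every name has exactly four fields; the names parts, fields and res are made up for the example.

```r
# Same toy names as in the Data section above
col <- c("Barcelona.Standard.2012.True",
  "Berlin.One.2013.True",
  "London.One.2014.True",
  "Barcelona.Standard.2015.True",
  "Berlin.One.2016.True")

# fixed = TRUE treats "." as a literal dot, so no regex escaping is needed
parts <- strsplit(col, ".", fixed = TRUE)

# Hypothetical labels for the four positions
fields <- c("City", "Type", "Year", "Active")

# For each position i, pull the i-th piece of every name and keep unique values
res <- lapply(seq_along(fields), function(i) {
  unique(vapply(parts, `[`, character(1), i))
})
names(res) <- fields
res
```

vapply is used instead of sapply so that a name with a missing field fails loudly rather than silently changing the shape of the result.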

Answer 2

Here is a way using tidyr::separate together with the native pipe operator and lambda syntax introduced in base R 4.1.

tidyr::separate(
  dat, x, 
  into = c("City", "Type", "Year", "Active"),
  sep = "[^[:alnum:]]"
) |>
  as.list() |>
  (\(x) Map(unique, x))()
#> $City
#> [1] "Barcelona" "Berlin"    "London"   
#> 
#> $Type
#> [1] "Standard" "One"     
#> 
#> $Year
#> [1] "2012" "2013" "2014" "2015" "2016"
#> 
#> $Active
#> [1] "True"

Created on 2022-02-14 by the reprex package (v2.0.1)
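
As a side note on the separator: the pattern [^[:alnum:]] matches any single character that is not a letter or digit, so it splits on characters other than the dot as well. The short sketch below shows the difference with base strsplit on one name from this answer's data that contains a | character; the variable names are only illustrative.

```r
s <- "London.One|2014.True"

# Splitting on a literal dot leaves "One|2014" as one piece
by_dot <- strsplit(s, ".", fixed = TRUE)[[1]]

# Splitting on any non-alphanumeric character also breaks at "|"
by_nonalnum <- strsplit(s, "[^[:alnum:]]")[[1]]

by_dot
by_nonalnum
```

If the fields themselves could contain punctuation, a narrower separator such as an escaped dot may be the safer choice.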

Edit

Simpler, with lapply instead of Map.

tidyr::separate(
  dat, x, 
  into = c("City", "Type", "Year", "Active"),
  sep = "[^[:alnum:]]"
) |>
  (\(x) lapply(x, unique))()
#> $City
#> [1] "Barcelona" "Berlin"    "London"   
#> 
#> $Type
#> [1] "Standard" "One"     
#> 
#> $Year
#> [1] "2012" "2013" "2014" "2015" "2016"
#> 
#> $Active
#> [1] "True"

Data

x <- scan(text = "
Barcelona.Standard.2012.True
Berlin.One.2013.True
London.One|2014.True
Barcelona.Standard.2015.True
Berlin.One.2016.True", what = character())

dat <- data.frame(x)
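
Since the question stores these strings as column names rather than as a column of values, here is a minimal sketch of feeding colnames() into the same kind of split. The data frame toy and its contents are invented purely for illustration, and the approach assumes every name has the same number of fields.

```r
# Hypothetical data frame whose column names follow City.Type.Year.Active
nms <- c("Barcelona.Standard.2012.True",
  "Berlin.One.2013.True",
  "London.One.2014.True",
  "Barcelona.Standard.2015.True",
  "Berlin.One.2016.True")
toy <- as.data.frame(matrix(0, nrow = 2, ncol = length(nms)))
colnames(toy) <- nms

# One row per column name, one matrix column per field
m <- do.call(rbind, strsplit(colnames(toy), ".", fixed = TRUE))
colnames(m) <- c("City", "Type", "Year", "Active")

# Convert to a data frame so lapply walks over the fields and keeps their names
res <- lapply(as.data.frame(m, stringsAsFactors = FALSE), unique)
res
```

The resulting list has one element per field, like the output shown in the answers.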
