tidyr::pivot_longer to merge two groups of columns, each group with a different name prefix

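I have a data frame of deforestation data for 543 survey sites. It has 20 columns holding the values for the years 2001-2020 (X1, X2, etc.) and another 20 columns measuring population density for the same years (columns pop01, pop02, etc.).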

> str(grid10b )
'data.frame':   543 obs. of  45 variables:
 $ X1       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ X2       : num  0.000889 0.000119 0.002048 0.00066 0.003605 ...
 $ X3       : num  0.004645 0.000612 0.007276 0.002608 0.003475 ...
 $ X4       : num  6.70e-04 8.07e-05 1.99e-03 1.19e-03 1.89e-03 ...
 $ X5       : num  0.001447 0.000183 0.00314 0.001687 0 ...
 $ X6       : num  0.000659 0.000115 0.002078 0.001113 0.000869 ...

...and so on. I can already merge the deforestation columns, thanks to an earlier answer.

The current code is:

grid10a <- grid10a %>%
  tidyr::pivot_longer(cols = starts_with('X'), values_to = 'def') %>%
  group_by(id) %>%
  mutate(tstart = row_number(),
         tstop = tstart+1) %>%
  select(-name) # otherwise there's a column with X1, X2 etc which isn't needed

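...which merges the 20 columns of deforestation values into a single column 'def' and gives 20 rows for each site ID. So far so good.

But how do I merge the population density columns? I just need them added as a 'population' column, since they are in the same year order as the values I have just tidied. I need to line up the values of X1 with pop01, X2 with pop02, and so on.

Next I tried this: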

grid10c <- grid10b %>%
  tidyr::pivot_longer(cols = starts_with('pop'), values_to = 'popn') %>% group_by(id2)

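...but ended up with a data frame of 228,060 rows! The solution must be something like the first answer here: Reshaping multiple sets of measurement columns (wide format) into single columns (long format)

...but the usage of 'names_to' and 'names_sep' isn't really explained there.

Here is a dummy example of the type of data structure I have (df1) and the type I want to build (df2):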

df1 <- data.frame(ID = seq(1, 543),
                  X1 = runif(543, 0, 1),
                  X2 = runif(543, 0, 1),
                  X3 = runif(543, 0, 1),
                  X4 = runif(543, 0, 1),
                  X5 = runif(543, 0, 1),
                  X6 = runif(543, 0, 1),
                  X7 = runif(543, 0, 1),
                  X8 = runif(543, 0, 1),
                  X9 = runif(543, 0, 1),
                  X10 = runif(543, 0, 1),
                  X11 = runif(543, 0, 1),
                  X12 = runif(543, 0, 1),
                  X13 = runif(543, 0, 1),
                  X14 = runif(543, 0, 1),
                  X15 = runif(543, 0, 1),
                  X16 = runif(543, 0, 1),
                  X17 = runif(543, 0, 1),
                  X18 = runif(543, 0, 1),
                  X19 = runif(543, 0, 1),
                  X20 = runif(543, 0, 1),
                  pop01 = runif(543, 0, 100),
                  pop02 = runif(543, 0, 100),
                  pop03 = runif(543, 0, 100),
                  pop04 = runif(543, 0, 100),
                  pop05 = runif(543, 0, 100),
                  pop06 = runif(543, 0, 100),
                  pop07 = runif(543, 0, 100),
                  pop08 = runif(543, 0, 100),
                  pop09 = runif(543, 0, 100),
                  pop10 = runif(543, 0, 100),
                  pop11 = runif(543, 0, 100),
                  pop12 = runif(543, 0, 100),
                  pop13 = runif(543, 0, 100),
                  pop14 = runif(543, 0, 100),
                  pop15 = runif(543, 0, 100),
                  pop16 = runif(543, 0, 100),
                  pop17 = runif(543, 0, 100),
                  pop18 = runif(543, 0, 100),
                  pop19 = runif(543, 0, 100),
                  pop20 = runif(543, 0, 100))
df2 <- data.frame(ID = rep(1:543,each = 20),
                  def = runif(10860, 0, 1),
                  popn = runif(10860 , 0, 100))

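Since you want two new "long" columns, named after the two different measurement types, you need .value in names_to.

The trickiest part (for me) is then defining names_pattern to tell R how to create the new column names. Here the column names are based on the string starting with X or pop, and the trailing digits are put in a year column. I convert these to integer in names_transform to get around the mismatch in the numbers in the column names (e.g., X1 vs. pop01).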
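To see how that pattern splits the names, the same regular expression can be applied by hand in base R. This is only an illustration of the split, not a description of how tidyr works internally; the object names nms, parts, value_col and year are made up for the sketch.

```r
# A few column names of the kind used in the question
nms <- c("X1", "X10", "pop01", "pop10")

# Apply the same pattern as names_pattern; each element of `parts` holds
# the full match followed by the two capture groups
parts <- regmatches(nms, regexec("(X|pop)(.*)", nms))

# First group: the prefix that becomes the new column name (the ".value" part)
value_col <- vapply(parts, function(p) p[2], character(1))

# Second group: the trailing digits; as.integer() makes "1" and "01" agree
year <- as.integer(vapply(parts, function(p) p[3], character(1)))

data.frame(name = nms, value_col = value_col, year = year)
```

Because "1" and "01" both become the integer 1, X1 and pop01 end up in the same row for each ID.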

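I shrank your example a bit so the result is easier to see, but it works just the same with more columns.

The total number of rows is the number of IDs * the number of "years".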

library(tidyr)

set.seed(16)
df1 <- data.frame(ID = seq(1, 4),
                  X1 = runif(4, 0, 1),
                  X2 = runif(4, 0, 1),
                  X3 = runif(4, 0, 1),
                  X4 = runif(4, 0, 1),
                  X5 = runif(4, 0, 1),
                  X6 = runif(4, 0, 1),
                  X7 = runif(4, 0, 1),
                  X8 = runif(4, 0, 1),
                  X9 = runif(4, 0, 1),
                  X10 = runif(4, 0, 1),
                  pop01 = runif(4, 0, 100),
                  pop02 = runif(4, 0, 100),
                  pop03 = runif(4, 0, 100),
                  pop04 = runif(4, 0, 100),
                  pop05 = runif(4, 0, 100),
                  pop06 = runif(4, 0, 100),
                  pop07 = runif(4, 0, 100),
                  pop08 = runif(4, 0, 100),
                  pop09 = runif(4, 0, 100),
                  pop10 = runif(4, 0, 100))


pivot_longer(df1, cols = -ID, 
             names_to = c(".value", "year"),
             names_pattern = "(X|pop)(.*)",
             names_transform = list(year = as.integer))
#> # A tibble: 40 x 4
#>       ID  year     X   pop
#>    <int> <int> <dbl> <dbl>
#>  1     1     1 0.683  4.97
#>  2     1     2 0.864 42.8 
#>  3     1     3 0.874 19.3 
#>  4     1     4 0.157 85.4 
#>  5     1     5 0.847 80.0 
#>  6     1     6 0.968 25.1 
#>  7     1     7 0.228 47.2 
#>  8     1     8 0.765 15.3 
#>  9     1     9 0.718 70.7 
#> 10     1    10 0.294 61.1 
#> # ... with 30 more rows

Created on 2021-07-07 by the reprex package (v2.0.0)

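If you don't need the "year" column, you can remove it via dplyr::select(). You can rename your two new columns via dplyr::rename(). Alternatively, you could change X to something more meaningful prior to converting to long. For example, using: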

names(df1) <- sub("X", "def", names(df1))
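Putting those optional steps together, a sketch of the whole workflow might look like the following. It assumes a small made-up data frame with two years per site, takes the output names def and popn from the question, and uses a hypothetical object name long for the result; note that names_pattern has to match the new def prefix once the columns are renamed.

```r
library(tidyr)
library(dplyr)

# Small made-up example: 4 sites, 2 years of each measurement
set.seed(16)
df1 <- data.frame(ID = 1:4,
                  X1 = runif(4), X2 = runif(4),
                  pop01 = runif(4, 0, 100), pop02 = runif(4, 0, 100))

# Rename the X prefix before pivoting so the value column is called def
names(df1) <- sub("^X", "def", names(df1))

long <- df1 %>%
  pivot_longer(cols = -ID,
               names_to = c(".value", "year"),
               names_pattern = "(def|pop)(.*)",
               names_transform = list(year = as.integer)) %>%
  rename(popn = pop) %>%  # column name asked for in the question
  select(-year)           # drop year if it is not needed

nrow(long)  # number of IDs * number of years
```

Anchoring the substitution with "^X" is a small precaution so that only a leading X is replaced.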