Stacking/melting multiple columns into multiple columns in R
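I am trying to melt/stack/gather several specific columns of a data frame into 2 columns while keeping all the other columns.
I have tried many, many answers on Stack Overflow without success (some of them are below). My situation is basically the same as in this post:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
only with more columns to keep and combine. It is important to mention that my year columns are factors, and that I have many more columns than the example below shows, so I want to refer to columns by name rather than by position.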
>df
ID Code Country year.x value.x year.y value.y year.x.x value.x.x
1 A USA 2000 34.33422 2001 35.35241 2002 42.30042
1 A Spain 2000 34.71842 2001 39.82727 2002 43.22209
3 B USA 2000 35.98180 2001 37.70768 2002 44.40232
3 B Peru 2000 33.00000 2001 37.66468 2002 41.30232
4 C Argentina 2000 37.78005 2001 39.25627 2002 45.72927
4 C Peru 2000 40.52575 2001 40.55918 2002 46.62914
Following the post above, I tried pivot_longer from tidyr, since my case looks similar. Depending on what I changed, this gave various errors:
pivot_longer(df,
cols = -c(ID, Code, Country),
names_to = c(".value", "group"),
names_sep = ".")
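I have also played with melt from reshape2 in various ways, but these attempts melt either only the value columns or only the year columns. For example: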
new.df <- reshape2:::melt(df, id.var = c("ID", "Code", "Country"), measure.vars=c("value.x", "value.y", "value.x.x", "value.y.y", "value.x.x.x", "value.y.y.y"), value.name = "value", variable.vars=c('year.x','year.y', "year.x.x", "year.y.y", "year.x.x.x", "year.y.y.y", "value.x", variable.name = "year")
I have also tried tidyr's gather based on other posts, but I find the help pages and the posts hard to follow.
To be clear, this is what I want to achieve:
ID Code Country year value
1 A USA 2000 34.33422
1 A Spain 2000 34.71842
3 B USA 2000 35.98180
3 B Peru 2000 33.00000
4 C Argentina 2000 37.78005
4 C Peru 2000 40.52575
1 A USA 2001 35.35241
1 A Spain 2001 39.82727
3 B USA 2001 37.70768
3 B Peru 2001 37.66468
4 C Argentina 2001 39.25627
4 C Peru 2001 40.55918
1 A USA 2002 42.30042
etc.
Any help here would be greatly appreciated.
We can specify names_pattern:
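library(tidyr)
library(dplyr)
# The text before the dot names the output column (.value);
# the text after the dot is stored in "group".
df %>%
  pivot_longer(cols = -c(ID, Code, Country),
               names_to = c(".value", "group"),
               names_pattern = "(.*)\\.(.*)")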
Or use names_sep with an escaped ., as described in ?pivot_longer:
names_sep - names_sep takes the same specification as separate(), and can either be a numeric vector (specifying positions to break on), or a single string (specifying a regular expression to split on).
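This means a string passed to names_sep is interpreted as a regular expression by default, and in a regular expression . matches any character rather than a literal dot. To match a literal dot, escape it or put it inside square brackets.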
pivot_longer(df,
             cols = -c(ID, Code, Country),
             names_to = c(".value", "group"),
             names_sep = "\\.")  # escaped dot; "[.]" would also work
# A tibble: 18 x 6
# ID Code Country group year value
# <int> <chr> <chr> <chr> <int> <dbl>
# 1 1 A USA x 2000 34.3
# 2 1 A USA y 2001 35.4
# 3 1 A USA z 2002 42.3
# 4 1 A Spain x 2000 34.7
# 5 1 A Spain y 2001 39.8
# 6 1 A Spain z 2002 43.2
# 7 3 B USA x 2000 36.0
# 8 3 B USA y 2001 37.7
# 9 3 B USA z 2002 44.4
#10 3 B Peru x 2000 33
#11 3 B Peru y 2001 37.7
#12 3 B Peru z 2002 41.3
#13 4 C Argentina x 2000 37.8
#14 4 C Argentina y 2001 39.3
#15 4 C Argentina z 2002 45.7
#16 4 C Peru x 2000 40.5
#17 4 C Peru y 2001 40.6
#18 4 C Peru z 2002 46.6
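To see why the unescaped separator fails, here is a minimal base R sketch that splits a single column name with strsplit(). It only illustrates how the pattern is read as a regular expression; it is not how tidyr splits names internally, and the variable names are made up for the example.

```r
# One column name from df, used only for illustration.
nm <- "year.x"

# Unescaped dot: every character counts as a separator,
# so only empty strings are left over.
unescaped <- strsplit(nm, ".")[[1]]

# Three ways to match a literal dot instead.
escaped   <- strsplit(nm, "\\.")[[1]]             # backslash escape
bracketed <- strsplit(nm, "[.]")[[1]]             # character class
fixed     <- strsplit(nm, ".", fixed = TRUE)[[1]] # no regex at all

print(unescaped)
print(escaped)
```

The escaped and bracketed forms are the ones that carry over to names_sep, since it expects a regular expression.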
Update
For the updated dataset:
library(stringr)
df2 %>%
  # drop the second dot so that e.g. year.x.x becomes year.xx
  rename_at(vars(matches("year|value")), ~
    str_replace(., "^([^.]+\\.[^.]+)\\.([^.]+)$", "\\1\\2")) %>%
  pivot_longer(cols = -c(ID, Code, Country),
               names_to = c(".value", "group"),
               names_pattern = "(.*)\\.(.*)")
Or, without the rename, use a regex lookbehind:
df2 %>%
  pivot_longer(cols = -c(ID, Code, Country),
               names_to = c(".value", "group"),
               # split only at a dot that directly follows "year" or "value"
               names_sep = "(?<=year|value)\\.")
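The difference between the two patterns on names like year.x.x can be checked in base R. This is a rough sketch: tidyr may use a different regex engine internally, so it only shows what each pattern matches, and the variable names are made up for the example.

```r
# Column names from df2 that have to be split into (.value, group).
nms <- c("year.x", "value.x", "year.y", "value.y", "year.x.x", "value.x.x")

# Greedy pattern: the first (.*) takes as much as it can, so the split
# lands on the LAST dot and "year.x.x" yields the prefix "year.x".
greedy_prefix <- sub("(.*)\\.(.*)", "\\1", nms)

# Lookbehind: only a dot directly preceded by "year" or "value" matches,
# so the split lands on the FIRST dot. perl = TRUE enables lookbehind.
parts  <- strsplit(nms, "(?<=year|value)\\.", perl = TRUE)
prefix <- vapply(parts, `[`, character(1), 1)
suffix <- vapply(parts, `[`, character(1), 2)

print(data.frame(name = nms, greedy_prefix, prefix, suffix))
```

With the greedy pattern, year.x.x would produce a separate year.x output column, which is why the first variant renames the columns before pivoting.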
Data
df <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.z = c(2002L, 2002L, 2002L, 2002L, 2002L,
2002L), value.z = c(42.30042, 43.22209, 44.40232, 41.30232, 45.72927,
46.62914)), class = "data.frame", row.names = c(NA, -6L))
df2 <- structure(list(ID = c(1L, 1L, 3L, 3L, 4L, 4L), Code = c("A",
"A", "B", "B", "C", "C"), Country = c("USA", "Spain", "USA",
"Peru", "Argentina", "Peru"), year.x = c(2000L, 2000L, 2000L,
2000L, 2000L, 2000L), value.x = c(34.33422, 34.71842, 35.9818,
33, 37.78005, 40.52575), year.y = c(2001L, 2001L, 2001L, 2001L,
2001L, 2001L), value.y = c(35.35241, 39.82727, 37.70768, 37.66468,
39.25627, 40.55918), year.x.x = c(2002L, 2002L, 2002L, 2002L,
2002L, 2002L), value.x.x = c(42.30042, 43.22209, 44.40232, 41.30232,
45.72927, 46.62914)), class = "data.frame", row.names = c(NA,
-6L))