Elegant solution for casting (spreading) multiple columns of character vectors
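I want to convert a data frame of contact information into a list of municipalities in which the same kind of information, e.g. phone numbers, is currently spread over multiple columns.
I have tried both reshape2::dcast() and tidyr::spread(), but neither solved my problem. I also checked other Stack Overflow posts, e.g. "Multiple column spread", but have not found a working solution. It seems to me the problem should be fairly simple (and solvable with spread or dcast).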
library(tidyverse)  # tibble, dplyr, tidyr, stringr
library(reshape2)   # dcast

tmp <- tibble(municipality = c("M1", "M2"),
              name1 = c("n1", "n2"), name2 = c("n3", "n4"), name3 = c(NA, "n5"), # placeholder names
              phone1 = c("p1", "p2"), phone2 = c("p3", "p4"), phone3 = c(NA, "p5")) # placeholder phone numbers

# Solution 1
tmp %>% gather("colname", "value", -municipality) %>%
  filter(municipality == "M1") %>% # to simplify; should be replaced with group_by(municipality)
  na.omit() %>% mutate(colname = str_replace(colname, "\\d", replacement = "")) %>%
  spread(., key = "colname", value = "value")

# Solution 2
tmp %>% gather("colname", "value", -municipality) %>%
  filter(municipality == "M1") %>% # same as above
  na.omit() %>% mutate(colname = str_replace(colname, "\\d", replacement = "")) %>%
  dcast(municipality + value ~ colname)
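Solution 1 results in the following error:
Error: Each row of output must be identified by a unique combination of keys.
Solution 2 yields the following data frame (which is the desired result, except that it still needs to be collapsed):
  municipality value name phone
1           M1    n1   n1  <NA>
2           M1    n3   n3  <NA>
3           M1    p1 <NA>    p1
4           M1    p3 <NA>    p3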
Are you looking for this?
library(dplyr)
library(tidyr)
tmp %>%
gather(key, value, -municipality, na.rm = TRUE) %>%
  mutate(key = gsub("\\d+", "", key)) %>%
group_by(municipality, key) %>%
mutate(row = row_number()) %>%
spread(key, value) %>%
select(-row)
#  municipality name  phone
#  <chr>        <chr> <chr>
#1 M1           n1    p1
#2 M1           n3    p3
#3 M2           n2    p2
#4 M2           n4    p4
#5 M2           n5    p5
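We can use gather to bring the data into long format, dropping the NA values. We then remove the digits from the column names so that they share the same key, group_by municipality and key to create a row-index column, and spread the data back to wide format. The row index is what makes each combination of keys unique, so spread no longer fails.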
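To illustrate why the row index matters, here is a minimal sketch using the sample data from the question. The helper names long, indexed, dup_before and dup_after are made up for this illustration. It counts how often each key combination occurs before and after the index is added.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question
tmp <- tibble(municipality = c("M1", "M2"),
              name1 = c("n1", "n2"), name2 = c("n3", "n4"), name3 = c(NA, "n5"),
              phone1 = c("p1", "p2"), phone2 = c("p3", "p4"), phone3 = c(NA, "p5"))

# Long format with the digits stripped from the keys
long <- tmp %>%
  gather(key, value, -municipality, na.rm = TRUE) %>%
  mutate(key = gsub("\\d+", "", key))

# Without a row index, (municipality, key) combinations repeat,
# which is the situation spread() complains about
dup_before <- long %>% count(municipality, key) %>% filter(n > 1)

# With a per-group row index, (municipality, key, row) is unique
indexed <- long %>%
  group_by(municipality, key) %>%
  mutate(row = row_number()) %>%
  ungroup()
dup_after <- indexed %>% count(municipality, key, row) %>% filter(n > 1)

nrow(dup_before)  # number of repeated key combinations before indexing
nrow(dup_after)   # number of repeated key combinations after indexing
```

If dup_after has no rows, spread(key, value) has exactly one value per output cell.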
We can do this elegantly with pivot_longer from the development version of tidyr (since released in tidyr 1.0.0):
library(dplyr)
library(tidyr) # 0.8.3.9000
library(stringr)
tmp %>%
  # insert "_" before the trailing digits, e.g. name1 -> name_1
  rename_at(-1, ~str_replace(., "(\\d+$)", "_\\1")) %>%
  pivot_longer(cols = -municipality, names_to = c(".value", "group"),
               names_sep = "_", values_drop_na = TRUE) %>%
  select(-group)
# A tibble: 5 x 3
#  municipality name  phone
#  <chr>        <chr> <chr>
#1 M1           n1    p1
#2 M1           n3    p3
#3 M2           n2    p2
#4 M2           n4    p4
#5 M2           n5    p5
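As a variation on the same idea, the rename step can probably be skipped by letting pivot_longer split the column names with a regular expression. This is a sketch, not part of the original answer: it assumes tidyr 1.0.0 or later, and that every column other than municipality consists of a non-digit prefix followed by digits. The name res is only for illustration.

```r
library(tidyr)

# Same sample data as in the question
tmp <- tibble(municipality = c("M1", "M2"),
              name1 = c("n1", "n2"), name2 = c("n3", "n4"), name3 = c(NA, "n5"),
              phone1 = c("p1", "p2"), phone2 = c("p3", "p4"), phone3 = c(NA, "p5"))

res <- pivot_longer(tmp,
                    cols = -municipality,
                    # first capture group -> output column name (".value"),
                    # second capture group -> helper column "group"
                    names_to = c(".value", "group"),
                    names_pattern = "^(\\D+)(\\d+)$",
                    values_drop_na = TRUE)

# Drop the helper column
res <- res[, c("municipality", "name", "phone")]
res
```

names_pattern replaces the names_sep = "_" argument used above, so the column names do not have to be rewritten first.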
Another option is melt from data.table:
library(data.table)
melt(setDT(tmp), measure = patterns("^name", "^phone"),
value.name = c("name", "phone"), na.rm = TRUE)[, variable := NULL][]
#   municipality name phone
#1:           M1   n1    p1
#2:           M2   n2    p2
#3:           M1   n3    p3
#4:           M2   n4    p4
#5:           M2   n5    p5