R: How to tidy up data with var-val pairs concatenated in a single column
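I already tried to address this problem on SO and got a good answer, but I realized it was only a partial solution to what I believe is a general problem: data are often organized with the (apparently most interesting) variables in one column each, followed by a final column in which several variable-value pairs are lumped together. I have been struggling to find a general way to convert the variables in that last column into separate columns. Shouldn't tidying data like this be tidyr's job?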
require(dplyr)
require(stringr)
data <-
data.frame(
shoptype=c("A","B","B"),
city=c("bah", "bah", "slah"),
sale=c("type cheese; price 200", "type ham; price 150","type cheese; price 100" )) %>%
tbl_df()
> data
Source: local data frame [3 x 3]
shoptype city sale
1 A bah type cheese; price 200
2 B bah type ham; price 150
3 B slah type cheese; price 100
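Here we have information on some shops in some cities, with a concatenated column in which the variables are separated by ";" and each variable is separated from its value by a space.
One would like output like this: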
shoptype city type price
1 A bah cheese 200
2 B bah ham 150
3 B slah cheese 100
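To make the encoding of the sale column concrete, here is a minimal base-R sketch that parses a single string into a named vector. The helper name `parse_pairs` is hypothetical, and the sketch assumes pairs are separated by ";" and that the variable name is the first space-delimited token of each pair.

```r
# Hypothetical helper: split one "var val; var val" string into a named vector
parse_pairs <- function(x) {
  pairs <- trimws(strsplit(x, ";\\s*")[[1]])  # one "var val" string per pair
  vars  <- sub(" .*$", "", pairs)             # text before the first space
  vals  <- sub("^[^ ]+ ", "", pairs)          # text after the first space
  setNames(vals, vars)
}

parsed <- parse_pairs("type cheese; price 200")
```

Each name in `parsed` is a variable that should end up as its own column.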
This can be done when every row is uniquely identified by the other columns (see the linked SO question):
require(plyr)
require(dplyr)
require(stringr)
require(tidyr)
data %>%
mutate(sale = str_split(as.character(sale), "; ")) %>%
unnest(sale) %>%
mutate(sale = str_trim(sale)) %>%
separate(sale, into = c("var", "val")) %>%
spread(var, val)
However, if we change the shop type of the second row to "A", we get an error. Like this:
data2 <-
data.frame(
shoptype=c("A","A","B"),
city=c("bah", "bah", "slah"),
sale=c("type cheese; price 200", "type ham; price 150","type cheese; price 100" )) %>%
tbl_df()
data2 %>%
mutate(sale = str_split(as.character(sale), "; ")) %>%
unnest(sale) %>%
mutate(sale = str_trim(sale)) %>%
separate(sale, into = c("var", "val")) %>%
spread(var, val)
Error: Duplicate identifiers for rows (2, 4), (1, 3)
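The error seems to come from the fact that, once the pairs are unnested, the combination of shoptype, city and var no longer identifies a single row. As a rough illustration, the sketch below writes the long form out by hand (the data frame `long` is a reconstruction for this example, not output from the pipeline) and counts the rows per key in base R.

```r
# Long form of data2 after splitting, written out by hand for illustration
long <- data.frame(
  shoptype = c("A", "A", "A", "A", "B", "B"),
  city     = c("bah", "bah", "bah", "bah", "slah", "slah"),
  var      = c("type", "price", "type", "price", "type", "price"),
  val      = c("cheese", "200", "ham", "150", "cheese", "100"),
  stringsAsFactors = FALSE
)

# Count how many rows share each (shoptype, city, var) key
key_counts <- aggregate(val ~ shoptype + city + var, data = long, FUN = length)
names(key_counts)[names(key_counts) == "val"] <- "n"

# Keys with n > 1 are the ones spread() cannot place in a single cell
dupes <- key_counts[key_counts$n > 1, ]
```

Both duplicated keys belong to shop type "A" in "bah", which matches the row pairs (1, 3) and (2, 4) named in the error message.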
I tried to fix this with a unique id (again, see the linked SO answer):
data2 %>%
mutate(sale = str_split(as.character(sale), "; ")) %>%
unnest(sale) %>%
mutate(sale = str_trim(sale),
v0=rownames(.)) %>%
separate(sale, into = c("var", "val")) %>%
spread(var, val)
Source: local data frame [6 x 5]
shoptype city v0 price type
1 A bah 1 NA cheese
2 A bah 2 200 NA
3 A bah 3 NA ham
4 A bah 4 150 NA
5 B slah 5 NA cheese
6 B slah 6 100 NA
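This yields structurally missing data, and I do not know how to collapse these rows into the desired output described above.
I think I am really missing something within the scope of tidyr (I hope!).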
Add a secondary ID before splitting:
data2 %>%
group_by(shoptype, city) %>%
mutate(id2 = sequence(n())) %>%
mutate(sale = str_split(as.character(sale), "; ")) %>%
unnest(sale) %>%
mutate(sale = str_trim(sale)) %>%
separate(sale, into = c("var", "val")) %>%
spread(var, val)
# Source: local data frame [3 x 5]
#
# shoptype city id2 price type
# 1 A bah 1 200 cheese
# 2 A bah 2 150 ham
# 3 B slah 1 100 cheese
If you use some functions from my "splitstackshape" package, the code can be made more compact:
library(splitstackshape)
as.data.frame(data2) %>%
getanID(c("shoptype", "city")) %>%
cSplit("sale", ";", "long") %>%
cSplit("sale", " ") %>%
spread(sale_1, sale_2)
# shoptype city .id price type
# 1: A bah 1 200 cheese
# 2: A bah 2 150 ham
# 3: B slah 1 100 cheese
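The secondary-ID idea can also be sketched with newer tidyr verbs. This is only a sketch, assuming tidyr 1.0 or later, where `pivot_wider()` supersedes `spread()` and `separate_rows()` replaces the `str_split()` plus `unnest()` step. The `extra = "merge"` argument is an addition here so that values containing spaces would stay in one piece.

```r
library(dplyr)
library(tidyr)

data2 <- data.frame(
  shoptype = c("A", "A", "B"),
  city     = c("bah", "bah", "slah"),
  sale     = c("type cheese; price 200", "type ham; price 150",
               "type cheese; price 100"),
  stringsAsFactors = FALSE
)

wide <- data2 %>%
  group_by(shoptype, city) %>%
  mutate(id2 = row_number()) %>%            # secondary ID within each group
  ungroup() %>%
  separate_rows(sale, sep = ";\\s*") %>%    # one var-val pair per row
  separate(sale, into = c("var", "val"), sep = " ", extra = "merge") %>%
  pivot_wider(names_from = var, values_from = val)
```

As in the `spread()` version, `price` comes back as character and `id2` stays in the result; both can be dealt with afterwards if needed.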
I don't think tidyr::unnest and tidyr::gather are necessary. Here is an alternative solution focused on stringr::str_replace and tidyr::separate:
library(dplyr)
library(stringr)
library(tidyr)
data2 %>%
mutate(
sale = str_replace(sale, "type ", ""),
sale = str_replace(sale, " price ", "")
) %>%
separate(sale, into = c("type", "price"), sep = ";")
# Source: local data frame [3 x 4]
# shoptype city type price
# 1 A bah cheese 200
# 2 A bah ham 150
# 3 B slah cheese 100
There are two great answers above, but I think this is a nice case for extract:
data2 %>%
extract(sale, c("type", "price"), "type (.+); price (.+)", convert = TRUE)
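To see what the two capture groups in that pattern pick up, here is a minimal base-R sketch using `regexec()` and `regmatches()` on the same regular expression. The `type.convert()` call is meant to roughly mirror what `convert = TRUE` does; the variable names are only for illustration.

```r
sale <- c("type cheese; price 200", "type ham; price 150",
          "type cheese; price 100")
pattern <- "type (.+); price (.+)"

# regexec() yields the full match followed by one entry per capture group
m <- regmatches(sale, regexec(pattern, sale))

# Drop the full match (first element) and bind the groups into columns
parts <- do.call(rbind, lapply(m, function(x) x[-1]))

result <- data.frame(
  type  = parts[, 1],
  price = type.convert(parts[, 2], as.is = TRUE),  # character to integer
  stringsAsFactors = FALSE
)
```

Like the `extract()` call, this depends on "type" always coming before "price" in the string, so it is less general than the split-and-spread approach.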