Extract character list values from data.frame rows and reshape data
I have a variable x in which each row holds a comma-separated list of characters:
dat <- data.frame(id = c(rep('a',2),rep('b',2),'c'),
                  x = c('f,o','f,o,o','b,a,a,r','b,a,r','b,a'),
                  stringsAsFactors = FALSE)
I want to reshape the data so that each row is a unique (id, x) pair, for example:
dat2 <- data.frame(id = c(rep('a',2),rep('b',3),rep('c',2)),
x = c('f','o','a','b','r','a','b'))
> dat2
id x
1 a f
2 a o
3 b a
4 b b
5 b r
6 c a
7 c b
I tried to do this by splitting the character lists and keeping only the unique values in each row:
dat$x <- sapply(strsplit(dat$x, ','), sort)
dat$x <- sapply(dat$x, unique)
dat <- unique(dat)
> dat
id x
1 a f, o
3 b a, b, r
5 c a, b
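However, I am not sure how to go on and turn the per-row lists into individual rows.
How would I do this? Or is there a more efficient way to convert the comma-separated strings and reshape the data as described?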
You can use tidytext::unnest_tokens:
library(tidytext)
library(dplyr)
dat %>%
unnest_tokens(x1, x) %>%
distinct()
id x1
1 a f
2 a o
3 b b
4 b a
5 b r
6 c b
7 c a
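Note that unnest_tokens splits into words and lowercases them by default, which happens to suit this data. A minimal sketch, assuming the values should be split strictly on commas with their case preserved, passes token = "regex" with a pattern and sets to_lower = FALSE; the name `long` is only illustrative:

```r
library(tidytext)
library(dplyr)

dat <- data.frame(id = c(rep('a', 2), rep('b', 2), 'c'),
                  x = c('f,o', 'f,o,o', 'b,a,a,r', 'b,a,r', 'b,a'),
                  stringsAsFactors = FALSE)

# Split x on commas only, keep the original case, then drop duplicate pairs
long <- dat %>%
  unnest_tokens(x1, x, token = "regex", pattern = ",", to_lower = FALSE) %>%
  distinct()
```

With the sample data this should give the same seven (id, x1) pairs as above; the difference only shows up when values contain upper-case letters, spaces, or punctuation.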
One solution is to use splitstackshape::cSplit to split column x into multiple columns, then use gather and filter to get the desired output.
library(tidyverse)
library(splitstackshape)
dat %>% cSplit("x", sep=",") %>%
mutate_if(is.factor, as.character) %>%
gather(key, value, -id) %>%
filter(!is.na(value)) %>%
select(-key) %>% unique()
# id value
# 1 a f
# 3 b b
# 5 c b
# 6 a o
# 8 b a
# 10 c a
# 13 b r
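As a possible shortcut, cSplit also accepts direction = "long", which stacks the split values into rows directly and so avoids the gather step. The following is a sketch under that assumption; converting to a plain data.frame and coercing x with as.character are defensive steps added here, not part of the answer above:

```r
library(splitstackshape)

dat <- data.frame(id = c(rep('a', 2), rep('b', 2), 'c'),
                  x = c('f,o', 'f,o,o', 'b,a,a,r', 'b,a,r', 'b,a'),
                  stringsAsFactors = FALSE)

# Split x on commas and stack the pieces into rows (one value per row)
long <- as.data.frame(cSplit(dat, "x", sep = ",", direction = "long"))

# Make sure both columns are plain character vectors before de-duplicating
long$id <- as.character(long$id)
long$x <- as.character(long$x)

result <- unique(long)
```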
A base R solution:
temp <- do.call(rbind, apply( dat, 1,
function(z){ data.frame(
id=z[1],
x = scan(text=z['x'], what="",sep=","),
stringsAsFactors=FALSE)} ) )
Read 2 items
Read 3 items
Read 4 items
Read 3 items
Read 2 items
Warning messages:
1: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
2: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
3: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
4: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
5: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
temp[!duplicated(temp),]
#------
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
To suppress all messages and warnings:
temp <- do.call(rbind, apply( dat, 1,
function(z){ suppressWarnings(data.frame(id=z[1],
x = scan(text=z['x'], what="",sep=",", quiet=TRUE), stringsAsFactors=FALSE)
)} ) )
temp[!duplicated(temp),]
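The warnings come from z[1] being a named length-one vector, and the "Read n items" messages come from scan. A sketch of the same row-by-row idea that sidesteps both, assuming it is acceptable to swap apply and scan for lapply over row indices and strsplit (the name `pieces` is illustrative):

```r
dat <- data.frame(id = c(rep('a', 2), rep('b', 2), 'c'),
                  x = c('f,o', 'f,o,o', 'b,a,a,r', 'b,a,r', 'b,a'),
                  stringsAsFactors = FALSE)

# Build one small data.frame per input row; dat$id[i] is unnamed,
# so data.frame() recycles it without a row-name warning
pieces <- lapply(seq_len(nrow(dat)), function(i) {
  data.frame(id = dat$id[i],
             x = strsplit(dat$x[i], ",", fixed = TRUE)[[1]],
             stringsAsFactors = FALSE)
})

# Stack the pieces and keep the first occurrence of each (id, x) pair
temp <- do.call(rbind, pieces)
result <- temp[!duplicated(temp), ]
```

This keeps the original row names of the retained rows (1, 2, 6, 7, 9, 13, 14), as in the output above.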
A base R approach in two lines is
#get list of X potential vars
x <- strsplit(dat$x, ",")
# construct full data.frame, then use unique to return desired rows
unique(data.frame(id=rep(dat$id, lengths(x)), x=unlist(x)))
This returns
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
If you don't want to write out the variable names yourself, you can use setNames.
setNames(unique(data.frame(rep(dat$id, lengths(x)), unlist(x))), names(dat))
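The two-line approach can be wrapped in a small helper that also sorts the result and resets the row names, so that it matches the layout of dat2 in the question. This is only a sketch; the function name `split_unique` and its arguments are hypothetical:

```r
# Hypothetical helper: one row per unique (id, value) pair, sorted
split_unique <- function(df, id_col = "id", list_col = "x", sep = ",") {
  parts <- strsplit(df[[list_col]], sep, fixed = TRUE)
  # Repeat each id once per value found in its row
  long <- data.frame(rep(df[[id_col]], lengths(parts)),
                     unlist(parts),
                     stringsAsFactors = FALSE)
  names(long) <- c(id_col, list_col)
  long <- unique(long)
  # Order by id, then by value, and renumber the rows from 1
  long <- long[order(long[[id_col]], long[[list_col]]), ]
  rownames(long) <- NULL
  long
}

dat <- data.frame(id = c(rep('a', 2), rep('b', 2), 'c'),
                  x = c('f,o', 'f,o,o', 'b,a,a,r', 'b,a,r', 'b,a'),
                  stringsAsFactors = FALSE)

res <- split_unique(dat)
```

Sorting is optional; drop the order() line if the order of first appearance should be kept.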
We can use separate_rows
library(tidyverse)
dat %>%
separate_rows(x) %>%
distinct()
# id x
#1 a f
#2 a o
#3 b b
#4 b a
#5 b r
#6 c b
#7 c a
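By default separate_rows splits on any run of non-alphanumeric characters, so values containing spaces or hyphens would be broken up as well. A minimal sketch, assuming only commas should act as separators, sets sep explicitly (the name `long` is illustrative):

```r
library(tidyr)
library(dplyr)

dat <- data.frame(id = c(rep('a', 2), rep('b', 2), 'c'),
                  x = c('f,o', 'f,o,o', 'b,a,a,r', 'b,a,r', 'b,a'),
                  stringsAsFactors = FALSE)

# Split x on commas only, one value per row, then drop duplicate pairs
long <- dat %>%
  separate_rows(x, sep = ",") %>%
  distinct()
```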