Rearranging information from a data frame in R
I have the following df, which was obtained from an Excel file:
df1 <- data.frame( Colour = c("Green","Red","Blue"),
Code = c("N","U", "U"),
User1 = c("John","Brad","Peter"),
User2 = c("Meg","Meg","John"),
User3= c("", "Lucy", ""))
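I need to rearrange it into a data frame in which every name is listed in the first column (only once) and the colours (with their respective codes) appear in the columns that follow, like this: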
df2 <- data.frame(User=c("John","Brad","Peter","Meg","Lucy"),
Color1 = c("Green","Red","Blue","Green","Red"),
Code1 = c("N","U","U","N","U"),
Color2=c("Blue","","","Red",""),
Code2=c("U","","","U",""))
I would really appreciate your help. Many thanks,
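We can use dcast from the devel version of data.table, i.e. v1.9.5+, which can take multiple value.var columns. We convert the data.frame to a data.table (setDT(df1)), melt the data with 'Colour' and 'Code' as the id columns, keep only the rows where 'User' is not '' ([User!='']), create a sequence number grouped by the 'User' column, and then dcast. See the data.table installation instructions for how to install the devel version.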
library(data.table)#v1.9.5+
dcast(melt(setDT(df1), id.var=c('Colour', 'Code'),
value.name='User')[User!=''][,
N:=1:.N, User], User~N, value.var=c('Colour', 'Code'))
# User 1_Colour 2_Colour 1_Code 2_Code
#1: Brad Red NA U NA
#2: John Green Blue N U
#3: Lucy Red NA U NA
#4: Meg Green Red N U
#5: Peter Blue NA U NA
Or, as @Arun mentioned in the comments, we can use the subset argument of dcast instead of [User!='']:
dcast(melt(setDT(df1), id.var=c('Colour', 'Code'),
value.name='User')[,N:= 1:.N, User],
subset=.(User !=''), User~N, value.var=c('Colour', 'Code'))
# User 1_Colour 2_Colour 1_Code 2_Code
#1: Brad Red NA U NA
#2: John Green Blue N U
#3: Lucy Red NA U NA
#4: Meg Green Red N U
#5: Peter Blue NA U NA
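The chained call above can be unpacked into separate steps, which makes it easier to inspect each intermediate table. This is only a sketch of the same melt / filter / number / dcast sequence. It assumes a released data.table that supports multiple value.var columns (1.9.6 or later), and it uses as.data.table() rather than setDT() so that df1 itself is not modified by reference. The names long and wide are introduced here for illustration. The exact names of the value columns depend on the data.table version, so the code does not rely on them.

```r
library(data.table)

df1 <- data.frame(
  Colour = c("Green", "Red", "Blue"),
  Code   = c("N", "U", "U"),
  User1  = c("John", "Brad", "Peter"),
  User2  = c("Meg", "Meg", "John"),
  User3  = c("", "Lucy", ""),
  stringsAsFactors = FALSE
)

# Step 1: go long, one row per (Colour, Code, User) combination
long <- melt(as.data.table(df1), id.vars = c("Colour", "Code"),
             value.name = "User")

# Step 2: drop the empty placeholders
long <- long[User != ""]

# Step 3: number each user's rows 1, 2, ... in order of appearance
long[, N := seq_len(.N), by = User]

# Step 4: go wide, spreading both Colour and Code over N
wide <- dcast(long, User ~ N, value.var = c("Colour", "Code"))
print(wide)
```

Printing long after step 3 shows why the sequence number is needed: John and Meg each have two rows, and N is what tells dcast which colour goes in the first pair of columns and which in the second.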
I hesitated to post this because it is conceptually similar to @akrun's answer, but you can also use merged.stack from my "splitstackshape" package together with reshape from base R.
library(splitstackshape)
reshape(
getanID(
merged.stack(df1, var.stubs = "User", sep = "var.stubs")[User != ""],
"User"),
direction = "wide", idvar = "User", timevar = ".id", drop = ".time_1")
# User Colour.1 Code.1 Colour.2 Code.2
# 1: Peter Blue U NA NA
# 2: John Blue U Green N
# 3: Meg Green N Red U
# 4: Brad Red U NA NA
# 5: Lucy Red U NA NA
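merged.stack makes the data long, getanID creates an ID variable to use when going to the wide format, and reshape does the actual transformation from this intermediate long form to a wide form.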
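The three steps just described can also be written one at a time. The sketch below uses the same calls and arguments as the nested version, only assigned to intermediate variables. The names long and wide are chosen here for illustration, and it assumes the installed splitstackshape behaves as in the answer (in particular that the helper columns are called .time_1 and .id).

```r
library(splitstackshape)

df1 <- data.frame(
  Colour = c("Green", "Red", "Blue"),
  Code   = c("N", "U", "U"),
  User1  = c("John", "Brad", "Peter"),
  User2  = c("Meg", "Meg", "John"),
  User3  = c("", "Lucy", ""),
  stringsAsFactors = FALSE
)

# 1. Stack User1..User3 into a single User column (long form)
long <- merged.stack(df1, var.stubs = "User", sep = "var.stubs")

# 2. Drop the rows that came from empty cells
long <- long[User != ""]

# 3. Add an .id column that counts the rows within each User
long <- getanID(long, "User")

# 4. Spread Colour and Code over .id, dropping the stacking index
wide <- reshape(long, direction = "wide", idvar = "User",
                timevar = ".id", drop = ".time_1")
print(wide)
```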
This is the best I could come up with for "dplyr" + "tidyr" users. It looks verbose, but it should not be too hard to follow:
library(dplyr)
library(tidyr)
df1 %>%
  gather(var, User, User1:User3) %>%       # Get the data into a long form
  filter(User != "") %>%                   # Drop empty rows
  select(-var) %>%                         # Drop the User1/User2/User3 label so each user ends up on one row
  group_by(User) %>%                       # Group by User
  mutate(Id = sequence(n())) %>%           # Create a new id variable
  gather(var2, value, Colour, Code) %>%    # Go long a second time
  unite(Key, var2, Id) %>%                 # Combine values to create a key
  spread(Key, value, fill = "")            # Convert back to a wide form
#    User Code_1 Code_2 Colour_1 Colour_2
# 1  Brad      U             Red
# 2  John      N      U    Green     Blue
# 3  Lucy      U             Red
# 4   Meg      N      U    Green      Red
# 5 Peter      U            Blue
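In current tidyr, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider(). As a hedged sketch of the same idea with the newer verbs: it assumes tidyr 1.1.0 or later (for a scalar values_fill), and the result name wide is introduced here for illustration. The column that records which of User1..User3 a name came from is dropped before widening, so that each user lands on exactly one row.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  Colour = c("Green", "Red", "Blue"),
  Code   = c("N", "U", "U"),
  User1  = c("John", "Brad", "Peter"),
  User2  = c("Meg", "Meg", "John"),
  User3  = c("", "Lucy", ""),
  stringsAsFactors = FALSE
)

wide <- df1 %>%
  # Long form: one row per (Colour, Code, User)
  pivot_longer(starts_with("User"), names_to = "var", values_to = "User") %>%
  filter(User != "") %>%            # drop the empty cells
  select(-var) %>%                  # the source column is not needed
  group_by(User) %>%
  mutate(Id = row_number()) %>%     # 1, 2, ... within each user
  ungroup() %>%
  # Wide form: Colour_1, Colour_2, Code_1, Code_2
  pivot_wider(names_from = Id, values_from = c(Colour, Code),
              values_fill = "")
print(wide)
```

Unlike the gather() version, only one widening step is needed, because pivot_wider() accepts several value columns at once.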
It's not very pretty, but here is another solution in pure base R, which uses a couple of calls to reshape():
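reshape(
  transform(
    subset(
      # first reshape(): stack the User columns into one long 'User' column
      reshape(df1, varying=grep('^User',names(df1)), dir='l', v.names='User'),
      User!=''),                            # drop the empty cells
    id=NULL,                                # remove the id column added by reshape()
    time=ave(c(User),User,FUN=seq_along),   # running count within each user
    User=factor(User)),
  # second reshape(): spread Colour and Code over the running count
  dir='w', idvar='User', sep='')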
## User Colour1 Code1 Colour2 Code2
## 1.1 John Green N Blue U
## 2.1 Brad Red U <NA> <NA>
## 3.1 Peter Blue U <NA> <NA>
## 1.2 Meg Green N Red U
## 2.3 Lucy Red U <NA> <NA>
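For readers who prefer named intermediate steps, here is a sketch of the same long-then-wide idea in base R. It is not the code above: the first reshape() call is swapped for a plain rbind() over the User columns, and the NA cells are replaced with empty strings so the result matches the layout asked for in the question. The names user_cols, long and wide are introduced here for illustration.

```r
df1 <- data.frame(
  Colour = c("Green", "Red", "Blue"),
  Code   = c("N", "U", "U"),
  User1  = c("John", "Brad", "Peter"),
  User2  = c("Meg", "Meg", "John"),
  User3  = c("", "Lucy", ""),
  stringsAsFactors = FALSE
)

user_cols <- grep("^User", names(df1), value = TRUE)

# Long form: stack the User columns, repeating Colour and Code for each
long <- do.call(rbind, lapply(user_cols, function(col) {
  data.frame(Colour = df1$Colour, Code = df1$Code, User = df1[[col]],
             stringsAsFactors = FALSE)
}))
long <- long[long$User != "", ]   # drop the empty cells

# Running count within each user: 1 for the first colour, 2 for the second
long$time <- ave(seq_along(long$User), long$User, FUN = seq_along)

# Wide form: one row per user, Colour/Code pairs numbered by 'time'
wide <- reshape(long, direction = "wide", idvar = "User",
                timevar = "time", sep = "")
wide[is.na(wide)] <- ""           # match the empty strings in the target
rownames(wide) <- NULL
print(wide)
```

The columns come out as User, Colour1, Code1, Colour2, Code2, with users in order of first appearance (John, Brad, Peter, Meg, Lucy).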