Rearranging information from a data frame in R

I have the following df, which was obtained from an Excel file:

df1 <- data.frame( Colour = c("Green","Red","Blue"), 
                   Code = c("N","U", "U"), 
                   User1 = c("John","Brad","Peter"), 
                   User2 = c("Meg","Meg","John"), 
                   User3= c("", "Lucy", ""))

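I need to rearrange it into a data frame in which every name is listed in the first column (only once) and the colours (with their respective codes) appear in the following columns, as shown: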

df2 <- data.frame(User=c("John","Brad","Peter","Meg","Lucy"),
                  Color1 = c("Green","Red","Blue","Green","Red"),
                  Code1 = c("N","U","U","N","U"), 
                  Color2=c("Blue","","","Red",""),
                  Code2=c("U","","","U",""))

Many thanks for your help.

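We can use dcast from the devel version of data.table, i.e. v1.9.5+, which can take multiple value.var columns. We convert the data.frame to a data.table (setDT(df1)), melt the data with 'Colour' and 'Code' as the id columns, remove the rows where 'User' is '' ([User!='']), create a sequence column grouped by 'User', and then dcast. Installation instructions are here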

library(data.table)#v1.9.5+
dcast(melt(setDT(df1), id.var=c('Colour', 'Code'), 
           value.name='User')[User!=''][,
              N:=1:.N, User], User~N, value.var=c('Colour', 'Code'))
#    User 1_Colour 2_Colour 1_Code 2_Code
#1:  Brad      Red       NA      U     NA
#2:  John    Green     Blue      N      U
#3:  Lucy      Red       NA      U     NA
#4:   Meg    Green      Red      N      U
#5: Peter     Blue       NA      U     NA

Or, as @Arun mentioned in the comments, we can use the subset argument of dcast instead of [User!='']

dcast(melt(setDT(df1), id.var=c('Colour', 'Code'), 
             value.name='User')[,N:= 1:.N, User],
       subset=.(User !=''), User~N, value.var=c('Colour', 'Code'))
#    User 1_Colour 2_Colour 1_Code 2_Code
#1:  Brad      Red       NA      U     NA
#2:  John    Green     Blue      N      U
#3:  Lucy      Red       NA      U     NA
#4:   Meg    Green      Red      N      U
#5: Peter     Blue       NA      U     NA
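The nested call above can be hard to read, so here is the same melt / filter / number / dcast sequence split into named steps. This is only a sketch: the intermediate names `long` and `wide` are made up for illustration, `as.data.table()` is used instead of `setDT()` so that `df1` is left untouched, and the exact names of the output columns may differ between data.table versions.

```r
library(data.table)

df1 <- data.frame(Colour = c("Green", "Red", "Blue"),
                  Code   = c("N", "U", "U"),
                  User1  = c("John", "Brad", "Peter"),
                  User2  = c("Meg", "Meg", "John"),
                  User3  = c("", "Lucy", ""),
                  stringsAsFactors = FALSE)

# 1. Stack User1..User3 into a single 'User' column (long form)
long <- melt(as.data.table(df1), id.vars = c("Colour", "Code"),
             value.name = "User")

# 2. Drop the rows that came from empty cells
long <- long[User != ""]

# 3. Number the rows within each user: 1 for the first colour, 2 for the second, ...
long[, N := seq_len(.N), by = User]

# 4. Spread Colour and Code into one pair of columns per value of N
wide <- dcast(long, User ~ N, value.var = c("Colour", "Code"))
wide
```

Printing `long` after step 3 is a quick way to check that the numbering is what you expect before reshaping.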

I was hesitant to post this because it is conceptually similar to @akrun's answer, but you can also use merged.stack from my "splitstackshape" package together with reshape from base R.

library(splitstackshape)
reshape(
  getanID(
    merged.stack(df1, var.stubs = "User", sep = "var.stubs")[User != ""], 
    "User"), 
  direction = "wide", idvar = "User", timevar = ".id", drop = ".time_1")
#     User Colour.1 Code.1 Colour.2 Code.2
# 1: Peter     Blue      U       NA     NA
# 2:  John     Blue      U    Green      N
# 3:   Meg    Green      N      Red      U
# 4:  Brad      Red      U       NA     NA
# 5:  Lucy      Red      U       NA     NA

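merged.stack makes the data long, getanID creates an ID variable that can be used when going to a wide format, and reshape does the actual transformation from this semi-wide form to a fully wide form.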
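To make the role of that ID variable concrete, here is a small base R illustration of a within-group counter. It is not how getanID is implemented; it only shows the idea, and the `users` vector is an example listing the non-empty names from df1 column by column.

```r
# Non-empty user names in stacked (long) order: User1, then User2, then User3
users <- c("John", "Brad", "Peter", "Meg", "Meg", "John", "Lucy")

# Count occurrences within each name: first appearance gets 1, second gets 2, ...
id <- ave(seq_along(users), users, FUN = seq_along)
id
# [1] 1 1 1 1 2 2 1
```

This counter is what becomes the ".1" / ".2" suffix of the wide column names.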


This is the best I could come up with for "dplyr" + "tidyr" users. It looks verbose, but it shouldn't be too hard to follow:

library(dplyr)
library(tidyr)

df1 %>%
  gather(var, User, User1:User3) %>%      # Get the data into a long form
  filter(User != "") %>%                  # Drop rows with an empty User
  select(-var) %>%                        # Drop the old column names so each User ends up on one row
  group_by(User) %>%                      # Group by User
  mutate(Id = sequence(n())) %>%          # Create a new id variable
  gather(var2, value, Colour, Code) %>%   # Go long a second time
  unite(Key, var2, Id) %>%                # Combine values to create a key
  spread(Key, value, fill = "")           # Convert back to a wide form
#    User Code_1 Code_2 Colour_1 Colour_2
# 1  Brad      U             Red         
# 2  John      N      U    Green     Blue
# 3  Lucy      U             Red         
# 4   Meg      N      U    Green      Red
# 5 Peter      U            Blue         
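For readers on a recent tidyr, the same long / number / wide idea can be sketched with pivot_longer() and pivot_wider(), which have superseded gather() and spread(). This is a swapped-in variant rather than the code from the answer above; it assumes tidyr 1.0 or later, and the name `res` is only for illustration.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(Colour = c("Green", "Red", "Blue"),
                  Code   = c("N", "U", "U"),
                  User1  = c("John", "Brad", "Peter"),
                  User2  = c("Meg", "Meg", "John"),
                  User3  = c("", "Lucy", ""),
                  stringsAsFactors = FALSE)

res <- df1 %>%
  # Stack User1..User3 into one 'User' column
  pivot_longer(starts_with("User"), names_to = "var", values_to = "User") %>%
  select(-var) %>%                 # The original column name is not needed
  filter(User != "") %>%           # Drop rows that came from empty cells
  group_by(User) %>%
  mutate(Id = row_number()) %>%    # 1 for a user's first colour, 2 for the second, ...
  ungroup() %>%
  # One Colour_<Id> / Code_<Id> pair of columns per Id, "" where a user has no entry
  pivot_wider(names_from = Id, values_from = c(Colour, Code),
              values_fill = list(Colour = "", Code = ""))
res
```

Because both value columns are handled in a single pivot_wider() call, the second gather() and the unite() step are not needed here.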

It's not very pretty, but here is another solution in pure base R that uses a couple of calls to reshape():

reshape(transform(subset(reshape(df1, varying = grep('^User', names(df1)),
                                 direction = 'long', v.names = 'User'),
                         User != ''),
                  id = NULL, time = ave(c(User), User, FUN = seq_along),
                  User = factor(User)),
        direction = 'wide', idvar = 'User', sep = '')
##      User Colour1 Code1 Colour2 Code2
## 1.1  John   Green     N    Blue     U
## 2.1  Brad     Red     U    <NA>  <NA>
## 3.1 Peter    Blue     U    <NA>  <NA>
## 1.2   Meg   Green     N     Red     U
## 2.3  Lucy     Red     U    <NA>  <NA>