Select non-NA values and assign variables based on column names
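I have been given a dataset in which participants completed seven trials in one of 16 possible conditions. The 16 conditions come from a 2x2x2x2 design (i.e., there are four manipulated variables, each with two levels). Say Var1 has levels "Pot" and "Pan", Var2 has levels "Hi" and "Low", Var3 has levels "Up" and "Down", and Var4 has levels "One" and "Two".
The dataset includes a column for every observation in every condition for every participant; that is, each row has 112 (16*7) columns (plus some columns containing demographics etc.), 105 (15*7) of which are empty. The conditions are encoded in the column labels, so the columns range from "PotHiUpOne1" to "PanLowDownTwo7".
So the data looks like this: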
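library(dplyr)  # provides %>% and arrange()

Var1 <- c('Pot', 'Pan')
Var2 <- c('Hi', 'Low')
Var3 <- c('Up', 'Down')
Var4 <- c('One', 'Two')
Obs <- seq(1, 7, 1)
# expand.grid() names its columns Var1..Var5, so arrange() sorts by the four condition columns
df <- expand.grid(Var1, Var2, Var3, Var4, Obs)
df <- df %>%
  arrange(Var1, Var2, Var3, Var4)
x <- apply(df, 1, paste, collapse = "")  # 112 column names, e.g. "PotHiUpOne1"
id <- seq(1, 16, 1)
age <- rep(20, 16)
df <- as.data.frame(cbind(id, age))
# Add one empty (NA) column per condition/observation combination
for (i in seq_along(x)) {
  df[, ncol(df) + 1] <- NA
  names(df)[ncol(df)] <- x[i]
}
# Fill each participant's seven observation columns with a dummy value
j <- seq(3, ncol(df), 7)
for (i in seq_len(nrow(df))) {
  df[i, j[i]:(j[i] + 6)] <- 10
}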
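I want to tidy this data frame so that each row has 4 columns (one per variable) specifying the condition and 7 columns containing the observations.
My solution was to filter the data with dplyr like this:
Df1 <- df %>%
  filter(!is.na(PotHiUpOne1)) %>%
  mutate(Var1 = 'pot', Var2 = 'hi', Var3 = 'up', Var4 = 'one')
and then drop the NA columns like this:
Df1 <- Filter(function(x) !all(is.na(x)), Df1)
I did this 16 times (once per condition) and then, after renaming the remaining 7 observation columns so that they matched, bound the 16 data frames I had created back together.
I was wondering whether anyone could suggest a more efficient approach, preferably using dplyr.
Edit: I should add that by "efficient" I mean a more elegant way of coding this, not one that runs faster (the dataset is not large); that is, something that does not involve writing out the same block of code 16 times.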
Hope this is what you want:
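library(data.table)
dtt <- as.data.table(df)
# Reshape to long format and drop the empty cells
dtt2 <- melt(dtt, id.vars = c('id', 'age'))[!is.na(value)]
# Split each column name before every capital letter or digit, e.g. "PotHiUpOne1" -> Pot, Hi, Up, One, 1
dtt2[, c('var1', 'var2', 'var3', 'var4', 'cond') := tstrsplit(variable, '(?!^)(?=[A-Z0-9])', perl = TRUE)]
dtt2[, variable := NULL]
# Back to wide format: one row per participant, one column per observation number
dcast(dtt2, ... ~ cond, value.var = 'value')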
# id age var1 var2 var3 var4 1 2 3 4 5 6 7
# 1: 1 20 Pot Hi Up One 10 10 10 10 10 10 10
# 2: 2 20 Pot Hi Up Two 10 10 10 10 10 10 10
# 3: 3 20 Pot Hi Down One 10 10 10 10 10 10 10
# 4: 4 20 Pot Hi Down Two 10 10 10 10 10 10 10
# 5: 5 20 Pot Low Up One 10 10 10 10 10 10 10
# 6: 6 20 Pot Low Up Two 10 10 10 10 10 10 10
# 7: 7 20 Pot Low Down One 10 10 10 10 10 10 10
# 8: 8 20 Pot Low Down Two 10 10 10 10 10 10 10
# 9: 9 20 Pan Hi Up One 10 10 10 10 10 10 10
# 10: 10 20 Pan Hi Up Two 10 10 10 10 10 10 10
# 11: 11 20 Pan Hi Down One 10 10 10 10 10 10 10
# 12: 12 20 Pan Hi Down Two 10 10 10 10 10 10 10
# 13: 13 20 Pan Low Up One 10 10 10 10 10 10 10
# 14: 14 20 Pan Low Up Two 10 10 10 10 10 10 10
# 15: 15 20 Pan Low Down One 10 10 10 10 10 10 10
# 16: 16 20 Pan Low Down Two 10 10 10 10 10 10 10
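Since the question asks for a dplyr-style approach, the same reshape, split and spread idea can probably be expressed with tidyr as well. The following is only a sketch, assuming tidyr 1.0.0 or later (for pivot_longer and pivot_wider). The data frame name wide, the obs column prefix and the regular expression are illustrative choices, not something taken from the question or the answer.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the question's data (assumed layout): 16 participants,
# one per condition, 7 filled observation columns each, the other 105 NA.
conds <- expand.grid(Var4 = c("One", "Two"), Var3 = c("Up", "Down"),
                     Var2 = c("Hi", "Low"), Var1 = c("Pot", "Pan"),
                     stringsAsFactors = FALSE)
cond_names <- paste0(conds$Var1, conds$Var2, conds$Var3, conds$Var4)
col_names  <- paste0(rep(cond_names, each = 7), 1:7)

wide <- data.frame(id = 1:16, age = 20)
wide[col_names] <- NA_real_
for (i in 1:16) wide[i, paste0(cond_names[i], 1:7)] <- 10

tidy <- wide %>%
  # Long format; the column name is split into its five parts by the regex,
  # and the empty cells are dropped along the way.
  pivot_longer(
    cols = -c(id, age),
    names_to = c("Var1", "Var2", "Var3", "Var4", "obs"),
    names_pattern = "^(Pot|Pan)(Hi|Low)(Up|Down)(One|Two)([0-9]+)$",
    values_drop_na = TRUE
  ) %>%
  # Back to one row per participant, with columns obs1 ... obs7.
  pivot_wider(names_from = obs, values_from = value, names_prefix = "obs")
```

Spelling out the levels in names_pattern keeps the split explicit; with other labels the pattern would have to be adapted.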
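Well, this isn't as clean as mt1022's solution, but it doesn't require data.table. The case_when function requires dplyr; all the other functions are base R.
Define two new functions, find_conditions and transform.
find_conditions is a bit clunky, but it may be useful because you can easily add new definitions as needed.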
find_conditions <- function(x) {
  x1 <- case_when(
    x == "PotHiUpOne" ~ c("pot", "hi", "up", "one"),
    x == "PotHiUpTwo" ~ c("pot", "hi", "up", "two"),
    x == "PotHiDownOne" ~ c("pot", "hi", "down", "one"),
    x == "PotHiDownTwo" ~ c("pot", "hi", "down", "two"),
    x == "PotLowUpOne" ~ c("pot", "low", "up", "one"),
    x == "PotLowUpTwo" ~ c("pot", "low", "up", "two"),
    x == "PotLowDownOne" ~ c("pot", "low", "down", "one"),
    x == "PotLowDownTwo" ~ c("pot", "low", "down", "two"),
    x == "PanHiUpOne" ~ c("pan", "hi", "up", "one"),
    x == "PanHiUpTwo" ~ c("pan", "hi", "up", "two"),
    x == "PanHiDownOne" ~ c("pan", "hi", "down", "one"),
    x == "PanHiDownTwo" ~ c("pan", "hi", "down", "two"),
    x == "PanLowUpOne" ~ c("pan", "low", "up", "one"),
    x == "PanLowUpTwo" ~ c("pan", "low", "up", "two"),
    x == "PanLowDownOne" ~ c("pan", "low", "down", "one"),
    x == "PanLowDownTwo" ~ c("pan", "low", "down", "two")
  )
  # Input that matches no case comes back as NA; fail loudly instead of printing a message
  if (anyNA(x1)) {
    stop("Input not recognized")
  }
  x1
}
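If listing every condition by hand becomes tedious, the label could instead be split at its capital letters with base R. This is only a sketch: split_condition is a hypothetical helper, not part of the answer, and it assumes every label consists of exactly four capitalized words.

```r
# Hypothetical helper: split a label such as "PotLowDownTwo" into its
# capitalized words and return them in lower case.
split_condition <- function(label) {
  parts <- regmatches(label, gregexpr("[A-Z][a-z]*", label))[[1]]
  if (length(parts) != 4) {
    stop("Input not recognized: ", label)
  }
  tolower(parts)
}

split_condition("PotLowDownTwo")
# [1] "pot"  "low"  "down" "two"
```

Unlike the lookup above, this accepts any four capitalized words, so a misspelled level would pass through unnoticed.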
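transform takes a row from df and converts it into the form we want. It depends on the find_conditions function we have already defined. Note that defining it this way masks base R's transform() for the rest of the session.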
transform <- function(row) {
  row1 <- row[3:length(row)]              # Set aside the id and age columns; they are put back at the end
  cols <- colnames(row1)[!is.na(row1)]    # Get names of the columns which are not NA
  cols <- substr(cols, 1, nchar(cols) - 1)  # Slice off the last character (the observation number)
  cols <- cols[!duplicated(cols)]         # All names are now identical; keep a single copy
  vars <- find_conditions(cols)           # Break the name up into the individual conditions
  row1 <- row1[!is.na(row1)]              # Keep only non-NA values
  new_row <- c(row[1:2], row1, vars)      # Put id, age, row1, vars together
  as.vector(unlist(new_row))              # Return as an unnamed vector
}
Now using the two functions is straightforward:
l1 <- list()  # Initialize empty list
for (i in seq_len(nrow(df))) {
  l1[[i]] <- transform(df[i, ])  # Fill list with transformed rows
}
DF1 <- data.frame(do.call("rbind", l1))  # Bind the transformed rows together
As you said, it's not a big dataset, so leaving it in a loop is fine. Good luck!