Fill a column in my data frame conditional on another column, with values from a third column
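Caveat: this question looks so simple that I, as a beginner, have probably just failed to find the right solution in the more complex threads on SO (I looked at several existing questions and elsewhere).
I want to fill one column of my data frame based on another column, using several more columns as input.
An example makes this clearer: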
Version1 Version2 Version3 Version4 Presented_version Color
1 blue red green yellow 1 NA
2 red blue yellow green 4 NA
3 yellow green red blue 3 NA
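I want to fill the "Color" column with the value from Version1/Version2/Version3/Version4. Presented_version tells me which of these four values is needed.
For example, in row 1 Presented_version is 1, so the value I need is in "Version1" ("blue"). Color in row 1 should therefore be blue.
Can anyone show me a way to do this without looping over the data frame with lots of "if" statements?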
structure(list(Version1 = structure(1:3, .Label = c("blue", "red",
"yellow"), class = "factor"), Version2 = structure(c(3L, 1L,
2L), .Label = c("blue", "green", "red"), class = "factor"), Version3 = structure(c(1L,
3L, 2L), .Label = c("green", "red", "yellow"), class = "factor"),
Version4 = structure(3:1, .Label = c("blue", "green", "yellow"
), class = "factor"), Presented_version = c(1L, 4L, 3L),
Color = c(NA, NA, NA)), class = "data.frame", row.names = c(NA,
-3L))
=======================
Edited!
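I simplified the example to explain my problem, but the example above differs from my actual dataset in several ways, so the solutions make assumptions that my data does not actually meet.
Here is a more accurate representation of the data.frame. In particular, there is no fixed match between Presented_version and the contents of the Version1...Version4 columns (it varies with an additional column, which I now call Painter), and Version1 to Version4 are not necessarily columns 1 to 4 of my dataset.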
FillerColumn Painter Version1 Version2 Version3 Version4 Version_presented Color FillerColumn.1
1 77 A blue red green yellow 1 NA 77
2 77 B red blue yellow green 4 NA 77
3 77 C yellow green red blue 3 NA 77
4 77 D red blue yellow green 1 NA 77
structure(list(FillerColumn = c(77L, 77L, 77L, 77L), Painter = structure(1:4, .Label = c("A",
"B", "C", "D"), class = "factor"), Version1 = structure(c(1L,
2L, 3L, 2L), .Label = c("blue", "red", "yellow"), class = "factor"),
Version2 = structure(c(3L, 1L, 2L, 1L), .Label = c("blue",
"green", "red"), class = "factor"), Version3 = structure(c(1L,
3L, 2L, 3L), .Label = c("green", "red", "yellow"), class = "factor"),
Version4 = structure(c(3L, 2L, 1L, 2L), .Label = c("blue",
"green", "yellow"), class = "factor"), Version_presented = c(1L,
4L, 3L, 1L), Color = c(NA, NA, NA, NA), FillerColumn.1 = c(77L,
77L, 77L, 77L)), class = "data.frame", row.names = c(NA,
-4L))
One way, using mapply:
cols <- grep("^Version", names(df))
df$Color <- unlist(mapply(function(x, y) df[x, cols][y],
1:nrow(df),df$Presented_version))
df
# Version1 Version2 Version3 Version4 Presented_version Color
#1 blue red green yellow 1 blue
#2 red blue yellow green 4 green
#3 yellow green red blue 3 red
And with apply:
apply(df, 1, function(x) x[cols][as.numeric(x["Presented_version"])])
#[1] "blue" "green" "red"
Instead of any loop, we can use a vectorized option that extracts the values with row/column indexing:
df1$color <- df1[1:4][cbind(1:nrow(df1), df1$Presented_version)]
df1$color
#[1] "blue" "green" "red"
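The row/column indexing above relies on the Version columns sitting in positions 1 to 4, which the edited question says is not guaranteed. A minimal sketch of the same idea that selects the columns by name instead is shown below. It assumes that a value k in Version_presented refers to the column named Versionk; if the mapping really depends on Painter, an extra lookup step would be needed. The names df2, ver_cols and target are introduced here only for illustration, and the strings are stored as character rather than factor for brevity.

```r
# Edited example data from the question (character columns instead of factors).
df2 <- data.frame(
  FillerColumn      = 77L,
  Painter           = c("A", "B", "C", "D"),
  Version1          = c("blue", "red", "yellow", "red"),
  Version2          = c("red", "blue", "green", "blue"),
  Version3          = c("green", "yellow", "red", "yellow"),
  Version4          = c("yellow", "green", "blue", "green"),
  Version_presented = c(1L, 4L, 3L, 1L),
  Color             = NA,
  FillerColumn.1    = 77L,
  stringsAsFactors  = FALSE
)

# Pick Version1..Version4 by name; the anchored pattern skips Version_presented.
ver_cols <- grep("^Version[0-9]+$", names(df2), value = TRUE)

# Assumption: version number k means the column called "Version<k>".
# Matching on names keeps the lookup independent of column order.
target <- match(paste0("Version", df2$Version_presented), ver_cols)

# Matrix indexing: one (row, column) pair per row, no explicit loop.
m <- as.matrix(df2[ver_cols])
df2$Color <- m[cbind(seq_len(nrow(df2)), target)]
df2$Color
```

Because the columns are found by name, FillerColumn and Painter can sit anywhere in the data frame without affecting the result.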
Benchmarks
dfN <- df1[rep(seq_len(nrow(df1)), 1e6),]
system.time({
dfN[1:4][cbind(1:nrow(dfN), dfN$Presented_version)]
})
# user system elapsed
# 1.216 0.110 1.321
system.time({
cols <- grep("^Version", names(dfN))
unlist(mapply(function(x, y) dfN[x, cols][y],
1:nrow(dfN),dfN$Presented_version))
})
# user system elapsed
# 319.907 1.644 322.418
Now, let's look at the other option, with apply:
system.time({
apply(dfN, 1, function(x) x[cols][as.numeric(x["Presented_version"])])
})
# user system elapsed
# 14.240 0.365 14.550
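Timings are only comparable if the three approaches return the same thing. A small sketch of that check on the three-row example follows. The data frame is rebuilt here with character columns and without the Color column so the snippet runs on its own, and the names by_index, by_mapply and by_apply are introduced only for this illustration.

```r
# Three-row example from the question, rebuilt so the snippet is self-contained.
df1 <- data.frame(
  Version1          = c("blue", "red", "yellow"),
  Version2          = c("red", "blue", "green"),
  Version3          = c("green", "yellow", "red"),
  Version4          = c("yellow", "green", "blue"),
  Presented_version = c(1L, 4L, 3L),
  stringsAsFactors  = FALSE
)
cols <- grep("^Version", names(df1))

# Vectorized matrix indexing.
by_index <- df1[cols][cbind(seq_len(nrow(df1)), df1$Presented_version)]

# Row-by-row lookup with mapply; names are dropped for the comparison.
by_mapply <- unlist(mapply(function(x, y) df1[x, cols][y],
                           seq_len(nrow(df1)), df1$Presented_version),
                    use.names = FALSE)

# apply() coerces each row to character, hence the as.numeric() call.
by_apply <- unname(apply(df1, 1, function(x) {
  x[cols][as.numeric(x["Presented_version"])]
}))

stopifnot(identical(by_index, by_mapply), identical(by_index, by_apply))
by_index
```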
I like messing around with datasets. Try the data.table melt approach:
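library(data.table)
library(stringr)  # for str_extract()

setDT(df)  # converts df to a data.table by reference
# Reshape to long format, then keep the rows whose version number
# matches Presented_version.
df1 <- melt(df,
            id.vars = 'Presented_version',
            measure.vars = patterns('Version'),
            value.name = 'Color',
            variable.name = 'Version')[
  , version1 := as.integer(str_extract(Version, '\\d+'))][
  Presented_version == version1][
  , version1 := NULL]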
which results in
Presented_version Version Color
1: 1 Version1 blue
2: 3 Version3 red
3: 4 Version4 green
And, if you want the information in the same structure as the original:
merge(df,
df1[, .(Presented_version, Color)],
by = 'Presented_version')
Presented_version Version1 Version2 Version3 Version4 Color
1: 1 blue red green yellow blue
2: 3 yellow green red blue red
3: 4 red blue yellow green green
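Merging on Presented_version works for the three-row example because each version number appears only once. In the edited data the value 1 appears twice, so a merge on that column would pair rows that do not belong together. A minimal sketch of a variant that carries an explicit row id through the melt is shown below. The names dt, row_id, long, version_num and picked are introduced here for illustration, and the column is called Presented_version as in the answer above.

```r
library(data.table)

# Four-row data with a repeated version number (rows 1 and 4 both use 1).
dt <- data.table(
  Version1          = c("blue", "red", "yellow", "red"),
  Version2          = c("red", "blue", "green", "blue"),
  Version3          = c("green", "yellow", "red", "yellow"),
  Version4          = c("yellow", "green", "blue", "green"),
  Presented_version = c(1L, 4L, 3L, 1L)
)

# Keep track of the original row so the result can be joined back safely.
dt[, row_id := .I]

long <- melt(dt,
             id.vars       = c("row_id", "Presented_version"),
             measure.vars  = patterns("^Version"),
             variable.name = "Version",
             value.name    = "Color")

# Extract the number from "Version<k>" and keep the matching rows only.
long[, version_num := as.integer(sub("^Version", "", Version))]
picked <- long[Presented_version == version_num, .(row_id, Color)]

# Update join on row_id: adds Color to dt in the original row order.
dt[picked, Color := i.Color, on = "row_id"]
dt[, row_id := NULL]
dt$Color
```

Joining on row_id rather than on Presented_version keeps exactly one Color per original row, whatever values the version column takes.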