R - Filling missing values (blanks) based on values in the same row but a different column
I am using R and have the following sample data frame, in which all variables are factors:
first    second            third
social   birth control     high
         birth control     high
medical  Anorexia Nervosa  low
medical  Anorexia Nervosa  low
         Alcoholism        high
family   Alcoholism        high
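Basically, I need a function that fills the blanks in the first column based on the values in the second and third columns.
For example, if the second column holds "birth control" and the third holds "high", I need to fill the blank in the first column with "social". If the second and third columns are "Alcoholism" and "high" respectively, I need to fill the blank in the first column with "family".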
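One approach is to create some sort of lookup list (for example, using a named vector, factor, or something similar) and then replace any "" values with the values from the lookup list.
Here is an example (though I think your problem is not fully defined and may be oversimplified).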
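library(dplyr)
library(tidyr)

mydf %>%
  # Build a temporary "second_third" key, keeping the original columns
  unite(condition, second, third, remove = FALSE) %>%
  # Map each key to its fill value via factor levels and labels
  mutate(condition = factor(condition,
                            levels = c("birth control_high", "Anorexia Nervosa_low", "Alcoholism_high"),
                            labels = c("social", "medical", "family"))) %>%
  mutate(condition = as.character(condition)) %>%
  # Fill only the blank entries of `first`
  mutate(first = replace(first, first == "", condition[first == ""])) %>%
  select(-condition)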
#     first           second third
# 1  social    birth control  high
# 2  social    birth control  high
# 3 medical Anorexia Nervosa   low
# 4 medical Anorexia Nervosa   low
# 5  family       Alcoholism  high
# 6  family       Alcoholism  high

A "data.table" approach would follow the same steps, but has the advantage of modifying by reference rather than copying.
library(data.table)

as.data.table(mydf)[
  , condition := sprintf("%s_%s", second, third)][
  , condition := as.character(
      factor(condition,
             levels = c("birth control_high", "Anorexia Nervosa_low", "Alcoholism_high"),
             labels = c("social", "medical", "family")))][
  first == "", first := condition][
  , condition := NULL][]
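The lookup-list idea mentioned above can also be sketched with a plain named vector in base R, with no extra packages. This is only a minimal illustration: the `lookup` vector and the "second_third" key format are assumptions made for the example, and the columns are stored as character rather than factor. If `first` is a factor, convert it with `as.character()` first, because assigning a value that is not among a factor's levels yields NA.

```r
# Sample data from the question, stored as plain character columns
df1 <- data.frame(
  first  = c("social", "", "medical", "medical", "", "family"),
  second = c("birth control", "birth control", "Anorexia Nervosa",
             "Anorexia Nervosa", "Alcoholism", "Alcoholism"),
  third  = c("high", "high", "low", "low", "high", "high"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: names are "second_third" keys, values are the fill
lookup <- c("birth control_high"   = "social",
            "Anorexia Nervosa_low" = "medical",
            "Alcoholism_high"      = "family")

key   <- paste(df1$second, df1$third, sep = "_")  # one key per row
blank <- df1$first == ""                          # rows that need filling

# Index the named vector by key; unname() drops the key names from the result
df1$first[blank] <- unname(lookup[key[blank]])
df1
```

A key that is missing from `lookup` would produce NA for that row, which makes unmapped combinations easy to spot.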
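From the data shown, it is not clear whether 'first' holds other values for each combination of 'second' and 'third'. If there is only one value and you need to replace the '' entries with it, you could try

library(data.table)
# Returns the filled values per group (as column V1) without changing `first`
setDT(df1)[, replace(first, first == '', first[first != '']),
           list(second, third)]

Or, a more efficient approach is

# Overwrite `first` by reference with each group's non-blank value
setDT(df1)[, first := first[first != ''], list(second, third)]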
#     first           second third
#1:  social    birth control  high
#2:  social    birth control  high
#3: medical Anorexia Nervosa   low
#4: medical Anorexia Nervosa   low
#5:  family       Alcoholism  high
#6:  family       Alcoholism  high
Data

df1 <- structure(list(
  first = c("social", "", "medical", "medical", "", "family"),
  second = c("birth control", "birth control", "Anorexia Nervosa",
             "Anorexia Nervosa", "Alcoholism", "Alcoholism"),
  third = c("high", "high", "low", "low", "high", "high")),
  .Names = c("first", "second", "third"),
  class = "data.frame", row.names = c(NA, -6L))
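Another approach, with dplyr, using @akrun's very nice solution:

library(dplyr)
# Within each (second, third) group, replace blanks with the non-blank value
df1 %>%
  group_by(second, third) %>%
  mutate(first = replace(first, first == '', first[first != ''])) %>%
  ungroup()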
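The group-wise replacements above rely on each (second, third) group having a usable non-blank value in 'first'. As a hedged sketch of a more defensive variant, the base R code below uses `ave()` with a hypothetical helper, `fill_blank()`, that takes the first known value in each group and leaves a group untouched when it has none. The extra "Smoking" row is invented purely to show that case; it is not part of the original data.

```r
# Sample data from the question plus one hypothetical row ("Smoking", "low")
# whose group has no known value of `first`
df1 <- data.frame(
  first  = c("social", "", "medical", "medical", "", "family", ""),
  second = c("birth control", "birth control", "Anorexia Nervosa",
             "Anorexia Nervosa", "Alcoholism", "Alcoholism", "Smoking"),
  third  = c("high", "high", "low", "low", "high", "high", "low"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill blank or NA entries with the group's first known value
fill_blank <- function(x) {
  blank <- is.na(x) | x == ""
  known <- x[!blank]
  if (length(known) > 0) x[blank] <- known[1]
  x  # groups with no known value are returned unchanged
}

# ave() applies the helper within each (second, third) group and keeps row order
df1$first <- ave(df1$first, df1$second, df1$third, FUN = fill_blank)
df1
```

Taking only the first known value is a design choice for the sketch: if a group could legitimately contain several different non-blank values, the rule for choosing among them would need to be decided first.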