Check for multiple NA columns and return another column in R
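I have a data frame with multiple columns named "avg_metric", "wkday_avg_metric", "event_avg_metric" and "monthly_avg_metric", where "metric" stands for several metrics (orders, revenue, etc.) that have these calculations. I have to check multiple columns and, where a row is NA, replace it with the value from the same row of another column. To do that, I created a function that performs the same check for whichever "metric" column I specify. The problem is that I get the same value for the entire new column I'm creating, which should not be the case.
I've added an example_fixed below showing the expected result.
Is there an easier way to do this? Or am I missing some logic in my function?
Thanks.
EDIT: My function had mistakes, but I'm sure there is a better solution. I tried your solutions but couldn't apply them to my data frame. I've updated the reprex so you can help me better.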
library(tidyverse)
(example <- tibble(country = c("A", "B", "C", "D"),
brand = c("A", "A", "B", "B"),
event = c(1:4),
month = c(1:4),
weekday = c(1:4),
avg_visits = c(5028, NA, NA, NA),
avg_revenue = c(12345, NA, NA, NA),
wkday_avg_visits = c(1234, 4355, NA, NA),
wkday_avg_revenue = c(12345, 54321, NA, NA),
event_avg_visits = c(51271, 59212, 98773, NA),
event_avg_revenue = c(98764, 56435, 35634, NA),
monthly_avg_visits = c(5028, 5263, 6950, 8902),
monthly_avg_revenue = c(63457, 34536, 34574, 23426))) %>%
print(width = Inf)
#> # A tibble: 4 x 13
#> country brand event month weekday avg_visits avg_revenue wkday_avg_visits
#> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 A A 1 1 1 5028 12345 1234
#> 2 B A 2 2 2 NA NA 4355
#> 3 C B 3 3 3 NA NA NA
#> 4 D B 4 4 4 NA NA NA
#> wkday_avg_revenue event_avg_visits event_avg_revenue monthly_avg_visits
#> <dbl> <dbl> <dbl> <dbl>
#> 1 12345 51271 98764 5028
#> 2 54321 59212 56435 5263
#> 3 NA 98773 35634 6950
#> 4 NA NA NA 8902
#> monthly_avg_revenue
#> <dbl>
#> 1 63457
#> 2 34536
#> 3 34574
#> 4 23426
subs_metric <- function(data, metric) {
avg <- paste0("avg_", metric)
wkday_avg <- paste0("wkday_avg_", metric)
event_avg <- paste0("event_avg_", metric)
monthly_avg <- paste0("monthly_avg_", metric)
for (i in nrow(data)) {
value <- if (is.na(data[[avg]][i]) & is.na(data[[wkday_avg]][i]) & is.na(data[[event_avg]][i])) {
data[[monthly_avg]][i]
} else if (is.na(data[[avg]][i]) & is.na(data[[wkday_avg]][i])) {
data[[event_avg]][i]
} else if (is.na(data[[avg]][i])) {
data[[wkday_avg]][i]
} else {
data[[avg]][i]
}
return(value)
}
}
example %>%
mutate(avg_visits_new = subs_metric(., "visits"),
avg_revenue_new = subs_metric(., "revenue")) %>%
print(width = Inf)
#> # A tibble: 4 x 15
#> country brand event month weekday avg_visits avg_revenue wkday_avg_visits
#> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 A A 1 1 1 5028 12345 1234
#> 2 B A 2 2 2 NA NA 4355
#> 3 C B 3 3 3 NA NA NA
#> 4 D B 4 4 4 NA NA NA
#> wkday_avg_revenue event_avg_visits event_avg_revenue monthly_avg_visits
#> <dbl> <dbl> <dbl> <dbl>
#> 1 12345 51271 98764 5028
#> 2 54321 59212 56435 5263
#> 3 NA 98773 35634 6950
#> 4 NA NA NA 8902
#> monthly_avg_revenue avg_visits_new avg_revenue_new
#> <dbl> <dbl> <dbl>
#> 1 63457 8902 23426
#> 2 34536 8902 23426
#> 3 34574 8902 23426
#> 4 23426 8902 23426
(example_fixed <- tibble(country = c("A", "B", "C", "D"),
brand = c("A", "A", "B", "B"),
event = c(1:4),
month = c(1:4),
weekday = c(1:4),
avg_visits = c(5028, NA, NA, NA),
avg_revenue = c(12345, NA, NA, NA),
wkday_avg_visits = c(1234, 4355, NA, NA),
wkday_avg_revenue = c(12345, 54321, NA, NA),
event_avg_visits = c(51271, 59212, 98773, NA),
event_avg_revenue = c(98764, 56435, 35634, NA),
monthly_avg_visits = c(5028, 5263, 6950, 8902),
monthly_avg_revenue = c(63457, 34536, 34574, 23426),
avg_visits_new = c(5028, 4355, 98773, 8902),
avg_revenue_new = c(12345, 54321, 35634, 23426))) %>%
print(width = Inf)
#> # A tibble: 4 x 15
#> country brand event month weekday avg_visits avg_revenue wkday_avg_visits
#> <chr> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 A A 1 1 1 5028 12345 1234
#> 2 B A 2 2 2 NA NA 4355
#> 3 C B 3 3 3 NA NA NA
#> 4 D B 4 4 4 NA NA NA
#> wkday_avg_revenue event_avg_visits event_avg_revenue monthly_avg_visits
#> <dbl> <dbl> <dbl> <dbl>
#> 1 12345 51271 98764 5028
#> 2 54321 59212 56435 5263
#> 3 NA 98773 35634 6950
#> 4 NA NA NA 8902
#> monthly_avg_revenue avg_visits_new avg_revenue_new
#> <dbl> <dbl> <dbl>
#> 1 63457 5028 12345
#> 2 34536 4355 54321
#> 3 34574 98773 35634
#> 4 23426 8902 23426
Created on 2020-07-07 by the reprex package (v0.3.0)
We can use the following (this assumes example holds only the four visits columns, as in the output shown):
example$avg_visits_new <- apply(example,1,function(x) x[!is.na(x)][1])
# A tibble: 4 x 5
avg_visits wkday_avg_visits event_avg_visits monthly_avg_visits avg_visits_new
<dbl> <dbl> <dbl> <dbl> <dbl>
1 5028 1234 51271 5028 5028
2 NA 4355 59212 5263 4355
3 NA NA 98773 6950 98773
4 NA NA NA 8902 8902
This simply goes row by row and takes the first non-NA value it finds.
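To make the row-wise idea concrete, here is a minimal base R sketch on a toy data frame that mirrors the question's visits columns. The object names `visits` and `first_non_na` are made up for illustration. It relies on the columns being ordered by priority (avg, wkday, event, monthly) and on all of them being numeric, since `apply()` converts the data frame to a matrix first.

```r
# Toy data mirroring the visits columns from the question.
visits <- data.frame(
  avg_visits         = c(5028, NA, NA, NA),
  wkday_avg_visits   = c(1234, 4355, NA, NA),
  event_avg_visits   = c(51271, 59212, 98773, NA),
  monthly_avg_visits = c(5028, 5263, 6950, 8902)
)

# For each row: drop the NAs, then keep the first value that is left.
first_non_na <- apply(visits, 1, function(x) x[!is.na(x)][1])
first_non_na
```

If the data frame also holds non-numeric columns such as country or brand, select only the metric columns before calling `apply()`, otherwise the first non-NA value in each row will come from those columns instead.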
EDIT:
Here is a loop that reuses the code above across all metrics.
metric <- unique(sub(".*_(.*)","\\1",colnames(example)[-(1:5)]))
for(i in metric){
example <- cbind(example, print(apply(example[,grepl(i,colnames(example))],1,function(x) x[!is.na(x)][1])))
}
colnames(example)[(ncol(example)-length(metric)+1):ncol(example)] <- paste0("avg_",metric,"_new")
> example
country brand event month weekday avg_visits avg_revenue wkday_avg_visits wkday_avg_revenue event_avg_visits event_avg_revenue monthly_avg_visits monthly_avg_revenue avg_visits_new avg_revenue_new
1 A A 1 1 1 5028 12345 1234 12345 51271 98764 5028 63457 5028 12345
2 B A 2 2 2 NA NA 4355 54321 59212 56435 5263 34536 4355 54321
3 C B 3 3 3 NA NA NA NA 98773 35634 6950 34574 98773 35634
4 D B 4 4 4 NA NA NA NA NA NA 8902 23426 8902 23426
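There are better ways to do this; for example, you could replace the whole function with:
subs_metric <- function(data, metric)
{
# select the matching columns (not rows) and coalesce them left to right
data.table::fcoalesce(data[grep(metric, names(data))])
}
This gives the correct result: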
example %>%
mutate(avg_visits_new = subs_metric(., "visits"))
#> # A tibble: 4 x 5
#> avg_visits wkday_avg_visits event_avg_visits monthly_avg_visits avg_visits_new
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5028 1234 51271 5028 5028
#> 2 NA 4355 59212 5263 4355
#> 3 NA NA 98773 6950 98773
#> 4 NA NA NA 8902 8902
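For readers who prefer not to add a package dependency, the same coalescing idea can be sketched in base R with `Reduce()` and `ifelse()`. This is only an illustration: `coalesce_metric` is a hypothetical helper name, and the column prefixes are spelled out explicitly so that the priority order does not depend on how the columns happen to be arranged in the data frame.

```r
# Metric columns copied from the question's reprex.
example <- data.frame(
  avg_visits          = c(5028, NA, NA, NA),
  avg_revenue         = c(12345, NA, NA, NA),
  wkday_avg_visits    = c(1234, 4355, NA, NA),
  wkday_avg_revenue   = c(12345, 54321, NA, NA),
  event_avg_visits    = c(51271, 59212, 98773, NA),
  event_avg_revenue   = c(98764, 56435, 35634, NA),
  monthly_avg_visits  = c(5028, 5263, 6950, 8902),
  monthly_avg_revenue = c(63457, 34536, 34574, 23426)
)

# Hypothetical helper: fall back through the columns in a fixed priority order.
coalesce_metric <- function(data, metric) {
  cols <- paste0(c("avg_", "wkday_avg_", "event_avg_", "monthly_avg_"), metric)
  # Reduce() walks the columns left to right; ifelse() keeps the value found
  # so far and only fills the positions that are still NA.
  Reduce(function(found, nxt) ifelse(is.na(found), nxt, found), data[cols])
}

# Add one new column per metric.
for (m in c("visits", "revenue")) {
  example[[paste0("avg_", m, "_new")]] <- coalesce_metric(example, m)
}
example[c("avg_visits_new", "avg_revenue_new")]
```

Because the whole column is processed at once, no explicit row loop is needed, and adding another metric only means extending the vector passed to the `for` loop.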
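However, I'm sure you want to know which flaws in your code kept the loop from working as expected.
First, your loop starts with for (i in nrow(data)). Since your data frame has 4 rows, this means for (i in 4), so the loop runs only once, with i set to 4. I think you meant for (i in 1:nrow(data)).
Second, you are returning value inside the loop. This means that whenever the loop runs, it runs only once and then the function returns value. I think this is just a misplaced curly brace.
Third, you are overwriting value on each iteration of the loop. You want value to be the vector that makes up the new column, so you need to declare value first and write to value[i] on each iteration.
Putting these changes together, we have: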
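The first two points are easy to check in isolation. The snippet below is a small standalone sketch; the names `n`, `visited`, `all_rows` and `first_only` are invented for the demonstration and do not appear in the question.

```r
n <- 4

# for (i in n) iterates over the single value 4, so the body runs once.
visited <- c()
for (i in n) visited <- c(visited, i)
visited

# seq_len(n) produces 1, 2, 3, 4, so the body runs once per row.
all_rows <- c()
for (i in seq_len(n)) all_rows <- c(all_rows, i)
all_rows

# return() inside the loop leaves the function during the first iteration.
first_only <- function(x) {
  for (i in seq_along(x)) {
    return(x[i])
  }
}
first_only(c(10, 20, 30))
```

`seq_len(n)` behaves like `1:n` for positive `n`, but it yields an empty sequence when `n` is 0, whereas `1:0` counts down from 1 to 0, so it is the safer choice for data frames that might be empty.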
subs_metric <- function(data, metric) {
avg <- paste0("avg_", metric)
wkday_avg <- paste0("wkday_avg_", metric)
event_avg <- paste0("event_avg_", metric)
monthly_avg <- paste0("monthly_avg_", metric)
value <- numeric(nrow(data))
for (i in 1:nrow(data)) {
value[i] <- if (is.na(data[[avg]][i]) &
is.na(data[[wkday_avg]][i]) &
is.na(data[[event_avg]][i])) {
data[[monthly_avg]][i]
} else if (is.na(data[[avg]][i]) &
is.na(data[[wkday_avg]][i])) {
data[[event_avg]][i]
} else if (is.na(data[[avg]][i])) {
data[[wkday_avg]][i]
} else {
data[[avg]][i]
}
}
return(value)
}
This now gives the correct result:
example %>%
mutate(avg_visits_new = subs_metric(., "visits"))
#> # A tibble: 4 x 5
#> avg_visits wkday_avg_visits event_avg_visits monthly_avg_visits avg_visits_new
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5028 1234 51271 5028 5028
#> 2 NA 4355 59212 5263 4355
#> 3 NA NA 98773 6950 98773
#> 4 NA NA NA 8902 8902
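However, I would probably stick with one of the other solutions offered, since they are much shorter and more efficient than a row-by-row loop.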