NAs in a data frame split by country in R
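I would like to impute the NAs in my data frame using each country's own observations. In other words, the country-specific values should be taken into account when filling in the NAs. For example: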
Date Country Battles Riots
March 2018 Afghanistan 380 NA
March 2018 Yemen 88 5
March 2018 Mali 45 NA
April 2018 Afghanistan 350 NA
April 2018 Yemen NA 66
April 2018 Mali 67 NA
May 2018 Afghanistan NA 7
May 2018 Yemen NA NA
May 2018 Mali NA 6
I used the code below, but it clearly does not use country-specific information when computing the mean.
for(i in 6:ncol(my_data)) {
  my_data[ , i][is.na(my_data[ , i])] <- mean(my_data[ , i], na.rm = TRUE)
}
Thank you very much.
You can use:
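To illustrate why the loop ignores the country, here is a minimal sketch using only the Battles values from the example table. The vector names `country` and `battles` are made up for this illustration. The loop fills every NA with a single mean taken over the whole column, whereas the goal is one mean per country.

    # Values copied from the example table in the question
    country <- c("Afghanistan", "Yemen", "Mali",
                 "Afghanistan", "Yemen", "Mali",
                 "Afghanistan", "Yemen", "Mali")
    battles <- c(380, 88, 45, 350, NA, 67, NA, NA, NA)

    # What the loop uses: one mean over all rows of the column
    global_mean <- mean(battles, na.rm = TRUE)

    # What is wanted: a separate mean for each country
    country_means <- tapply(battles, country, mean, na.rm = TRUE)

Here `global_mean` is 186, while `country_means` holds 365 for Afghanistan, 56 for Mali and 88 for Yemen, so the loop would write 186 into every missing Battles cell regardless of country.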
library(dplyr)
library(tidyr)
df %>%
  group_by(Country) %>%
  mutate(across(c(Battles, Riots), ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
  ungroup()
which returns
Date Country Battles Riots
<chr> <chr> <dbl> <dbl>
1 March 2018 Afghanistan 380 7
2 March 2018 Yemen 88 5
3 March 2018 Mali 45 6
4 April 2018 Afghanistan 350 7
5 April 2018 Yemen 88 66
6 April 2018 Mali 67 6
7 May 2018 Afghanistan 365 7
8 May 2018 Yemen 88 35.5
9 May 2018 Mali 56 6
Data
df <- structure(list(Date = c("March 2018", "March 2018", "March 2018",
"April 2018", "April 2018", "April 2018", "May 2018", "May 2018",
"May 2018"), Country = c("Afghanistan", "Yemen", "Mali", "Afghanistan",
"Yemen", "Mali", "Afghanistan", "Yemen", "Mali"), Battles = c(380,
88, 45, 350, NA, 67, NA, NA, NA), Riots = c(NA, 5, NA, NA, 66,
NA, 7, NA, 6)), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))
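A data.table option (borrowing df from the previous answer)
library(data.table)

# Replace NAs in each column with that column's mean within each Country
setDT(df)[
  ,
  c("Battles", "Riots") := lapply(
    .(Battles, Riots),
    function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
  ),
  Country
][]
which gives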
Date Country Battles Riots
1: March 2018 Afghanistan 380 7.0
2: March 2018 Yemen 88 5.0
3: March 2018 Mali 45 6.0
4: April 2018 Afghanistan 350 7.0
5: April 2018 Yemen 88 66.0
6: April 2018 Mali 67 6.0
7: May 2018 Afghanistan 365 7.0
8: May 2018 Yemen 88 35.5
9: May 2018 Mali 56 6.0
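For comparison, the same per-country mean imputation can probably be done in base R with `ave()`, without extra packages. This is only a sketch under some assumptions: the helper name `impute_group_mean` is made up for this example, the Date column is left out for brevity, and the data frame is rebuilt inline so the snippet runs on its own.

    # Example data from the question (Date column omitted for brevity)
    df <- data.frame(
      Country = rep(c("Afghanistan", "Yemen", "Mali"), times = 3),
      Battles = c(380, 88, 45, 350, NA, 67, NA, NA, NA),
      Riots   = c(NA, 5, NA, NA, 66, NA, 7, NA, 6)
    )

    # Hypothetical helper: fill the NAs of a vector with the mean of its observed values
    impute_group_mean <- function(x) {
      x[is.na(x)] <- mean(x, na.rm = TRUE)
      x
    }

    # ave() applies the helper separately within each Country
    # and returns a vector in the original row order
    for (col in c("Battles", "Riots")) {
      df[[col]] <- ave(df[[col]], df$Country, FUN = impute_group_mean)
    }

Note that in all three approaches a country with no observed values at all in a column would get NaN there, because the mean of an empty set is NaN; such groups need a separate fallback.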