Tricky conditional imputation, ideally using Tidyverse
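I have a problem where I need to do some tricky conditional imputation of missing values while flagging the imputed values, and I can't quite figure out how to approach it.
My data is in tidy (long) format. What I want is a complete dataset in which every "state" has a full set of rows holding "births" values for "Male", "Female", and "Total". If a state is missing "Total", it is imputed as "Male" + "Female" for that state. If we have "Total" but are missing either "Male" or "Female", the missing "births" value is computed as "Total" minus whichever of the two is present.
However, a missing value can only be imputed if "source" is the same for all existing rows of that state. We can't impute from data combined from different sources. Finally, every imputed row should carry its parent state and source, and the binary "aggregated" column should be flagged with 1.
A reprex is below, followed by an example of the desired result with a quick explanation. I'd like to do this with Tidyverse if possible, but I'm open to better solutions. Thanks in advance!!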
sex <- c("Male", "Female", "Total", "Male", "Female", "Male", "Female", "Male", "Total")
state <- c("New Jersey", "New Jersey", "New Jersey", "Vermont", "Vermont", "Washington", "Washington", "Montana", "Montana")
source <- c("WHO", "WHO", "WHO", "CDC", "CDC", "UN", "CDC", "UN", "UN")
aggregated <- c(0, 0, 0, 0, 0, 0, 0, 0, 0)
births <- c(20, 30, 50, 15, 16, 20, 27, 15, 33)
df <- data.frame(sex, state, source, aggregated, births)
df
     sex      state source aggregated births
1   Male New Jersey    WHO          0     20
2 Female New Jersey    WHO          0     30
3  Total New Jersey    WHO          0     50
4   Male    Vermont    CDC          0     15
5 Female    Vermont    CDC          0     16
6   Male Washington     UN          0     20
7 Female Washington    CDC          0     27
8   Male    Montana     UN          0     15
9  Total    Montana     UN          0     33
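Explanation of the resulting set
New Jersey: complete from the start, no changes
Vermont: missing Total, all sources are the same (CDC), so a new Total row is created with births imputed as Male + Female
Washington: missing Total, but Male and Female come from different sources, so nothing can be imputed
Montana: missing Female, all sources are the same (UN), so a new Female row is created with births imputed as Total - Male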
      sex      state source aggregated births
1    Male New Jersey    WHO          0     20
2  Female New Jersey    WHO          0     30
3   Total New Jersey    WHO          0     50
4    Male    Vermont    CDC          0     15
5  Female    Vermont    CDC          0     16
6   Total    Vermont    CDC          1     31
7    Male Washington     UN          0     20
8  Female Washington    CDC          0     27
9    Male    Montana     UN          0     15
10 Female    Montana     UN          1     18
11  Total    Montana     UN          0     33
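The eligibility rule in the question (a single source per state, and exactly one of the three categories missing) can be checked per state before any imputation is attempted. The following is a minimal sketch, assuming each state has at most one row per "sex" category; the names eligibility, n_sources, missing, and can_impute are illustrative and not part of the original post.

```r
library(dplyr)

# Same data as the reprex in the question
df <- data.frame(
  sex    = c("Male", "Female", "Total", "Male", "Female", "Male", "Female", "Male", "Total"),
  state  = c("New Jersey", "New Jersey", "New Jersey", "Vermont", "Vermont",
             "Washington", "Washington", "Montana", "Montana"),
  source = c("WHO", "WHO", "WHO", "CDC", "CDC", "UN", "CDC", "UN", "UN"),
  aggregated = 0,
  births = c(20, 30, 50, 15, 16, 20, 27, 15, 33)
)

# One row per state: how many sources it has, which category is absent,
# and whether the rule allows imputing it
eligibility <- df %>%
  group_by(state) %>%
  summarise(
    n_sources  = n_distinct(source),
    missing    = paste(setdiff(c("Male", "Female", "Total"), sex), collapse = ", "),
    # Assumption: no duplicated categories, so 2 rows means exactly one is missing
    can_impute = n_sources == 1 && n() == 2,
    .groups = "drop"
  )

eligibility
```

Under that assumption, only Vermont and Montana come out as eligible, which agrees with the explanation above.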
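Update 03
Now I can finally rest easy!
I know this is nothing compared with the two brilliant solutions proposed by dear @akrun, but I couldn't leave a solution here that did not produce the desired output. So I made some modifications, with the result below. I also extended the code to cover the case where the "Male" value in the "births" column is missing.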
library(dplyr)
library(tidyr)
library(purrr)  # map_dfr() comes from purrr

df %>%
  # Widen, then lengthen again, so every state/source pair gets
  # Male, Female and Total rows (NA where the category was absent)
  pivot_wider(names_from = sex, values_from = births) %>%
  pivot_longer(Male:Total, names_to = "sex", values_to = "births") %>%
  group_split(state, source) %>%
  # With more than one category missing nothing can be imputed: keep observed rows only
  map_dfr(~ if (sum(is.na(.x$births)) > 1) drop_na(.x) else .x) %>%
  group_by(state, source) %>%
  mutate(aggregated = ifelse(is.na(births), 1, 0),
         births = ifelse(sex == "Female" & is.na(births),
                         births[sex == "Total"] - births[sex == "Male"],
                  ifelse(sex == "Total" & is.na(births),
                         births[sex == "Female"] + births[sex == "Male"],
                  ifelse(sex == "Male" & is.na(births),
                         births[sex == "Total"] - births[sex == "Female"],
                         births)))) %>%
  relocate(state, source, sex)
# A tibble: 11 x 5
# Groups:   state, source [5]
   state      source sex    aggregated births
   <chr>      <chr>  <chr>       <dbl>  <dbl>
 1 Montana    UN     Male            0     15
 2 Montana    UN     Female          1     18
 3 Montana    UN     Total           0     33
 4 New Jersey WHO    Male            0     20
 5 New Jersey WHO    Female          0     30
 6 New Jersey WHO    Total           0     50
 7 Vermont    CDC    Male            0     15
 8 Vermont    CDC    Female          0     16
 9 Vermont    CDC    Total           1     31
10 Washington CDC    Female          0     27
11 Washington UN     Male            0     20
Updated
Thanks to a brilliant solution from my dear teacher/friend @akrun, the problem with the "aggregated" column has been solved:
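library(dplyr)
library(tibble)
library(purrr)  # map_dfr() comes from purrr

df %>%
  group_split(state, source) %>%
  map_dfr(~ if (all(c('Male', 'Female') %in% .x$sex) && !'Total' %in% .x$sex) {
    # Total is missing: add it as Male + Female
    add_row(.x, sex = 'Total', state = first(.x$state), source = first(.x$source),
            aggregated = 1, births = sum(.x$births))
  } else if (all(c('Male', 'Total') %in% .x$sex) && !'Female' %in% .x$sex) {
    # Female is missing. Note that sum() yields Male + Total here (48 for Montana),
    # not the desired Total - Male (18); Update 03 produces the desired value.
    add_row(.x, sex = 'Female', state = first(.x$state), source = first(.x$source),
            aggregated = 1, births = sum(.x$births))
  } else .x)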
# A tibble: 11 x 5
   sex    state      source aggregated births
   <chr>  <chr>      <chr>       <dbl>  <dbl>
 1 Male   Montana    UN              0     15
 2 Total  Montana    UN              0     33
 3 Female Montana    UN              1     48
 4 Male   New Jersey WHO             0     20
 5 Female New Jersey WHO             0     30
 6 Total  New Jersey WHO             0     50
 7 Male   Vermont    CDC             0     15
 8 Female Vermont    CDC             0     16
 9 Total  Vermont    CDC             1     31
10 Female Washington CDC             0     27
11 Male   Washington UN              0     20
Update 02
Another very nice solution from dear @akrun:
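library(dplyr)
library(tidyr)  # complete() comes from tidyr

df %>%
  group_by(state, source) %>%
  # Add the missing sex categories per group (new rows have NA in aggregated and births)
  complete(sex = unique(df$sex)) %>%
  arrange(state, source, factor(sex, levels = c('Male', 'Female', 'Total'))) %>%
  # Groups with more than one added row cannot be imputed: keep only their observed rows
  filter(sum(is.na(aggregated)) > 1 & !is.na(aggregated) | sum(is.na(aggregated)) <= 1) %>%
  mutate(aggregated = replace(aggregated, is.na(aggregated), 1),
         births = case_when(
           # Last row is Total: Male + Female
           is.na(births) & row_number() == n() ~ sum(births, na.rm = TRUE),
           # Otherwise: Total minus the one observed category
           is.na(births) ~ last(births) - na.omit(births)[1],
           TRUE ~ births))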
# A tibble: 11 x 5
# Groups:   state, source [5]
   state      source sex    aggregated births
   <chr>      <chr>  <chr>       <dbl>  <dbl>
 1 Montana    UN     Male            0     15
 2 Montana    UN     Female          1     18
 3 Montana    UN     Total           0     33
 4 New Jersey WHO    Male            0     20
 5 New Jersey WHO    Female          0     30
 6 New Jersey WHO    Total           0     50
 7 Vermont    CDC    Male            0     15
 8 Vermont    CDC    Female          0     16
 9 Vermont    CDC    Total           1     31
10 Washington CDC    Female          0     27
11 Washington UN     Male            0     20
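The solutions above group by both state and source. That gives the desired result for this data, but it could also impute a value for a state whose rows come from more than one source (say, Male and Female from one source and Total from another), which the question rules out. As a hedged alternative, the following sketch groups by state alone and checks the single-source condition explicitly. It assumes each state has at most one row per category, and the names cats, imputed, missing, and result are illustrative rather than taken from the answers.

```r
library(dplyr)

# Same data as the reprex in the question
df <- data.frame(
  sex    = c("Male", "Female", "Total", "Male", "Female", "Male", "Female", "Male", "Total"),
  state  = c("New Jersey", "New Jersey", "New Jersey", "Vermont", "Vermont",
             "Washington", "Washington", "Montana", "Montana"),
  source = c("WHO", "WHO", "WHO", "CDC", "CDC", "UN", "CDC", "UN", "UN"),
  aggregated = 0,
  births = c(20, 30, 50, 15, 16, 20, 27, 15, 33)
)

cats <- c("Male", "Female", "Total")

# Build only the rows that may be imputed
imputed <- df %>%
  group_by(state) %>%
  # Keep states with a single source and exactly one category missing
  filter(n_distinct(source) == 1, n() == 2) %>%
  summarise(
    source  = first(source),
    missing = setdiff(cats, sex),
    # births and sex still refer to the two observed rows at this point
    births  = if (missing == "Total") sum(births)
              else births[sex == "Total"] - births[sex != "Total"],
    aggregated = 1,
    .groups = "drop"
  ) %>%
  rename(sex = missing)

# Append the imputed rows and restore the original state and category order
result <- bind_rows(df, imputed) %>%
  arrange(match(state, unique(df$state)), match(sex, cats))

result
```

Because the imputed rows are built separately and then appended, the original rows are never modified, and the "aggregated" flag is set only on rows this step created.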