Generate internal age- and sex-specific z-scores
I have the following data frame, which contains sex, three repeated height measurements, and the age at each measurement for 1,000 people.
data <- data.frame(
  child_id = 1:1000,
  sex      = rbinom(n = 1000, size = 1, prob = 0.5),
  height_5 = rnorm(1000, mean = 80, sd = 5),
  height_6 = rnorm(1000, mean = 90, sd = 5),
  height_7 = rnorm(1000, mean = 100, sd = 5),
  age_5    = rnorm(1000, mean = 5.2, sd = 1.5),
  age_6    = rnorm(1000, mean = 6.1, sd = 1.5),
  age_7    = rnorm(1000, mean = 7.3, sd = 1.5)
)
data$sex <- factor(data$sex,
                   levels = c(0, 1),
                   labels = c("Male", "Female"))

### Generate some missing values -----
data$height_5[which(data$height_5 %in% sample(data$height_5, 25))] <- NA
data$height_6[which(data$height_6 %in% sample(data$height_6, 25))] <- NA
data$height_7[which(data$height_7 %in% sample(data$height_7, 25))] <- NA
I can generate z-scores at each measurement occasion as follows:
data$ht5z <- scale(data$height_5, center = TRUE, scale = TRUE)
data$ht6z <- scale(data$height_6, center = TRUE, scale = TRUE)
data$ht7z <- scale(data$height_7, center = TRUE, scale = TRUE)
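One detail that matters later: scale() returns a one-column matrix with attributes, not a plain vector. A minimal sketch, using a made-up vector x (not part of the original data), of how its result relates to the manual z-score formula and how to strip the matrix structure:

```r
# Hypothetical toy vector with one missing value
x <- c(80, 85, 90, NA, 95)

z_mat <- scale(x)            # one-column matrix carrying "scaled:center" / "scaled:scale" attributes
z_vec <- as.numeric(z_mat)   # plain numeric vector; the NA stays in place

# Manual formula, ignoring the NA when computing the mean and sd
z_manual <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

is.matrix(z_mat)             # check the return type
all.equal(z_vec, z_manual)   # compare against the manual calculation
```

Wrapping the call in as.numeric() keeps columns such as ht5z as ordinary numeric vectors, which tends to be easier to work with in later reshaping steps.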
How can I generate these for each sex and year of age, e.g. htzm3 if sex = male and age >= 3 and < 4, htzm4 if sex = male and age >= 4 and < 5, and so on?
How about this:
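library(dplyr)
library(stringr)
library(tidyr)

data %>%
  # long format: one row per child, variable and observation time
  gather(key, value, age_5, age_6, age_7, height_5, height_6, height_7) %>%
  separate(key, c("key", "obs_time"), "_") %>%
  # one row per child and observation time, with age and height columns
  spread(key, value) %>%
  mutate(whole_age = floor(age)) %>%
  group_by(sex, whole_age) %>%
  # scale() returns a one-column matrix; as.numeric() keeps htz a plain vector
  mutate(htz = as.numeric(scale(height)),
         sex_init = str_to_lower(str_extract(sex, "^.")),
         sa = paste0("htz", sex_init, whole_age)) %>%
  ungroup() %>%
  spread(sa, htz)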
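First, we want to get the data into a tidy format.
To do that, we start by gathering all of your age and height columns into two columns: key and value. key takes the original variable names as its values, value takes the values found under the corresponding variables, and the other variables are copied down as-is. The data now looks like this: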
# A tibble: 6,000 x 4
child_id sex key value
<int> <fct> <chr> <dbl>
1 1 Male age_5 5.67
2 1 Male age_6 7.02
3 1 Male age_7 8.86
4 1 Male height_5 79.2
5 1 Male height_6 95.8
6 1 Male height_7 85.0
7 2 Male age_5 3.38
8 2 Male age_6 5.06
9 2 Male age_7 5.47
10 2 Male height_5 79.2
# ... with 5,990 more rows
Second, we separate the key column into two columns, key and obs_time, using "_" as the separator. The data now looks like:
# A tibble: 6,000 x 5
child_id sex key obs_time value
<int> <fct> <chr> <chr> <dbl>
1 1 Male age 5 5.67
2 1 Male age 6 7.02
3 1 Male age 7 8.86
4 1 Male height 5 79.2
5 1 Male height 6 95.8
6 1 Male height 7 85.0
7 2 Male age 5 3.38
8 2 Male age 6 5.06
9 2 Male age 7 5.47
10 2 Male height 5 79.2
# ... with 5,990 more rows
Third, we spread the values into two variables: age and height. The data now looks like:
# A tibble: 3,000 x 5
child_id sex obs_time age height
<int> <fct> <chr> <dbl> <dbl>
1 1 Male 5 5.67 79.2
2 1 Male 6 7.02 95.8
3 1 Male 7 8.86 85.0
4 2 Male 5 3.38 79.2
5 2 Male 6 5.06 81.8
6 2 Male 7 5.47 102.
7 3 Male 5 5.04 80.4
8 3 Male 6 6.37 95.3
9 3 Male 7 7.01 97.4
10 4 Male 5 6.25 90.8
# ... with 2,990 more rows
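As an aside, the gather, separate and spread steps can probably be collapsed into a single call with pivot_longer(), assuming tidyr 1.0.0 or later is installed. This is a swapped-in alternative rather than the approach used above, and the data frame toy is made up purely for illustration:

```r
library(tidyr)

# Small made-up data frame with the same column naming pattern as `data`
toy <- data.frame(
  child_id = 1:2,
  sex      = factor(c("Male", "Female")),
  height_5 = c(79.2, 81.0),
  height_6 = c(95.8, 88.4),
  age_5    = c(5.67, 3.38),
  age_6    = c(7.02, 5.06)
)

# ".value" means the first part of each name (age / height) becomes a column,
# while the second part (5 / 6) goes into obs_time
long <- pivot_longer(
  toy,
  cols      = -c(child_id, sex),
  names_to  = c(".value", "obs_time"),
  names_sep = "_"
)
long
```

The result has one row per child and observation time, with age and height as separate columns, which is the same shape as the output shown above.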
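Fourth through seventh, we create the age category whole_age with mutate, then group by sex and whole_age so that the scaling is applied to each group separately. Within each group we then scale height, extract the first letter of sex, and build the variable name that corresponds to the new scaled value in a column called sa. After that we can drop the grouping. The data now looks like: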
# A tibble: 3,000 x 9
child_id sex obs_time age height whole_age htz sex_init sa
<int> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 Male 5 5.67 79.2 5 -0.967 m htzm5
2 1 Male 6 7.02 95.8 7 0.345 m htzm7
3 1 Male 7 8.86 85.0 8 -1.20 m htzm8
4 2 Male 5 3.38 79.2 3 -0.580 m htzm3
5 2 Male 6 5.06 81.8 5 -0.681 m htzm5
6 2 Male 7 5.47 102. 5 1.55 m htzm5
7 3 Male 5 5.04 80.4 5 -0.829 m htzm5
8 3 Male 6 6.37 95.3 6 0.455 m htzm6
9 3 Male 7 7.01 97.4 7 0.529 m htzm7
10 4 Male 5 6.25 90.8 6 -0.0366 m htzm6
# ... with 2,990 more rows
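To check that the grouped scaling behaves as described, one can summarise the z-scores per group: each sex-by-age group should end up with a mean of about 0 and a standard deviation of about 1. A minimal sketch on simulated data (toy, the seed and the age range are arbitrary choices for illustration; .groups assumes dplyr 1.0.0 or later):

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
toy <- data.frame(
  sex    = rep(c("Male", "Female"), each = 50),
  age    = runif(100, min = 5, max = 7),
  height = rnorm(100, mean = 90, sd = 5)
)

scaled <- toy %>%
  mutate(whole_age = floor(age)) %>%
  group_by(sex, whole_age) %>%
  mutate(htz = as.numeric(scale(height))) %>%  # z-score within each sex/age group
  ungroup()

# Per-group summary of the z-scores
check <- scaled %>%
  group_by(sex, whole_age) %>%
  summarise(n = n(), mean_z = mean(htz), sd_z = sd(htz), .groups = "drop")
check
```

Groups containing a single observation cannot be standardised, because the standard deviation of one value is undefined, so sparse age bands at the extremes deserve a look before the z-scores are used.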
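Finally, we can spread the data into the variables you asked for. Each htz value lands in the column named by sa, and every other htz column is NA for that row. We now have: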
# A tibble: 3,000 x 32
child_id sex obs_time age height whole_age sex_init htzf0 htzf1 htzf10 htzf11 htzf2 htzf3
<int> <fct> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Male 5 5.67 79.2 5 m NA NA NA NA NA NA
2 1 Male 6 7.02 95.8 7 m NA NA NA NA NA NA
3 1 Male 7 8.86 85.0 8 m NA NA NA NA NA NA
4 2 Male 5 3.38 79.2 3 m NA NA NA NA NA NA
5 2 Male 6 5.06 81.8 5 m NA NA NA NA NA NA
6 2 Male 7 5.47 102. 5 m NA NA NA NA NA NA
7 3 Male 5 5.04 80.4 5 m NA NA NA NA NA NA
8 3 Male 6 6.37 95.3 6 m NA NA NA NA NA NA
9 3 Male 7 7.01 97.4 7 m NA NA NA NA NA NA
10 4 Male 5 6.25 90.8 6 m NA NA NA NA NA NA
# ... with 2,990 more rows, and 19 more variables: htzf4 <dbl>, htzf5 <dbl>, htzf6 <dbl>,
# htzf7 <dbl>, htzf8 <dbl>, htzf9 <dbl>, htzm0 <dbl>, htzm1 <dbl>, htzm10 <dbl>, htzm11 <dbl>,
# htzm12 <dbl>, htzm2 <dbl>, htzm3 <dbl>, htzm4 <dbl>, htzm5 <dbl>, htzm6 <dbl>, htzm7 <dbl>,
# htzm8 <dbl>, htzm9 <dbl>