How to create a variable with other dataset variables as its levels
I have a dataset in which several variables are dichotomized as yes/no.
> df[1:20,]
# A tibble: 20 × 2
black white
<fct> <fct>
1 No Yes
2 No Yes
3 No Yes
4 No Yes
5 No Yes
6 No Yes
7 No Yes
8 No Yes
9 No Yes
10 No Yes
11 No Yes
12 No Yes
13 No Yes
14 No Yes
15 No Yes
16 Yes No
17 No Yes
18 No Yes
19 No Yes
20 Yes No
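This produces a lot of variables (my real data has several race options) and does not look very tidy, since it means carrying many unnecessary variables.
I would like to create a new variable (e.g. 'race') in which the current variables 'black', 'white', etc. are the levels.
Like this example: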
> df2[1:20,]
# A tibble: 20 × 1
race
<fct>
1 White
2 White
3 White
4 White
5 White
6 White
7 White
8 White
9 White
10 White
11 White
12 White
13 White
14 White
15 White
16 Black
17 White
18 White
19 White
20 Black
How can I do this?
To account for multiple races, use apply over the rows (MARGIN = 1) and paste together, with toString, the names of the columns that contain "Yes":
df <- structure(list(asian = c("No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes"), black = c("No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "Yes"), white = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No")), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
data.frame(race = apply(df == "Yes", 1, \(x) toString(colnames(df)[which(x)])))
race
1 white
2 white
3 white
4 white
5 white
6 white
7 white
8 white
9 white
10 white
11 white
12 white
13 white
14 white
15 white
16 black
17 white
18 white
19 white
20 asian, black
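A row with no "Yes" in any column is not covered by the example above. The following is a minimal sketch of one way to handle it, assuming such rows should become NA and that the result should be stored as a factor, as in the question. The toy data frame is hypothetical and is not the asker's data.

```r
# Toy data (hypothetical): the third person has no "Yes" in any column.
df <- data.frame(
  asian = c("No", "No", "No", "Yes"),
  black = c("No", "Yes", "No", "Yes"),
  white = c("Yes", "No", "No", "No")
)

# Same idea as above: collect the names of the columns that hold "Yes".
race <- apply(df == "Yes", 1, \(x) toString(colnames(df)[which(x)]))

# toString() on an empty selection returns "", so mark those rows as missing.
race[race == ""] <- NA

# Store the result as a factor, as requested in the question.
result <- data.frame(race = factor(race))
print(result)
```

Combined values such as "asian, black" become levels of their own here; whether that is the right coding for multiracial respondents depends on the analysis.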
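Using max.col (this only works when each person has exactly one "Yes"; if a row has several, max.col picks one of them at random by default):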
data.frame(race = colnames(df)[max.col(df == "Yes")])
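Since max.col breaks ties at random by default, it may help to check the one-value-per-person assumption before relying on it. The sketch below does that, sets ties.method = "first" so the result is reproducible, and returns a factor with capitalized levels like the desired output in the question. The toy data and the capitalize() helper are hypothetical.

```r
# Toy data (hypothetical), with exactly one "Yes" per row.
df <- data.frame(
  black = c("No", "Yes", "No"),
  white = c("Yes", "No", "Yes")
)

is_yes <- df == "Yes"

# Guard: max.col() is only meaningful here if every row has exactly one "Yes".
stopifnot(all(rowSums(is_yes) == 1))

# ties.method = "first" makes the result deterministic (the default is "random").
idx <- max.col(is_yes, ties.method = "first")

# capitalize() is a hypothetical helper that upper-cases the first letter.
capitalize <- function(x) paste0(toupper(substring(x, 1, 1)), substring(x, 2))

# Build a factor whose levels follow the column order, with capitalized labels.
race <- factor(colnames(df)[idx],
               levels = colnames(df),
               labels = capitalize(colnames(df)))
print(data.frame(race))
```

If the guard fails, the data contain people with zero or several "Yes" values, and the apply approach is probably the safer choice.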
Using dplyr (assuming that a person can belong to only one race in your dataset):
library(dplyr)
dat <- data.frame(id = 1:2,
                  black = c("No", "Yes"),
                  white = c("Yes", "No"))
dat |> mutate(
  race = case_when(black == "Yes" ~ "black",
                   white == "Yes" ~ "white")
)
Output:
#> id black white race
#> 1 1 No Yes white
#> 2 2 Yes No black
Here is a solution that works for the multiracial case.
library(tidyverse)
# Sample data with multiracial case
df <- structure(list(respondent = 1:20, asian = c("No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes"), black = c("No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "Yes"), white = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No")), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
df %>%
  select(asian:white) %>%
  `==`("Yes") %>%
  apply(1,
        \(.row) colnames(.)[.row] %>%
          str_c(collapse = "-"))
#> [1] "white" "white" "white" "white" "white"
#> [6] "white" "white" "white" "white" "white"
#> [11] "white" "white" "white" "white" "white"
#> [16] "black" "white" "white" "white" "asian-black"
Created on 2022-04-04 by the reprex package (v2.0.1)