Creating a dummy quantile variable from a continuous variable
Here is the data I am working with:
library(RCurl)  # provides getURL()
x <- getURL("https://raw.githubusercontent.com/dothemathonthatone/maps/master/testmain.csv")
data <- read.csv(text = x)
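I would like to create dummy variables for the top, middle, and bottom thirds of the values in year_hh_inc. Each value of my id column, reg_schl, can have multiple values of year_hh_inc, so the dummy variables need to be computed per reg_schl group.
In other words, I want to be able to tell the values of year_hh_inc apart within each unique reg_schl.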
So far I have the following, which is posted below as Sotos's solution:
data %>%
  group_by(reg_schl) %>%
  mutate(category = cut(year_hh_inc,
                        breaks = quantile(year_hh_inc, c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE),
                        labels = c("low", "middle", "high"),
                        include.lowest = TRUE),
         vals = 1) %>%
  pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))
This works well.
I have also used this solution provided by Allan:
cut_by_id <- function(x)
{
  x$category <- cut(x$year_hh_inc, quantile(x$year_hh_inc, c(0,1/3,2/3,1), na.rm = TRUE),
                    labels = c("low","middle","high"), include.lowest = TRUE)
  return(x)
}
# split on reg_schl, the id column in this data set (the answer's example data calls it id)
data <- do.call(rbind, lapply(split(data, data$reg_schl), cut_by_id))
You can use the split-lapply-rbind paradigm:
cut_by_id <- function(x)
{
  x$category <- cut(x$inc, quantile(x$inc, c(0,1/3,2/3,1), na.rm = TRUE),
                    labels = c("low","middle","high"), include.lowest = TRUE)
  return(x)
}
data <- do.call(rbind, lapply(split(data, data$id), cut_by_id))
data
#> id inc fee fert fee_per_inc category
#> 1.1 1 11000 125 0.15 0.011363636 low
#> 1.2 1 15000 150 0.12 0.010000000 low
#> 1.3 1 17000 175 0.22 0.010294118 middle
#> 1.4 1 19000 200 0.13 0.010526316 high
#> 1.5 1 21000 225 0.12 0.010714286 high
#> 2.6 2 13000 55 0.11 0.004230769 low
#> 2.7 2 16000 75 0.09 0.004687500 low
#> 2.8 2 19000 85 0.23 0.004473684 middle
#> 2.9 2 21000 95 0.05 0.004523810 high
#> 2.10 2 25000 105 0.01 0.004200000 high
#> 3.11 3 18000 75 0.25 0.004166667 low
#> 3.12 3 21000 85 0.03 0.004047619 low
#> 3.13 3 23000 95 0.05 0.004130435 middle
#> 3.14 3 27000 105 0.15 0.003888889 high
#> 3.15 3 30000 115 0.25 0.003833333 high
box <- boxplot(data$inc ~ data$category, col = 3:5)
Created on 2020-02-26 by the reprex package (v0.3.0)
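The answer does not show how its example data was built, and it stops at a single factor column rather than the 0/1 dummy columns the question asks for. The following is a minimal sketch of both steps. It assumes the example values can be read straight off the printed output, and it uses base R's model.matrix for the dummy columns, which is a suggestion here and not part of the original answer.

```r
# Example data reconstructed from the printed output (an assumption:
# the original construction is not shown in the thread).
data <- data.frame(
  id   = rep(1:3, each = 5),
  inc  = c(11000, 15000, 17000, 19000, 21000,
           13000, 16000, 19000, 21000, 25000,
           18000, 21000, 23000, 27000, 30000),
  fee  = c(125, 150, 175, 200, 225,
           55, 75, 85, 95, 105,
           75, 85, 95, 105, 115),
  fert = c(0.15, 0.12, 0.22, 0.13, 0.12,
           0.11, 0.09, 0.23, 0.05, 0.01,
           0.25, 0.03, 0.05, 0.15, 0.25)
)
data$fee_per_inc <- data$fee / data$inc

# Tercile factor computed within one id group
cut_by_id <- function(x) {
  x$category <- cut(x$inc, quantile(x$inc, c(0, 1/3, 2/3, 1), na.rm = TRUE),
                    labels = c("low", "middle", "high"), include.lowest = TRUE)
  x
}
data <- do.call(rbind, lapply(split(data, data$id), cut_by_id))

# One 0/1 column per factor level; "- 1" drops the intercept so that
# every level gets its own column.
dummies <- model.matrix(~ category - 1, data = data)
colnames(dummies) <- levels(data$category)
data <- cbind(data, dummies)
```

Note that model.matrix drops rows whose category is NA by default, so on data with missing incomes the cbind step would need the rows aligned first.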
We can create your factor variable based on the quantiles and then spread the values into columns, i.e.
library(dplyr)
library(tidyr)
data %>%
  group_by(id) %>%
  mutate(category = cut(inc,
                        breaks = quantile(inc, c(0, 1 / 3, 2 / 3, 1), na.rm = TRUE),
                        labels = c("low", "middle", "high"),
                        include.lowest = TRUE),
         vals = 1) %>%
  pivot_wider(names_from = category, values_from = vals, values_fill = list(vals = 0))
which gives,
# A tibble: 15 x 8
# Groups: id [3]
id inc fee fert fee_per_inc low middle high
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 11000 125 0.15 0.0114 1 0 0
2 1 15000 150 0.12 0.01 1 0 0
3 1 17000 175 0.22 0.0103 0 1 0
4 1 19000 200 0.13 0.0105 0 0 1
5 1 21000 225 0.12 0.0107 0 0 1
6 2 13000 55 0.11 0.00423 1 0 0
7 2 16000 75 0.09 0.00469 1 0 0
8 2 19000 85 0.23 0.00447 0 1 0
9 2 21000 95 0.05 0.00452 0 0 1
10 2 25000 105 0.01 0.0042 0 0 1
11 3 18000 75 0.25 0.00417 1 0 0
12 3 21000 85 0.03 0.00405 1 0 0
13 3 23000 95 0.05 0.00413 0 1 0
14 3 27000 105 0.15 0.00389 0 0 1
15 3 30000 115 0.25 0.00383 0 0 1
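NOTE: I added the argument include.lowest = TRUE to cut so that the lowest value in each group is captured by the first label (low). Without it, the intervals are open on the left, so the group minimum, which equals the first break, falls outside every interval and becomes NA.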
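To see what include.lowest changes, here is a small sketch on the incomes of the first id group from the output above. The variable names are illustrative only.

```r
# Incomes of the first group in the example data
x <- c(11000, 15000, 17000, 19000, 21000)
breaks <- quantile(x, c(0, 1/3, 2/3, 1))
labs <- c("low", "middle", "high")

# Default: intervals are (a, b], so the minimum (equal to the first break)
# is not inside any interval and is returned as NA.
without_lowest <- cut(x, breaks, labels = labs)

# include.lowest = TRUE closes the first interval on the left: [a, b]
with_lowest <- cut(x, breaks, labels = labs, include.lowest = TRUE)

print(without_lowest)  # first element is NA
print(with_lowest)     # low low middle high high
```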