How to create a variable with the quantiles of another one in R?
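I'm trying to use the dplyr function mutate to create a variable that indicates which quantile of another variable each observation falls into.
For example: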
# 1. Fake data:
library(dplyr)
data <- data.frame(
  "id" = 1:20,
  "score" = round(rnorm(20, 30, 20)))
# 2. Creating variable 'Quantile_5'
data <- data %>%
  mutate(Quntile_5 = ????)
So far, I've written a function that works out the quintile and returns it as a factor, and it does work:
# 3. Create a function:
quantile5 <- function(x){
  # Compare each value against the 20%, 40%, 60% and 80% quantiles of x
  x = ifelse(
    x < quantile(x,0.2),1,
    ifelse(x >= quantile(x,0.2) & x < quantile(x,0.4),2,
      ifelse(x >= quantile(x,0.4) & x < quantile(x,0.6),3,
        ifelse(x >= quantile(x,0.6) & x < quantile(x,0.8),4,5
  ))))
  return(as.factor(x))
}
# 4. Running the code:
data <- data %>%
  mutate(Quntile_5 = quantile5(score))
# 5. Result:
data
id score Quntile_5
1 1 55 5
2 2 56 5
3 3 26 3
4 4 42 3
5 5 41 3
6 6 26 3
7 7 57 5
8 8 12 1
9 9 21 2
10 10 25 2
11 11 37 3
12 12 18 2
13 13 54 5
14 14 47 4
15 15 52 4
16 16 -4 1
17 17 53 4
18 18 51 4
19 19 -7 1
20 20 -2 1
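However, if I want to create a variable "Quantile_100" as a factor indicating where each observation falls on a scale from 1 to 100 (in the context of a larger dataset), this is not a very good solution. Is there an easier way to create these quantile variables?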
I hope this is what you are looking for:
library(dplyr)
data <- data.frame(
  "id" = 1:20,
  "score" = round(rnorm(20, 30, 20)))
data %>%
  mutate(quantile100 = findInterval(score, quantile(score, probs = seq(0, 1, 0.01)),
                                    rightmost.closed = TRUE)) %>%
  slice_head(n = 10)
id score quantile100
1 1 59 95
2 2 47 90
3 3 83 100
4 4 33 53
5 5 7 11
6 6 26 43
7 7 16 16
8 8 18 27
9 9 33 53
10 10 47 90
I set rightmost.closed = TRUE to close the rightmost bin, so that the largest category does not exceed 100.
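To see why the rightmost bin needs closing, here is a minimal base-R sketch using the scores from the question's example and quintile breaks. The variable names (breaks, open_bin, closed_bin) are only illustrative.

```r
# Scores taken from the question's example output
score <- c(55, 56, 26, 42, 41, 26, 57, 12, 21, 25,
           37, 18, 54, 47, 52, -4, 53, 51, -7, -2)

# Six break points (0%, 20%, ..., 100%) delimit five bins
breaks <- quantile(score, probs = seq(0, 1, 0.2))

# Default: intervals are [a, b), so the maximum falls past the last break
open_bin <- findInterval(score, breaks)

# rightmost.closed = TRUE makes the last interval [a, b]
closed_bin <- findInterval(score, breaks, rightmost.closed = TRUE)

max(open_bin)    # 6: the maximum score lands in a spurious sixth bin
max(closed_bin)  # 5
```

Only the observation equal to the maximum changes bin; every other value is assigned identically by both calls.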
We can also verify this with your own example; the result is the same:
df %>%
  mutate(quantile5 = findInterval(score, quantile(score, probs = seq(0, 1, 0.2)),
                                  rightmost.closed = TRUE)) %>%
  slice_head(n = 10)
id score quantile5
1 1 55 5
2 2 56 5
3 3 26 3
4 4 42 3
5 5 41 3
6 6 26 3
7 7 57 5
8 8 12 1
9 9 21 2
10 10 25 2
Data
df <- structure(list(id = 1:20, score = c(55L, 56L, 26L, 42L, 41L,
26L, 57L, 12L, 21L, 25L, 37L, 18L, 54L, 47L, 52L, -4L, 53L, 51L,
-7L, -2L)), class = "data.frame", row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"))
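If the same binning is needed for several group counts (5, 10, 100, ...), the findInterval call can be wrapped in a small helper. The sketch below is only one possible way to do it; the function name quantile_group and its arguments are hypothetical, not part of dplyr or base R.

```r
# Hypothetical helper: assign each value of x to one of n quantile groups
quantile_group <- function(x, n) {
  # n + 1 break points at 0, 1/n, 2/n, ..., 1
  breaks <- quantile(x, probs = seq(0, 1, by = 1 / n), na.rm = TRUE)
  # Close the last interval so the maximum falls in group n, not n + 1
  findInterval(x, breaks, rightmost.closed = TRUE)
}

# Scores from the question's example
score <- c(55, 56, 26, 42, 41, 26, 57, 12, 21, 25,
           37, 18, 54, 47, 52, -4, 53, 51, -7, -2)

q5   <- quantile_group(score, 5)    # quintiles, as in the question
q100 <- quantile_group(score, 100)  # percentiles
```

With these scores, q5 should reproduce the Quntile_5 column from the question. Inside a pipeline it would be used as mutate(quantile100 = quantile_group(score, 100)); wrap the result in as.factor() if a factor is required.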
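Here are two options using cut:
1. Equal-width bins (this splits the range of score into 100 intervals of equal width, not into quantiles):
library(dplyr)
data %>% mutate(quantile100 = cut(score, 100, labels = FALSE))
2. Quantile-based bins, similar to @Anoushiravan R's findInterval approach:
# include.lowest = TRUE keeps the minimum score from becoming NA
data %>%
  mutate(quantile100 = cut(score, unique(quantile(score, seq(0, 1, 0.01))),
                           labels = FALSE, include.lowest = TRUE))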
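The two cut calls above do not bin the same way: a single number as breaks gives intervals of equal width, whereas quantile break points give bins holding roughly equal numbers of observations. A small base-R sketch on a made-up, skewed vector (x is illustrative only) shows the difference:

```r
# Made-up skewed data: one large outlier
x <- c(1, 2, 3, 4, 100)

# Equal-width: the range 1..100 is cut at its midpoint (50.5)
equal_width <- cut(x, 2, labels = FALSE)

# Quantile-based: the cut point is the median (3);
# include.lowest = TRUE keeps the minimum from becoming NA
by_quantile <- cut(x, quantile(x, probs = c(0, 0.5, 1)),
                   labels = FALSE, include.lowest = TRUE)

equal_width  # 1 1 1 1 2
by_quantile  # 1 1 1 2 2
```

Note that cut uses right-closed intervals (a, b] by default, while findInterval uses left-closed intervals [a, b), so values lying exactly on a break can land in different bins under the two approaches.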