您如何计算另一列中超过和低于特定数字的幅度(百分比)?
How do you compute magnitudes (percentages) that are over and under a specific number in another column?
我有这个数据集
study_ID title experiment question_ID participant_ID estimate_level estimate correct_answer question type category age gender
<dbl> <chr> <dbl> <chr> <int> <chr> <dbl> <dbl> <chr> <chr> <chr> <int> <chr>
1 11 Dallacker_Parents'_co… 1 1 1 individual 3 10 How many sugar cubes does or… unlim… nutriti… 32 Female
2 11 Dallacker_Parents'_co… 1 2 1 individual 10 11.5 How many sugar cubes does a … unlim… nutriti… 32 Female
3 11 Dallacker_Parents'_co… 1 3 1 individual 7 6.5 How many sugar cubes does a … unlim… nutriti… 32 Female
4 11 Dallacker_Parents'_co… 1 4 1 individual 1 16.5 How many sugar cubes does a … unlim… nutriti… 32 Female
5 11 Dallacker_Parents'_co… 1 5 1 individual 7 11 How many sugar cubes does a … unlim… nutriti… 32 Female
6 11 Dallacker_Parents'_co… 1 6 1 individual 5 2.5 How many sugar cubes does a … unlim… nutriti… 32 Female
7 11 Dallacker_Parents'_co… 1 1 2 individual 2 10 How many sugar cubes does or… unlim… nutriti… 29 Female
8 11 Dallacker_Parents'_co… 1 2 2 individual 10 11.5 How many sugar cubes does a … unlim… nutriti… 29 Female
9 11 Dallacker_Parents'_co… 1 3 2 individual 1.5 6.5 How many sugar cubes does a … unlim… nutriti… 29 Female
10 11 Dallacker_Parents'_co… 1 4 2 individual 2 16.5 How many sugar cubes does a … unlim… nutriti… 29 Female
这个数据集中有 6 个问题,每个问题都有一个 correct_answer
列和一个 estimate
列。我试图计算每个问题的幅度,以便我得到低估或高估以及正确估计的人的百分比。
例如,对于 6 个问题中的每一个,它都会 return 是这样的:低估了 80%,高估了 10%,正确回答了 10%。
我该怎么做?我很难过。提前致谢!
这是输出
dput(head(DF, 10))
structure(list(study_ID = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5), title = c("5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd"), experiment = c(1, 1, 1, 1, 1,
1, 1, 1, 1, 1), question_ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
participant_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), estimate_level = c("individual",
"individual", "individual", "individual", "individual", "individual",
"individual", "individual", "individual", "individual"),
estimate = c(2e+07, 4500000, 21075541, 2e+07, 1e+06, 1.1e+07,
2.5e+07, 8e+06, 1.6e+07, 9800000), correct = c(3.8e+07, 3.8e+07,
3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07,
3.8e+07), question = c("What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?"),
type = c("unlimited", "unlimited", "unlimited", "unlimited",
"unlimited", "unlimited", "unlimited", "unlimited", "unlimited",
"unlimited"), category = c("demographics", "demographics",
"demographics", "demographics", "demographics", "demographics",
"demographics", "demographics", "demographics", "demographics"
), age = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA"), gender = c("NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
这是一个 dplyr
方法:
library(dplyr)
df %>%
group_by(question_ID) %>%
summarize(prop_over = mean(estimate > correct),
prop_under = mean(estimate < correct),
prop_correct = mean(estimate == correct)
)
# `summarise()` ungrouping output (override with `.groups` argument)
# # A tibble: 1 x 4
# question_ID prop_over prop_under prop_correct
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 1 0
list1 <- lapply(split(DF, DF$question_ID), function (x) {
overestimated <- 100 * length(which(x$estimate > x$correct)) / length(x$estimate)
underestimated <- 100 * length(which(x$estimate < x$correct)) / length(x$estimate)
correct <- 100 * length(which(x$estimate == x$correct)) / length(x$estimate)
data.frame(overestimated, underestimated, correct)
})
list2 <- mapply(function (x, y) {
x$question_ID <- y
return (x)
}, x = list1, y = names(list1), SIMPLIFY = F)
Percent_Data <- do.call("rbind", list2)
Percent_Data <- Percent_Data[, c(which(colnames(Percent_Data) == "question_ID"), which(colnames(Percent_Data) != "question_ID"))]
Percent_Data
# question_ID overestimated underestimated correct
# 1 1 0 100 0
我有这个数据集
study_ID title experiment question_ID participant_ID estimate_level estimate correct_answer question type category age gender
<dbl> <chr> <dbl> <chr> <int> <chr> <dbl> <dbl> <chr> <chr> <chr> <int> <chr>
1 11 Dallacker_Parents'_co… 1 1 1 individual 3 10 How many sugar cubes does or… unlim… nutriti… 32 Female
2 11 Dallacker_Parents'_co… 1 2 1 individual 10 11.5 How many sugar cubes does a … unlim… nutriti… 32 Female
3 11 Dallacker_Parents'_co… 1 3 1 individual 7 6.5 How many sugar cubes does a … unlim… nutriti… 32 Female
4 11 Dallacker_Parents'_co… 1 4 1 individual 1 16.5 How many sugar cubes does a … unlim… nutriti… 32 Female
5 11 Dallacker_Parents'_co… 1 5 1 individual 7 11 How many sugar cubes does a … unlim… nutriti… 32 Female
6 11 Dallacker_Parents'_co… 1 6 1 individual 5 2.5 How many sugar cubes does a … unlim… nutriti… 32 Female
7 11 Dallacker_Parents'_co… 1 1 2 individual 2 10 How many sugar cubes does or… unlim… nutriti… 29 Female
8 11 Dallacker_Parents'_co… 1 2 2 individual 10 11.5 How many sugar cubes does a … unlim… nutriti… 29 Female
9 11 Dallacker_Parents'_co… 1 3 2 individual 1.5 6.5 How many sugar cubes does a … unlim… nutriti… 29 Female
10 11 Dallacker_Parents'_co… 1 4 2 individual 2 16.5 How many sugar cubes does a … unlim… nutriti… 29 Female
这个数据集中有 6 个问题,每个问题都有一个 correct_answer
列和一个 estimate
列。我试图计算每个问题的幅度,以便我得到低估或高估以及正确估计的人的百分比。
例如,对于 6 个问题中的每一个,它都会 return 是这样的:低估了 80%,高估了 10%,正确回答了 10%。
我该怎么做?我很难过。提前致谢!
这是输出
dput(head(DF, 10))
structure(list(study_ID = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5), title = c("5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd", "5_Jayles_Debiasing_The_Crowd",
"5_Jayles_Debiasing_The_Crowd"), experiment = c(1, 1, 1, 1, 1,
1, 1, 1, 1, 1), question_ID = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
participant_ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), estimate_level = c("individual",
"individual", "individual", "individual", "individual", "individual",
"individual", "individual", "individual", "individual"),
estimate = c(2e+07, 4500000, 21075541, 2e+07, 1e+06, 1.1e+07,
2.5e+07, 8e+06, 1.6e+07, 9800000), correct = c(3.8e+07, 3.8e+07,
3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07, 3.8e+07,
3.8e+07), question = c("What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?",
"What is the population of Tokyo and its agglomeration?"),
type = c("unlimited", "unlimited", "unlimited", "unlimited",
"unlimited", "unlimited", "unlimited", "unlimited", "unlimited",
"unlimited"), category = c("demographics", "demographics",
"demographics", "demographics", "demographics", "demographics",
"demographics", "demographics", "demographics", "demographics"
), age = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA"), gender = c("NA", "NA", "NA", "NA", "NA", "NA",
"NA", "NA", "NA", "NA")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
这是一个 dplyr
方法:
library(dplyr)
df %>%
group_by(question_ID) %>%
summarize(prop_over = mean(estimate > correct),
prop_under = mean(estimate < correct),
prop_correct = mean(estimate == correct)
)
# `summarise()` ungrouping output (override with `.groups` argument)
# # A tibble: 1 x 4
# question_ID prop_over prop_under prop_correct
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 1 0
list1 <- lapply(split(DF, DF$question_ID), function (x) {
overestimated <- 100 * length(which(x$estimate > x$correct)) / length(x$estimate)
underestimated <- 100 * length(which(x$estimate < x$correct)) / length(x$estimate)
correct <- 100 * length(which(x$estimate == x$correct)) / length(x$estimate)
data.frame(overestimated, underestimated, correct)
})
list2 <- mapply(function (x, y) {
x$question_ID <- y
return (x)
}, x = list1, y = names(list1), SIMPLIFY = F)
Percent_Data <- do.call("rbind", list2)
Percent_Data <- Percent_Data[, c(which(colnames(Percent_Data) == "question_ID"), which(colnames(Percent_Data) != "question_ID"))]
Percent_Data
# question_ID overestimated underestimated correct
# 1 1 0 100 0