如何创建两列来计算两个条件的总数
How to create two columns that count the total number of two conditions
我有一个糖尿病数据集,其中有一列名为结果,只有两个值,1 = 糖尿病,0 = 非糖尿病。我想根据年龄计算 1 和 0 的总数,然后根据年龄计算 1 的百分比。
我有以下代码:
by_age1 <-
diabetes.df %>%
select(Age, Outcome) %>%
group_by(Age,Outcome) %>%
summarize(Diabetes_Count = n()) %>%
filter(Outcome=="1"| Outcome == "0")
此代码生成此 table
Age | Outcome | Count
21 0 58
21 1 5
以此类推
不过我希望 table 看起来像这样
Age | Count_Outcome=1 | Count_Outcome=0
21 5 58
22 11 61
所以我最终可以做到这一点
Age | Count_Outcome=1 | Count_Outcome=0 | Count_Outcome=1/Count_Outcome=0
21 5 58 0.086
22 11 61 0.180
这是数据集
Rows: 768
Columns: 23
$ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1, 3, 8, 7, 9, 11, 10, 7, 1, 13, 5, 5, 3, ...
$ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139, 189, 166, 100, 118, 107, 103, 115, 126, ...
$ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0, 84, 74, 30, 70, 88, 84, 90, 80, 94, 70, ...
$ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0, 38, 30, 41, 0, 0, 35, 33, 26, 0, 15, 19...
$ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230, 0, 83, 96, 235, 0, 0, 0, 146, 115, 0, 1...
$ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37.6, 38.0, 27.1, 30.1, 25.8, 30.0, 45.8, 2...
$ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158, 0.232, 0.191, 0.537, 1.441, 0.398, 0.58...
$ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 32, 31, 31, 33, 32, 27, 50, 41, 29, 51, 41...
$ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, ...
$ Skin.log <dbl> 3.555634, 3.367641, -4.605170, 3.135929, 3.555634, -4.605170, 3.466048, -4.605170, 3.806885, -4.605170...
$ Insulin.log <dbl> -2.302585, -2.302585, -2.302585, 4.544358, 5.124559, -2.302585, 4.478473, -2.302585, 6.297293, -2.3025...
$ DPF.log <dbl> -0.46680874, -1.04696906, -0.39749694, -1.78976147, 0.82767807, -1.60445037, -1.39432653, -2.00991548,...
$ Preg.log <dbl> 1.793424749, 0.009950331, 2.080690761, 0.009950331, -4.605170186, 1.611435915, 1.101940079, 2.30358459...
$ Age.log <dbl> 3.912023, 3.433987, 3.465736, 3.044522, 3.496508, 3.401197, 3.258097, 3.367296, 3.970292, 3.988984, 3....
$ G <dbl> 0.84777132, -1.12266474, 1.94245802, -0.99755769, 0.50372693, -0.15308509, -1.34160209, -0.18436186, 2...
$ BP <dbl> 0.14954330, -0.16044119, -0.26376935, -0.16044119, -1.50370731, 0.25287146, -0.98706650, -3.57027057, ...
$ S <dbl> 0.7143403, 0.6624894, -1.5365134, 0.5985804, 0.7143403, -1.5365134, 0.6896315, -1.5365134, 0.7836385, ...
$ I <dbl> -1.0157459, -1.0157459, -1.0157459, 0.8904827, 1.0520140, -1.0157459, 0.8721398, -1.0157459, 1.3785101...
$ D <dbl> 0.76534970, -0.13507072, 0.87292300, -1.28789940, 2.77441913, -1.00029287, -0.67417647, -1.62958283, -...
$ BM <dbl> 0.20387991, -0.68397621, -1.10253696, -0.49372133, 1.40882750, -0.81081280, -0.12589522, 0.41950211, -...
$ P <dbl> 0.6504082, -0.1684863, 0.7823084, -0.1684863, -2.2875506, 0.5668468, 0.3329083, 0.8846516, 0.1474983, ...
$ A <dbl> 1.43544387, -0.04590939, 0.05247453, -1.25279578, 0.14783077, -0.14751959, -0.59096525, -0.25257485, 1...
$ Segment <int> 4, 3, 2, 3, 5, 2, 3, 1, 4, 2, 2, 2, 2, 4, 4, 1, 5, 2, 3, 3, 4, 2, 2, 3, 4, 4, 2, 3, 4, 2, 4, 4, 3, 2, ...
``
随机数据:
r <- function(x) {rnorm(x, 50, 2)}
set.seed(123)
diabetes.df <- data.frame(Age = round(r(10)), Outcome = as.character((r(10) < 50)*1))
> diabetes.df
Age Outcome
1 49 0
2 50 0
3 53 0
4 50 0
5 50 1
6 53 0
7 51 0
8 47 1
9 49 0
10 49 1
然后pivot_wider()
会做你想做的事:
df <- diabetes.df %>%
select(Age, Outcome) %>%
group_by(Age,Outcome) %>%
dplyr::summarize(Diabetes_Count = n()) %>%
filter(Outcome=="1"| Outcome == "0")
df = pivot_wider(df, names_from = c("Outcome"), values_from = "Diabetes_Count", names_prefix = "Outcome_", values_fill = 0)
> df
# A tibble: 5 x 3
# Groups: Age [5]
Age Outcome_1 Outcome_0
<dbl> <int> <int>
1 47 1 0
2 49 1 2
3 50 1 2
4 51 0 1
5 53 0 2
> df %>% mutate(`Outcome_1/Outcome_0` = Outcome_1 / Outcome_0)
# A tibble: 5 x 4
# Groups: Age [5]
Age Outcome_1 Outcome_0 `Outcome_1/Outcome_0`
<dbl> <int> <int> <dbl>
1 47 1 0 Inf
2 49 1 2 0.5
3 50 1 2 0.5
4 51 0 1 0
5 53 0 2 0
我有一个糖尿病数据集,其中有一列名为结果,只有两个值,1 = 糖尿病,0 = 非糖尿病。我想根据年龄计算 1 和 0 的总数,然后根据年龄计算 1 的百分比。
我有以下代码:
by_age1 <-
diabetes.df %>%
select(Age, Outcome) %>%
group_by(Age,Outcome) %>%
summarize(Diabetes_Count = n()) %>%
filter(Outcome=="1"| Outcome == "0")
此代码生成此 table
Age | Outcome | Count
21 0 58
21 1 5
以此类推
不过我希望 table 看起来像这样
Age | Count_Outcome=1 | Count_Outcome=0
21 5 58
22 11 61
所以我最终可以做到这一点
Age | Count_Outcome=1 | Count_Outcome=0 | Count_Outcome=1/Count_Outcome=0
21 5 58 0.086
22 11 61 0.180
这是数据集
Rows: 768
Columns: 23
$ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1, 3, 8, 7, 9, 11, 10, 7, 1, 13, 5, 5, 3, ...
$ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139, 189, 166, 100, 118, 107, 103, 115, 126, ...
$ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0, 84, 74, 30, 70, 88, 84, 90, 80, 94, 70, ...
$ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0, 38, 30, 41, 0, 0, 35, 33, 26, 0, 15, 19...
$ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230, 0, 83, 96, 235, 0, 0, 0, 146, 115, 0, 1...
$ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37.6, 38.0, 27.1, 30.1, 25.8, 30.0, 45.8, 2...
$ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158, 0.232, 0.191, 0.537, 1.441, 0.398, 0.58...
$ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 32, 31, 31, 33, 32, 27, 50, 41, 29, 51, 41...
$ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, ...
$ Skin.log <dbl> 3.555634, 3.367641, -4.605170, 3.135929, 3.555634, -4.605170, 3.466048, -4.605170, 3.806885, -4.605170...
$ Insulin.log <dbl> -2.302585, -2.302585, -2.302585, 4.544358, 5.124559, -2.302585, 4.478473, -2.302585, 6.297293, -2.3025...
$ DPF.log <dbl> -0.46680874, -1.04696906, -0.39749694, -1.78976147, 0.82767807, -1.60445037, -1.39432653, -2.00991548,...
$ Preg.log <dbl> 1.793424749, 0.009950331, 2.080690761, 0.009950331, -4.605170186, 1.611435915, 1.101940079, 2.30358459...
$ Age.log <dbl> 3.912023, 3.433987, 3.465736, 3.044522, 3.496508, 3.401197, 3.258097, 3.367296, 3.970292, 3.988984, 3....
$ G <dbl> 0.84777132, -1.12266474, 1.94245802, -0.99755769, 0.50372693, -0.15308509, -1.34160209, -0.18436186, 2...
$ BP <dbl> 0.14954330, -0.16044119, -0.26376935, -0.16044119, -1.50370731, 0.25287146, -0.98706650, -3.57027057, ...
$ S <dbl> 0.7143403, 0.6624894, -1.5365134, 0.5985804, 0.7143403, -1.5365134, 0.6896315, -1.5365134, 0.7836385, ...
$ I <dbl> -1.0157459, -1.0157459, -1.0157459, 0.8904827, 1.0520140, -1.0157459, 0.8721398, -1.0157459, 1.3785101...
$ D <dbl> 0.76534970, -0.13507072, 0.87292300, -1.28789940, 2.77441913, -1.00029287, -0.67417647, -1.62958283, -...
$ BM <dbl> 0.20387991, -0.68397621, -1.10253696, -0.49372133, 1.40882750, -0.81081280, -0.12589522, 0.41950211, -...
$ P <dbl> 0.6504082, -0.1684863, 0.7823084, -0.1684863, -2.2875506, 0.5668468, 0.3329083, 0.8846516, 0.1474983, ...
$ A <dbl> 1.43544387, -0.04590939, 0.05247453, -1.25279578, 0.14783077, -0.14751959, -0.59096525, -0.25257485, 1...
$ Segment <int> 4, 3, 2, 3, 5, 2, 3, 1, 4, 2, 2, 2, 2, 4, 4, 1, 5, 2, 3, 3, 4, 2, 2, 3, 4, 4, 2, 3, 4, 2, 4, 4, 3, 2, ...
``
随机数据:
r <- function(x) {rnorm(x, 50, 2)}
set.seed(123)
diabetes.df <- data.frame(Age = round(r(10)), Outcome = as.character((r(10) < 50)*1))
> diabetes.df
Age Outcome
1 49 0
2 50 0
3 53 0
4 50 0
5 50 1
6 53 0
7 51 0
8 47 1
9 49 0
10 49 1
然后pivot_wider()
会做你想做的事:
df <- diabetes.df %>%
select(Age, Outcome) %>%
group_by(Age,Outcome) %>%
dplyr::summarize(Diabetes_Count = n()) %>%
filter(Outcome=="1"| Outcome == "0")
df = pivot_wider(df, names_from = c("Outcome"), values_from = "Diabetes_Count", names_prefix = "Outcome_", values_fill = 0)
> df
# A tibble: 5 x 3
# Groups: Age [5]
Age Outcome_1 Outcome_0
<dbl> <int> <int>
1 47 1 0
2 49 1 2
3 50 1 2
4 51 0 1
5 53 0 2
> df %>% mutate(`Outcome_1/Outcome_0` = Outcome_1 / Outcome_0)
# A tibble: 5 x 4
# Groups: Age [5]
Age Outcome_1 Outcome_0 `Outcome_1/Outcome_0`
<dbl> <int> <int> <dbl>
1 47 1 0 Inf
2 49 1 2 0.5
3 50 1 2 0.5
4 51 0 1 0
5 53 0 2 0