How to create two columns that count the totals for two conditions

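I have a diabetes dataset with a column named Outcome that takes only two values: 1 = diabetic, 0 = non-diabetic. I want to count the number of 1s and 0s for each age, and then compute the percentage of 1s for each age.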

I have the following code:

by_age1 <- 
  diabetes.df %>%
  select(Age, Outcome) %>%
  group_by(Age,Outcome) %>%
  summarize(Diabetes_Count = n()) %>%
  filter(Outcome=="1"| Outcome == "0")

This code produces the following table:

Age | Outcome | Count
21      0        58
21      1         5    

and so on.

However, I would like the table to look like this:

Age | Count_Outcome=1 | Count_Outcome=0
21          5                 58
22          11                61

so that I can eventually compute this:

Age | Count_Outcome=1 | Count_Outcome=0 | Count_Outcome=1/Count_Outcome=0
21          5                 58                    0.086
22          11                61                    0.180

Here is the dataset:

Rows: 768
Columns: 23
$ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1, 3, 8, 7, 9, 11, 10, 7, 1, 13, 5, 5, 3, ...
$ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139, 189, 166, 100, 118, 107, 103, 115, 126, ...
$ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0, 84, 74, 30, 70, 88, 84, 90, 80, 94, 70, ...
$ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0, 38, 30, 41, 0, 0, 35, 33, 26, 0, 15, 19...
$ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230, 0, 83, 96, 235, 0, 0, 0, 146, 115, 0, 1...
$ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37.6, 38.0, 27.1, 30.1, 25.8, 30.0, 45.8, 2...
$ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158, 0.232, 0.191, 0.537, 1.441, 0.398, 0.58...
$ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 32, 31, 31, 33, 32, 27, 50, 41, 29, 51, 41...
$ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, ...
$ Skin.log                 <dbl> 3.555634, 3.367641, -4.605170, 3.135929, 3.555634, -4.605170, 3.466048, -4.605170, 3.806885, -4.605170...
$ Insulin.log              <dbl> -2.302585, -2.302585, -2.302585, 4.544358, 5.124559, -2.302585, 4.478473, -2.302585, 6.297293, -2.3025...
$ DPF.log                  <dbl> -0.46680874, -1.04696906, -0.39749694, -1.78976147, 0.82767807, -1.60445037, -1.39432653, -2.00991548,...
$ Preg.log                 <dbl> 1.793424749, 0.009950331, 2.080690761, 0.009950331, -4.605170186, 1.611435915, 1.101940079, 2.30358459...
$ Age.log                  <dbl> 3.912023, 3.433987, 3.465736, 3.044522, 3.496508, 3.401197, 3.258097, 3.367296, 3.970292, 3.988984, 3....
$ G                        <dbl> 0.84777132, -1.12266474, 1.94245802, -0.99755769, 0.50372693, -0.15308509, -1.34160209, -0.18436186, 2...
$ BP                       <dbl> 0.14954330, -0.16044119, -0.26376935, -0.16044119, -1.50370731, 0.25287146, -0.98706650, -3.57027057, ...
$ S                        <dbl> 0.7143403, 0.6624894, -1.5365134, 0.5985804, 0.7143403, -1.5365134, 0.6896315, -1.5365134, 0.7836385, ...
$ I                        <dbl> -1.0157459, -1.0157459, -1.0157459, 0.8904827, 1.0520140, -1.0157459, 0.8721398, -1.0157459, 1.3785101...
$ D                        <dbl> 0.76534970, -0.13507072, 0.87292300, -1.28789940, 2.77441913, -1.00029287, -0.67417647, -1.62958283, -...
$ BM                       <dbl> 0.20387991, -0.68397621, -1.10253696, -0.49372133, 1.40882750, -0.81081280, -0.12589522, 0.41950211, -...
$ P                        <dbl> 0.6504082, -0.1684863, 0.7823084, -0.1684863, -2.2875506, 0.5668468, 0.3329083, 0.8846516, 0.1474983, ...
$ A                        <dbl> 1.43544387, -0.04590939, 0.05247453, -1.25279578, 0.14783077, -0.14751959, -0.59096525, -0.25257485, 1...
$ Segment                  <int> 4, 3, 2, 3, 5, 2, 3, 1, 4, 2, 2, 2, 2, 4, 4, 1, 5, 2, 3, 3, 4, 2, 2, 3, 4, 4, 2, 3, 4, 2, 4, 4, 3, 2, ...

Some random data to work with:

r <- function(x) {rnorm(x, 50, 2)}
set.seed(123)
diabetes.df <- data.frame(Age = round(r(10)), Outcome = as.character((r(10) < 50)*1))

> diabetes.df
   Age Outcome
1   49       0
2   50       0
3   53       0
4   50       0
5   50       1
6   53       0
7   51       0
8   47       1
9   49       0
10  49       1

Then pivot_wider() from tidyr does what you want:

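library(dplyr)
library(tidyr)

# Count the rows for each Age/Outcome combination
df <- diabetes.df %>%
  select(Age, Outcome) %>%
  group_by(Age, Outcome) %>%
  dplyr::summarize(Diabetes_Count = n()) %>%
  filter(Outcome == "1" | Outcome == "0")

# One column per Outcome value; combinations that never occur are filled with 0
df <- pivot_wider(df, names_from = "Outcome", values_from = "Diabetes_Count", names_prefix = "Outcome_", values_fill = 0)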

> df
# A tibble: 5 x 3
# Groups:   Age [5]
    Age Outcome_1 Outcome_0
  <dbl>     <int>     <int>
1    47         1         0
2    49         1         2
3    50         1         2
4    51         0         1
5    53         0         2

> df %>% mutate(`Outcome_1/Outcome_0` = Outcome_1 / Outcome_0)
# A tibble: 5 x 4
# Groups:   Age [5]
    Age         Outcome_1         Outcome_0  `Outcome_1/Outcome_0`
  <dbl>             <int>             <int>                 <dbl>
1    47                 1                 0                 Inf  
2    49                 1                 2                 0.5
3    50                 1                 2                 0.5
4    51                 0                 1                 0  
5    53                 0                 2                 0
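
Note that the ratio Outcome_1 / Outcome_0 is Inf whenever an age has no 0s, and it is an odds-style ratio rather than the percentage of 1s the question asks for. As a possible alternative, here is a minimal sketch that skips the pivot and computes both counts and the share of 1s in a single summarise() call. The names by_age and Pct_1 are made up for illustration, and the toy data is the same as above.

```r
library(dplyr)

# Same toy data as above (Outcome is stored as character "0"/"1" here)
r <- function(x) rnorm(x, 50, 2)
set.seed(123)
diabetes.df <- data.frame(Age = round(r(10)),
                          Outcome = as.character((r(10) < 50) * 1))

by_age <- diabetes.df %>%
  group_by(Age) %>%
  summarise(
    Outcome_1 = sum(Outcome == "1"),         # rows with diabetes at this age
    Outcome_0 = sum(Outcome == "0"),         # rows without diabetes at this age
    Pct_1     = 100 * mean(Outcome == "1"),  # 1s as a percentage of all rows at this age
    .groups   = "drop"                       # return an ungrouped tibble
  )

by_age
```

Because Pct_1 divides by the number of rows in each age group, which is never zero, it always stays between 0 and 100. The comparison Outcome == "1" should also work when Outcome is an integer column, as in the real dataset, since R coerces the integer to character before comparing.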