Count the number of cases in two of several categories in R?
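I have a dataset describing a sample of people and the number and type of diseases they have. Here, 1 means the person has the disease, 0 means the person does not, and NA marks a missing value. It looks like this:
library(tidyverse)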
df <- tribble(
~Heart_disease, ~Lung_disease, ~Bowel_disease, ~Nerve_disease, ~Liver_disease
, 0, 1, 0, 1, 0
, NA, 0, 0, 0, 0
, 1, 1, 1, 1, 0
, 0, 1, 0, 0, 1
, 1, 0, 0, 1, 0
, 0, 0, 1, NA, NA
, 1, 0, 0, 0, 0
, 0, 0, 1, 0, 1
, 0, 0, 0, 0, 0
, 0, 1, 1, 1, 1
)
Heart_disease Lung_disease Bowel_disease Nerve_disease Liver_disease
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 1 0 1 0
2 NA 0 0 0 0
3 1 1 1 1 0
4 0 1 0 0 1
5 1 0 0 1 0
6 0 0 1 NA NA
7 1 0 0 0 0
8 0 0 1 0 1
9 0 0 0 0 0
10 0 1 1 1 1
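I would like to know:
a) How many people have two diseases?
b) How many people have three or more diseases?
How can I calculate this in R?
Many thanks for your help.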
So, here is a dplyr / tidyverse solution:
library(tidyverse)
df <- tribble(
~Heart_disease, ~Lung_disease, ~Bowel_disease, ~Nerve_disease, ~Liver_disease
, 0, 1, 0, 1, 0
, NA, 0, 0, 0, 0
, 1, 1, 1, 1, 0
, 0, 1, 0, 0, 1
, 1, 0, 0, 1, 0
, 0, 0, 1, NA, NA
, 1, 0, 0, 0, 0
, 0, 0, 1, 0, 1
, 0, 0, 0, 0, 0
, 0, 1, 1, 1, 1
)
df %>%
mutate(patientID = 1:nrow(.)) %>%
gather("disease", "occured", -patientID) %>%
group_by(patientID) %>%
summarise(nrDiseases = sum(occured, na.rm = TRUE)) %>%
arrange(nrDiseases) %>%
group_by(nrDiseases) %>%
summarise(howManyPeople = n())
nrDiseases howManyPeople
<dbl> <int>
1 0 2
2 1 2
3 2 4
4 4 2
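The table above gives the number of people per disease count, but not yet the two numbers asked for. A minimal sketch of reading them off, assuming the frequency table is typed in by hand exactly as printed (the names `counts`, `answers`, `exactly_two` and `three_or_more` are mine, not from the answer):

```r
library(dplyr)
library(tibble)

# The frequency table printed above, re-entered by hand (assumption:
# in practice you would pipe the result of the code above instead).
counts <- tribble(
  ~nrDiseases, ~howManyPeople,
  0, 2,
  1, 2,
  2, 4,
  4, 2
)

# a) people with exactly two diseases, b) people with three or more
answers <- counts %>%
  summarise(
    exactly_two   = sum(howManyPeople[nrDiseases == 2]),
    three_or_more = sum(howManyPeople[nrDiseases >= 3])
  )

answers
```

For this sample that should come out as 4 people with exactly two diseases and 2 people with three or more.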
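In case it is unclear how this works: %>% should be read as "then". Try running only part of the code to see the intermediate results, for example this part: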
df %>%
mutate(patientID = 1:nrow(.)) %>%
gather("disease", "occured", -patientID) %>%
group_by(patientID) %>%
summarise(nrDiseases = sum(occured, na.rm = TRUE))
will give you this:
patientID nrDiseases
<int> <dbl>
1 1 2
2 2 0
3 3 4
4 4 2
5 5 2
6 6 1
7 7 1
8 8 2
9 9 0
10 10 4
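As a side note, gather() has been superseded in tidyr by pivot_longer(). A rough equivalent of the pipeline above might look like the sketch below. It assumes tidyr 1.0.0 or later; `per_patient`, `freq` and the column name `occurred` are my own names, and count() is used as a shortcut for group_by() followed by summarise(n()).

```r
library(dplyr)
library(tidyr)
library(tibble)

df <- tribble(
  ~Heart_disease, ~Lung_disease, ~Bowel_disease, ~Nerve_disease, ~Liver_disease,
  0,  1, 0, 1,  0,
  NA, 0, 0, 0,  0,
  1,  1, 1, 1,  0,
  0,  1, 0, 0,  1,
  1,  0, 0, 1,  0,
  0,  0, 1, NA, NA,
  1,  0, 0, 0,  0,
  0,  0, 1, 0,  1,
  0,  0, 0, 0,  0,
  0,  1, 1, 1,  1
)

# One row per patient and disease, then the number of diseases per patient.
per_patient <- df %>%
  mutate(patientID = row_number()) %>%
  pivot_longer(-patientID, names_to = "disease", values_to = "occurred") %>%
  group_by(patientID) %>%
  summarise(nrDiseases = sum(occurred, na.rm = TRUE))

# How many patients have 0, 1, 2, ... diseases.
freq <- per_patient %>%
  count(nrDiseases, name = "howManyPeople")

freq
```

The per-patient counts and the frequency table should match the outputs shown above.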
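Here is one way. I assumed that each row number (row name) represents one person. You want the sum of each row, which rowSums() gives you. With that, you can summarize the data. I counted how many rows have a 2 in the column total, and did the same kind of thing for the other condition.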
library(dplyr)
mutate(mydf, total = rowSums(mydf, na.rm = TRUE)) %>%
summarize(two = sum(total == 2), morethan3 = sum(total >= 3))
# two morethan3
#1 4 2
Data
mydf <- structure(list(Heart_disease = c(0L, NA, 1L, 0L, 1L, 0L, 1L,
0L, 0L, 0L), Lung_disease = c(1L, 0L, 1L, 1L, 0L, 0L, 0L, 0L,
0L, 1L), Bowel_disease = c(0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L,
1L), Nerve_disease = c(1L, 0L, 1L, 0L, 1L, NA, 0L, 0L, 0L, 1L
), Liver_disease = c(0L, 0L, 0L, 1L, 0L, NA, 0L, 1L, 0L, 1L)), class =
"data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
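One thing to keep in mind with na.rm = TRUE: a missing value is counted as "no disease", so a person with NAs can end up in a lower category than they really belong to. Below is a base R sketch of the same rowSums() idea that also lists the rows with missing values. The data frame is rebuilt with plain numeric columns for brevity, and `n_missing`, `three_or_more` and `uncertain` are names I made up for illustration.

```r
mydf <- data.frame(
  Heart_disease = c(0, NA, 1, 0, 1, 0, 1, 0, 0, 0),
  Lung_disease  = c(1, 0, 1, 1, 0, 0, 0, 0, 0, 1),
  Bowel_disease = c(0, 0, 1, 0, 0, 1, 0, 1, 0, 1),
  Nerve_disease = c(1, 0, 1, 0, 1, NA, 0, 0, 0, 1),
  Liver_disease = c(0, 0, 0, 1, 0, NA, 0, 1, 0, 1)
)

# Known diseases per person; NA is treated as "not recorded", i.e. 0.
total <- rowSums(mydf, na.rm = TRUE)

# Number of missing entries per person.
n_missing <- rowSums(is.na(mydf))

two           <- sum(total == 2)   # a) exactly two diseases
three_or_more <- sum(total >= 3)   # b) three or more diseases

# Rows whose category could change if the missing values were known.
uncertain <- unname(which(n_missing > 0))

two
three_or_more
uncertain
```

Here rows 2 and 6 contain missing values. Row 6, for instance, has one known disease and two NAs, so that person could have anywhere from one to three diseases. Whether to keep such rows is a judgment call that depends on your data.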