根据 Dataframe 中的实例计数创建 2x2 条件 Table
Create 2x2 Conditional Table from Counting Instances in Dataframe
我有一个数据框,按作者的性别、他们在项目中的角色和标识符 (PMID)(见下文)细分。
我需要创建一个 2x2 的偶然事件 table,这样我就可以计算女性第一作者与女性高级作者之间关联的优势比。为此,我需要计算以下内容:
- A:第一作者为女性且高级作者为女性的次数
- B:第一作者为女性且资深作者为男性的次数
- C:第一作者为男性且高级作者为男性的次数
- D:第一作者为男性且高级作者为女性的次数
(很明显,如果只有一名高级人员或只有一名
每个 PMID 的第一作者)
我已经按 PMID 对 table(见下文)进行了分组,所以我真的只需要弄清楚如何计算上面的每个实例。遇到困难,不胜感激!
# A tibble: 178,056 x 3
# Groups: pmid [101,907]
gender authorship pmid
<chr> <chr> <chr>
1 male First 18958667
2 male Senior 18958667
3 male First 18958651
4 male First 18751818
5 male Senior 18751818
6 male First 18751811
7 male Senior 18751811
8 female First 18751810
9 female Senior 18751810
10 male First 18088800
11 male Senior 18088800
12 male First 17710072
13 female First 17977216
14 male Senior 17762065
15 male First 17611457
16 male First 17611433
17 male First 17532688
18 male Senior 17532688
19 female First 17405310
20 male Senior 17386862
21 female First 17319096
22 male Senior 17319096
23 female First 17300028
24 male First 17282480
25 female First 17177771
26 male First 17124681
27 female First 17093906
28 female First 17042011
29 male Senior 17042011
30 female First 17042010
31 male Senior 17042010
32 female First 17042006
33 male Senior 17042006
34 female First 17042003
35 female First 17042002
36 male Senior 17042002
37 male First 17042001
38 female First 17041999
39 male Senior 17041997
40 female First 17041995
41 female First 17041994
42 female First 17041993
43 female Senior 17041993
44 female First 17041992
45 female Senior 17041992
46 female First 17041991
47 male First 17041990
48 male Senior 17041990
49 male First 17041989
50 male Senior 17041989
pivot_wider
给出了一个简洁的解决方案
library(dplyr)
library(tidyr)
newdf <-
mydf %>%
group_by(pmid) %>%
pivot_wider(names_from = authorship,
values_from = gender)
table(newdf$First, newdf$Senior)
#>
#> female male
#> female 3 5
#> male 0 7
chisq.test(table(newdf$First, newdf$Senior))
#> Warning in chisq.test(table(newdf$First, newdf$Senior)): Chi-squared
#> approximation may be incorrect
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(newdf$First, newdf$Senior)
#> X-squared = 1.356, df = 1, p-value = 0.2442
newdf %>%
filter(!is.na(First) & !is.na(Senior))
#> # A tibble: 15 x 3
#> # Groups: pmid [15]
#> pmid First Senior
#> <chr> <chr> <chr>
#> 1 18958667 male male
#> 2 18751818 male male
#> 3 18751811 male male
#> 4 18751810 female female
#> 5 18088800 male male
#> 6 17532688 male male
#> 7 17319096 female male
#> 8 17042011 female male
#> 9 17042010 female male
#> 10 17042006 female male
#> 11 17042002 female male
#> 12 17041993 female female
#> 13 17041992 female female
#> 14 17041990 male male
#> 15 17041989 male male
table(newdf$First, newdf$Senior)
#>
#> female male
#> female 3 5
#> male 0 7
chisq.test(table(newdf$First, newdf$Senior))
#> Warning in chisq.test(table(newdf$First, newdf$Senior)): Chi-squared
#> approximation may be incorrect
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(newdf$First, newdf$Senior)
#> X-squared = 1.356, df = 1, p-value = 0.2442
您的数据
mydf <- tibble(
gender = c("male", "male", "male", "male", "male",
"male", "male", "female", "female", "male", "male", "male", "female",
"male", "male", "male", "male", "male", "female", "male", "female",
"male", "female", "male", "female", "male", "female", "female",
"male", "female", "male", "female", "male", "female", "female",
"male", "male", "female", "male", "female", "female", "female",
"female", "female", "female", "female", "male", "male", "male",
"male"),
authorship = c("First", "Senior", "First", "First",
"Senior", "First", "Senior", "First", "Senior", "First", "Senior",
"First", "First", "Senior", "First", "First", "First", "Senior",
"First", "Senior", "First", "Senior", "First", "First", "First",
"First", "First", "First", "Senior", "First", "Senior", "First",
"Senior", "First", "First", "Senior", "First", "First", "Senior",
"First", "First", "First", "Senior", "First", "Senior", "First",
"First", "Senior", "First", "Senior"),
pmid = c("18958667", "18958667",
"18958651", "18751818", "18751818", "18751811", "18751811", "18751810",
"18751810", "18088800", "18088800", "17710072", "17977216", "17762065",
"17611457", "17611433", "17532688", "17532688", "17405310", "17386862",
"17319096", "17319096", "17300028", "17282480", "17177771", "17124681",
"17093906", "17042011", "17042011", "17042010", "17042010", "17042006",
"17042006", "17042003", "17042002", "17042002", "17042001", "17041999",
"17041997", "17041995", "17041994", "17041993", "17041993", "17041992",
"17041992", "17041991", "17041990", "17041990", "17041989", "17041989")
)
使用 golf-coding 的整洁解决方案:
library(tidyr)
contig_table <- mydf %>%
spread(authorship, gender) %>%
#Only need drop_na() if data is incomplete
#Given that you have a lot more rows I assume this will not be needed
drop_na() %$%
table(First, Senior)
我有一个数据框,按作者的性别、他们在项目中的角色和标识符 (PMID)(见下文)细分。
我需要创建一个 2x2 的偶然事件 table,这样我就可以计算女性第一作者与女性高级作者之间关联的优势比。为此,我需要计算以下内容:
- A:第一作者为女性且高级作者为女性的次数
- B:第一作者为女性且资深作者为男性的次数
- C:第一作者为男性且高级作者为男性的次数
- D:第一作者为男性且高级作者为女性的次数 (很明显,如果只有一名高级人员或只有一名 每个 PMID 的第一作者)
我已经按 PMID 对 table(见下文)进行了分组,所以我真的只需要弄清楚如何计算上面的每个实例。遇到困难,不胜感激!
# A tibble: 178,056 x 3
# Groups: pmid [101,907]
gender authorship pmid
<chr> <chr> <chr>
1 male First 18958667
2 male Senior 18958667
3 male First 18958651
4 male First 18751818
5 male Senior 18751818
6 male First 18751811
7 male Senior 18751811
8 female First 18751810
9 female Senior 18751810
10 male First 18088800
11 male Senior 18088800
12 male First 17710072
13 female First 17977216
14 male Senior 17762065
15 male First 17611457
16 male First 17611433
17 male First 17532688
18 male Senior 17532688
19 female First 17405310
20 male Senior 17386862
21 female First 17319096
22 male Senior 17319096
23 female First 17300028
24 male First 17282480
25 female First 17177771
26 male First 17124681
27 female First 17093906
28 female First 17042011
29 male Senior 17042011
30 female First 17042010
31 male Senior 17042010
32 female First 17042006
33 male Senior 17042006
34 female First 17042003
35 female First 17042002
36 male Senior 17042002
37 male First 17042001
38 female First 17041999
39 male Senior 17041997
40 female First 17041995
41 female First 17041994
42 female First 17041993
43 female Senior 17041993
44 female First 17041992
45 female Senior 17041992
46 female First 17041991
47 male First 17041990
48 male Senior 17041990
49 male First 17041989
50 male Senior 17041989
pivot_wider
library(dplyr)
library(tidyr)
newdf <-
mydf %>%
group_by(pmid) %>%
pivot_wider(names_from = authorship,
values_from = gender)
table(newdf$First, newdf$Senior)
#>
#> female male
#> female 3 5
#> male 0 7
chisq.test(table(newdf$First, newdf$Senior))
#> Warning in chisq.test(table(newdf$First, newdf$Senior)): Chi-squared
#> approximation may be incorrect
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(newdf$First, newdf$Senior)
#> X-squared = 1.356, df = 1, p-value = 0.2442
newdf %>%
filter(!is.na(First) & !is.na(Senior))
#> # A tibble: 15 x 3
#> # Groups: pmid [15]
#> pmid First Senior
#> <chr> <chr> <chr>
#> 1 18958667 male male
#> 2 18751818 male male
#> 3 18751811 male male
#> 4 18751810 female female
#> 5 18088800 male male
#> 6 17532688 male male
#> 7 17319096 female male
#> 8 17042011 female male
#> 9 17042010 female male
#> 10 17042006 female male
#> 11 17042002 female male
#> 12 17041993 female female
#> 13 17041992 female female
#> 14 17041990 male male
#> 15 17041989 male male
table(newdf$First, newdf$Senior)
#>
#> female male
#> female 3 5
#> male 0 7
chisq.test(table(newdf$First, newdf$Senior))
#> Warning in chisq.test(table(newdf$First, newdf$Senior)): Chi-squared
#> approximation may be incorrect
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(newdf$First, newdf$Senior)
#> X-squared = 1.356, df = 1, p-value = 0.2442
您的数据
mydf <- tibble(
gender = c("male", "male", "male", "male", "male",
"male", "male", "female", "female", "male", "male", "male", "female",
"male", "male", "male", "male", "male", "female", "male", "female",
"male", "female", "male", "female", "male", "female", "female",
"male", "female", "male", "female", "male", "female", "female",
"male", "male", "female", "male", "female", "female", "female",
"female", "female", "female", "female", "male", "male", "male",
"male"),
authorship = c("First", "Senior", "First", "First",
"Senior", "First", "Senior", "First", "Senior", "First", "Senior",
"First", "First", "Senior", "First", "First", "First", "Senior",
"First", "Senior", "First", "Senior", "First", "First", "First",
"First", "First", "First", "Senior", "First", "Senior", "First",
"Senior", "First", "First", "Senior", "First", "First", "Senior",
"First", "First", "First", "Senior", "First", "Senior", "First",
"First", "Senior", "First", "Senior"),
pmid = c("18958667", "18958667",
"18958651", "18751818", "18751818", "18751811", "18751811", "18751810",
"18751810", "18088800", "18088800", "17710072", "17977216", "17762065",
"17611457", "17611433", "17532688", "17532688", "17405310", "17386862",
"17319096", "17319096", "17300028", "17282480", "17177771", "17124681",
"17093906", "17042011", "17042011", "17042010", "17042010", "17042006",
"17042006", "17042003", "17042002", "17042002", "17042001", "17041999",
"17041997", "17041995", "17041994", "17041993", "17041993", "17041992",
"17041992", "17041991", "17041990", "17041990", "17041989", "17041989")
)
使用 golf-coding 的整洁解决方案:
library(tidyr)
contig_table <- mydf %>%
spread(authorship, gender) %>%
#Only need drop_na() if data is incomplete
#Given that you have a lot more rows I assume this will not be needed
drop_na() %$%
table(First, Senior)