Expand.grid p-value matrix: fill equal variables with NA
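I have to run a large number of chi-square/Fisher tests on the categorical data in my dataset. Given the number of categorical variables, I knew this would take a lot of time, so I found a function here and modified it to suit my needs.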
>HRchi
# A tibble: 6 x 13
Position State Sex MaritalDesc CitizenDesc HispanicLatino RaceDesc TermReason EmploymentStatus Department ManagerName RecruitmentSour~
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Productio~ MA "M " Single US Citizen No White N/A-StillE~ Active "Producti~ Michael Al~ LinkedIn
2 Sr. DBA MA "M " Married US Citizen No White career cha~ Voluntarily Term~ "IT/IS" Simon Roup Indeed
3 Productio~ MA "F" Married US Citizen No White hours Voluntarily Term~ "Producti~ Kissy Sull~ LinkedIn
4 Productio~ MA "F" Married US Citizen No White N/A-StillE~ Active "Producti~ Elijiah Gr~ Indeed
5 Productio~ MA "F" Divorced US Citizen No White return to ~ Voluntarily Term~ "Producti~ Webster Bu~ Google Search
6 Productio~ MA "F" Single US Citizen No White N/A-StillE~ Active "Producti~ Amy Dunn LinkedIn
# ... with 1 more variable: PerformanceScore <chr>
>
The function I use to run the tests is as follows:
col_combinations <- expand.grid(names(HRchi), names(HRchi))
cor_test_wrapper <- function(col_name1, col_name2, data_frame) {
  format(fisher.test(data_frame[[col_name1]], data_frame[[col_name2]],
                     simulate.p.value = TRUE, B = 1e6)$p.value,
         scientific = FALSE)
}
p_vals <- mapply(cor_test_wrapper,
                 col_name1 = col_combinations[[1]],
                 col_name2 = col_combinations[[2]],
                 MoreArgs = list(data_frame = HRchi))
Ficher.pvalue.matrix <- matrix(p_vals, 13, 13, dimnames = list(names(HRchi), names(HRchi)))
Ficher.pvalue.matrix
This returns a matrix of p-values:
rowname Position State Sex MaritalDesc CitizenDesc HispanicLatino RaceDesc TermReason EmploymentStatus Department ManagerName RecruitmentSour~
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Positi~ 0.00000~ 0.00~ 0.31~ 0.8194522 0.6830553 0.03777396 0.16237~ 0.9216931 0.01563398 0.0000009~ 0.00000099~ 0.000002999997
2 State 0.00000~ 0.00~ 0.14~ 0.5327625 0.4954165 0.4240866 0.00748~ 0.980687 0.8377042 0.0000009~ 0.00000099~ 0.02947497
3 Sex 0.31226~ 0.14~ 0.00~ 0.6979593 0.6987973 0.8145132 0.94932~ 0.6053784 0.959038 0.2443258 0.06263294 0.1271179
4 Marita~ 0.81893~ 0.53~ 0.69~ 0.00000099~ 0.9265121 0.5331945 0.48005~ 0.0059059~ 0.008646991 0.7705712 0.8863871 0.2533087
5 Citize~ 0.68347~ 0.49~ 0.70~ 0.9270521 0.00000099~ 1 0.05806~ 0.1407349 0.2222708 0.4063666 0.8475872 0.1891118
6 Hispan~ 0.03778~ 0.42~ 0.81~ 0.5330425 1 0.000000999999 0.04130~ 0.8368642 1 0.05423295 0.1162419 0.06414394
7 RaceDe~ 0.16164~ 0.00~ 0.94~ 0.4804555 0.05764794 0.04088996 0.00000~ 0.972402 0.8328322 0.08990291 0.01743098 0.000000999999
8 TermRe~ 0.92143~ 0.98~ 0.60~ 0.005702994 0.1414139 0.8366842 0.97238 0.0000009~ 0.000000999999 0.2481378 0.7842482 0.0002929997
9 Employ~ 0.01571~ 0.83~ 0.95~ 0.008722991 0.2230458 1 0.83268~ 0.0000009~ 0.000000999999 0.0025569~ 0.001606998 0.000000999999
10 Depart~ 0.00000~ 0.00~ 0.24~ 0.7694292 0.4063906 0.05454395 0.09036~ 0.2486848 0.002619997 0.0000009~ 0.00000099~ 0.000000999999
11 Manage~ 0.00000~ 0.00~ 0.06~ 0.8851031 0.8472942 0.1168469 0.01726~ 0.7852542 0.001648998 0.0000009~ 0.00000099~ 0.000001999998
12 Recrui~ 0.00000~ 0.02~ 0.12~ 0.2529637 0.1878758 0.06357094 0.00000~ 0.0003429~ 0.000002999997 0.0000009~ 0.00000099~ 0.000000999999
13 Perfor~ 0.76044~ 0.56~ 0.47~ 0.9184571 0.7584852 1 0.15887~ 0.06789893 0.003164997 0.6032454 0.2900097 0.3136187
# ... with 1 more variable: PerformanceScore <chr>
What I would like to know is whether it is possible to set everything above the diagonal (Position = Position, State = State, etc.) to NA, so that the data frame is less cluttered.
You can use the diag function to replace the values on the diagonal. For example:
# Create an example matrix (correlation matrix of the mtcars data)
myMatrix <- cor(mtcars)
# Replace the diagonal of the matrix with NA
diag(myMatrix) <- NA
To change the upper or lower triangle:
# Upper
myMatrix[upper.tri(myMatrix)] <- NA
# Lower
myMatrix[lower.tri(myMatrix)] <- NA
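To see which cells these helpers select, here is a minimal sketch on a made-up 3 x 3 matrix (the names `m`, `m_upper` and `m_upper_diag` and the values are illustrative only). Note that `upper.tri()` leaves the diagonal alone unless `diag = TRUE` is passed, which may matter if the self-comparisons should be hidden too.

```r
# Toy 3 x 3 matrix; the values are arbitrary and only for illustration
m <- matrix(1:9, nrow = 3, dimnames = list(letters[1:3], letters[1:3]))

# Blank out everything strictly above the diagonal
m_upper <- m
m_upper[upper.tri(m_upper)] <- NA

# Blank out the diagonal as well by passing diag = TRUE
m_upper_diag <- m
m_upper_diag[upper.tri(m_upper_diag, diag = TRUE)] <- NA

print(m_upper)
print(m_upper_diag)
```

The second form keeps only the cells strictly below the diagonal, so each pair of variables appears once.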
You can use upper.tri:
Ficher.pvalue.matrix[upper.tri(Ficher.pvalue.matrix)] <- NA
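Since the question mentions that the tests take a long time, another option might be to run each pair only once and never compute the upper triangle at all. The following is only a sketch under assumptions: `toy` is a small simulated stand-in for `HRchi` (not the asker's data), `p_mat` is a hypothetical name, and the plain exact `fisher.test()` is used here instead of the simulated p-value from the question.

```r
# Hypothetical stand-in for HRchi: three small categorical columns
set.seed(42)
toy <- data.frame(
  Sex        = sample(c("F", "M"), 40, replace = TRUE),
  Status     = sample(c("Active", "Terminated"), 40, replace = TRUE),
  Department = sample(c("IT", "Production", "Sales"), 40, replace = TRUE),
  stringsAsFactors = FALSE
)

vars <- names(toy)

# Start from an all-NA numeric matrix, so untested cells simply stay NA
p_mat <- matrix(NA_real_, length(vars), length(vars),
                dimnames = list(vars, vars))

# combn() enumerates each unordered pair of columns exactly once
pairs <- combn(vars, 2)

for (k in seq_len(ncol(pairs))) {
  a <- pairs[1, k]  # earlier variable
  b <- pairs[2, k]  # later variable
  # Fill the lower triangle: row = later variable, column = earlier variable
  p_mat[b, a] <- fisher.test(toy[[a]], toy[[b]])$p.value
}

print(p_mat)
```

For 13 variables this would run 78 tests instead of 169, and the matrix stays numeric rather than character because `format()` is not applied.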