Finding the highest frequencies in multiple columns
I have a data_frame with 10 columns and 2000 rows. My sample data looks like this:
rs_id Code Combination_Ag A.Ag Combination_Bg B.Ag Combination_Cg C.Ag
rs_1 0 1:01/1:01 1 13:02/13:02 1 03:04/03:04 6 1:01/1:01 1
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/15:02 1 1:01/15:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/2:05: 1 13:02/8:01 1 06:02/06:02 3 1:01/4:04 1
rs_1 2 1:01/24:02 3 14:01/14:02 1 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 1:01/31:01 1 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
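I am trying to find the combination with the highest frequency (A.Ag, B.Ag, C.Ag) for each rs_id and each Code (0, 1 and 2). How can I do that?
The output would be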
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/03:04 6 1:01/1:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/24:02 3 06:02/06:02 3 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
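This approach reshapes the data from wide to long format (melting two measure columns simultaneously) and picks the row with the highest Ag value for each unique combination of rs_id, Code, and variable. Finally, the result is reshaped from long to wide format again, and the column order is rearranged to return the expected result: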
library(data.table)
cols <- c("Combination", "Ag")
melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"),
value.name = cols)[
, variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
, .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
, dcast(.SD, rs_id + Code ~ variable, value.var = cols)][
, setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+"))))]
rs_id Code Combination_A Ag_A Combination_B Ag_B Combination_C Ag_C Combination_D Ag_D
1: rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/03:04 6 1:01/1:01 1
2: rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
3: rs_1 2 1:01/24:02 3 13:02/8:01 1 06:02/06:02 3 1:01/4:04 3
4: rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
5: rs_2 1 11:01/2:01 4 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2
6: rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5
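The chained expression can be hard to follow in one go, so here is a minimal sketch of the same three steps written out one at a time. It assumes a hypothetical toy table (`toy`, with made-up values `x1`, `y1`, ... and only two measure groups instead of four) and skips the relabelling and reordering steps, so the reshaped columns carry the default suffixes `_1` and `_2`.

```r
library(data.table)

# Hypothetical toy data: two measure groups (A, B) instead of four
toy <- data.table(
  rs_id          = c("rs_1", "rs_1", "rs_1"),
  Code           = c(0L, 0L, 1L),
  Combination_Ag = c("x1", "x2", "x3"),
  A.Ag           = c(1L, 2L, 6L),
  Combination_Bg = c("y1", "y2", "y3"),
  B.Ag           = c(5L, 2L, 1L)
)
cols <- c("Combination", "Ag")

# Step 1: wide -> long, melting both measure columns at once
long <- melt(toy, measure.vars = patterns("Combination", "[A-B][.]Ag"),
             value.name = cols)

# Step 2: within each rs_id / Code / variable group keep the row with the
# largest Ag; which.max() returns the first maximum, so ties go to the
# row that appears first
best <- long[, .SD[which.max(Ag)], by = .(rs_id, Code, variable)]

# Step 3: long -> wide again, one row per rs_id / Code
wide <- dcast(best, rs_id + Code ~ variable, value.var = cols)
wide
```

Because `which.max()` keeps only the first of several equal maxima, rows that tie on Ag are resolved by their order in the input.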
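Edit
The OP has asked for an explanation of the last chained data.table expression, setcolorder(.SD, c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))).
This expression reorders the columns of the result by reference, i.e., without copying. When reshaping with multiple value.var, the resulting columns are grouped by value.var: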
melt(setDT(DF), measure.vars = patterns("Combination", "[A-D][.]Ag"),
value.name = cols)[
, variable := forcats::lvls_revalue(variable, LETTERS[1:4])][
, .SD[which.max(Ag)], by = .(rs_id, Code, variable)][
, dcast(.SD, rs_id + Code ~ variable, value.var = cols)]
rs_id Code Combination_A Combination_B Combination_C Combination_D Ag_A Ag_B Ag_C Ag_D
1: rs_1 0 1:01/11:01 13:02/49:01 03:04/03:04 1:01/1:01 2 2 6 1
2: rs_1 1 1:01/2:01 13:02/57:01 03:04/7:01 1:01/3:01 6 1 2 1
3: rs_1 2 1:01/24:02 13:02/8:01 06:02/06:02 1:01/4:04 3 1 3 3
4: rs_2 0 1:01/3:01 14:01/7:02 06:02/2:02: 1:01/4:07 1 1 1 1
5: rs_2 1 11:01/2:01 15:01/15:01 06:02/3:03 1:01/7:01 4 1 1 2
6: rs_2 2 11:01/25:01 15:01/44:02 06:02/4:01 10:01/3:01 1 2 1 5
The OP, however, expects the output to be grouped by variable. So the desired column order is
c(1, 2, 3, 7, 4, 8, 5, 9, 6, 10)
1 and 2 denote the id.var columns. as.vector(outer(c(0, 4), 3:6, "+")) is just a way to save typing 3, 7, 4, 8, 5, 9, 6, 10:
outer(c(0, 4), 3:6, "+")
[,1] [,2] [,3] [,4]
[1,] 3 4 5 6
[2,] 7 8 9 10
as.vector(outer(c(0, 4), 3:6, "+"))
[1] 3 7 4 8 5 9 6 10
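The same idea can be generalised to other table shapes. The sketch below wraps it in a hypothetical helper, `interleave_cols()` (not part of data.table), under the assumption that the id columns come first and the value columns follow in blocks of equal size, as dcast() produced them above.

```r
# Hypothetical helper: column positions that interleave the value blocks
#   n_id  = number of id columns (here 2: rs_id, Code)
#   n_var = number of variables per value column (here 4: A to D)
#   n_val = number of value columns (here 2: Combination, Ag)
interleave_cols <- function(n_id, n_var, n_val) {
  offsets     <- (seq_len(n_val) - 1L) * n_var  # 0, 4
  first_block <- n_id + seq_len(n_var)          # 3, 4, 5, 6
  # outer() builds an n_val x n_var matrix; as.vector() reads it column by
  # column, which yields the interleaved order
  c(seq_len(n_id), as.vector(outer(offsets, first_block, "+")))
}

interleave_cols(n_id = 2, n_var = 4, n_val = 2)
# [1]  1  2  3  7  4  8  5  9  6 10
```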
Edit 2
The code can be streamlined further. There is no need to call as.vector() within c(), because c() already turns arrays into vectors. So, instead of
c(1:2, as.vector(outer(c(0, 4), 3:6, "+")))
we can write
c(1:2, outer(c(0, 4), 3:6, "+"))
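A quick check in base R confirms that both spellings give the same result. The variable name `m` is only used here for illustration.

```r
m <- outer(c(0, 4), 3:6, "+")

dim(m)              # the matrix has dimensions 2 x 4
is.null(dim(c(m)))  # c() drops the dim attribute

# Both forms produce the same plain vector
identical(c(1:2, m), c(1:2, as.vector(m)))
```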
Data
Note that I have added the missing column headers for the last two columns.
library(data.table)
DF <- fread(
"rs_id Code Combination_Ag A.Ag Combination_Bg B.Ag Combination_Cg C.Ag Combination_Dg D.Ag
rs_1 0 1:01/1:01 1 13:02/13:02 1 03:04/03:04 6 1:01/1:01 1
rs_1 0 1:01/11:01 2 13:02/49:01 2 03:04/15:02 1 1:01/15:01 1
rs_1 1 1:01/2:01 6 13:02/57:01 1 03:04/7:01 2 1:01/3:01 1
rs_1 2 1:01/2:05: 1 13:02/8:01 1 06:02/06:02 3 1:01/4:04 1
rs_1 2 1:01/24:02 3 14:01/14:02 1 06:02/15:02 1 1:01/4:04 3
rs_2 0 1:01/3:01 1 14:01/7:02 1 06:02/2:02: 1 1:01/4:07 1
rs_2 1 1:01/31:01 1 15:01/15:01 1 06:02/3:03 1 1:01/7:01 2
rs_2 1 11:01/2:01 4 15:01/18:01 1 06:02/3:04 1 10:01/14:01 1
rs_2 2 11:01/25:01 1 15:01/44:02 2 06:02/4:01 1 10:01/3:01 5"
)