Add conditional group identifier using rollup functions
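I have a data frame that contains subsequences (groups of rows).
The condition that identifies these subsequences is a surge in the diff column. This is what the data looks like: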
> dput(test)
structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"),
events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click",
"mousedown", "mousemove", "mouseup"), class = "factor"),
deltas = structure(6:25, .Label = c("154875", "154878", "154880",
"155866", "155870", "38479", "38488", "38492", "38775", "45595",
"45602", "45606", "45987", "50280", "50285", "50288", "50646",
"54995", "55001", "55005", "55317", "59528", "59533", "59537",
"59921", "63392", "63403", "63408", "63822", "66706", "66710",
"66716", "67002", "73750", "73755", "73759", "74158", "77999",
"78003", "78006", "78076", "81360", "81367", "81371", "82381",
"93365", "93370", "93374", "93872"), class = "factor"),
serial = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20), diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5, 3, 358, 4349, 6, 4,
312, 4211, 5, 4, 384)),
.Names = c("vid", "events", "deltas", "serial", "diff"),
row.names = c(NA, 20L), class = "data.frame")
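I am trying to add a column that indicates when a new subsequence starts and assigns a unique ID to the whole subsequence. I'll demonstrate the grouping criterion with an example:
Row 5 has a diff value of 6820, which is more than 10 times the largest value before that row (283).
The result should be a df like this: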
structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"),
events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click",
"mousedown", "mousemove", "mouseup"), class = "factor"),
deltas = structure(6:25, .Label = c("154875", "154878", "154880",
"155866", "155870", "38479", "38488", "38492", "38775", "45595",
"45602", "45606", "45987", "50280", "50285", "50288", "50646",
"54995", "55001", "55005", "55317", "59528", "59533", "59537",
"59921", "63392", "63403", "63408", "63822", "66706", "66710",
"66716", "67002", "73750", "73755", "73759", "74158", "77999",
"78003", "78006", "78076", "81360", "81367", "81371", "82381",
"93365", "93370", "93374", "93872"), class = "factor"), serial = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20),
diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5,
3, 358, 4349, 6, 4, 312, 4211, 5, 4, 384),
group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)),
.Names = c("vid", "events", "deltas", "serial", "diff", "group"),
row.names = c(NA, 20L), class = "data.frame")
Any help is much appreciated.
Answer provided by user Gopala:
How about df$group <- cumsum(df$diff > 500) + 1 (or whatever criterion you specify)? – Gopala
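Let me walk you through how and why it works.
First, let's add a column without the cumsum part (df here is the data frame that the question calls test):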
df$tag <- df$diff > 500
head(df)
vid events deltas serial diff tag
1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 38479 1 0 FALSE
2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 38488 2 9 FALSE
3 2a38ebc2-dd97-43c8-9726-59c247854df5 mouseup 38492 3 4 FALSE
4 2a38ebc2-dd97-43c8-9726-59c247854df5 click 38775 4 283 FALSE
5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 45595 5 6820 TRUE
6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 45602 6 7 FALSE
As you can see, it simply creates a logical TRUE/FALSE value in the tag column, indicating whether the diff is 'big enough' (based on the chosen threshold).
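To see where the flag lands across the whole column rather than only the first six rows, the comparison can be run on the diff values alone. This is a small sketch: diff_values, threshold and the other variable names are introduced here for illustration, and the numbers are copied from the dput output in the question.

```r
# diff values copied from the question's dput output
diff_values <- c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5,
                 3, 358, 4349, 6, 4, 312, 4211, 5, 4, 384)

# Threshold taken from the answer; any cutoff that separates the
# small within-group gaps from the large between-group gaps would do
threshold <- 500
is_jump <- diff_values > threshold

# Positions that would start a new subsequence
jump_rows <- which(is_jump)
print(jump_rows)

# Largest "small" gap and smallest "large" gap, to judge how much
# room there is for choosing the threshold
largest_small_gap <- max(diff_values[!is_jump])
smallest_large_gap <- min(diff_values[is_jump])
print(c(largest_small_gap, smallest_large_gap))
```

For this data the flagged rows are 5, 9, 13 and 17, and the two clusters of gaps are far apart (at most 384 versus at least 4211), so the exact value 500 is not critical.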
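Now, when you run cumsum on that column and store the result in the group column, it keeps a running total. Every TRUE value increases the cumulative sum by 1, and every FALSE value leaves it at the value it had before that row.
So this gives you the incrementing group values you want: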
df$group <- cumsum(df$tag)
head(df)
vid events deltas serial diff tag group
1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 38479 1 0 FALSE 0
2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 38488 2 9 FALSE 0
3 2a38ebc2-dd97-43c8-9726-59c247854df5 mouseup 38492 3 4 FALSE 0
4 2a38ebc2-dd97-43c8-9726-59c247854df5 click 38775 4 283 FALSE 0
5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 45595 5 6820 TRUE 1
6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 45602 6 7 FALSE 1
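One consequence of this rule is easier to see on a toy vector: because every TRUE bumps the counter, two flagged rows in a row each start their own group. The vector flags below is made up for illustration and is not taken from the question's data.

```r
# Made-up flags, including two consecutive TRUE values
flags <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE)

# TRUE counts as 1 and FALSE as 0, so the running total
# only moves on flagged rows
running_total <- cumsum(flags)
print(running_total)
# [1] 0 0 1 1 2 3 3
```

If consecutive large gaps should belong to a single group instead, the criterion itself would need to change; the cumsum step just counts whatever the criterion flags.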
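Note that the group values start at zero, because the cumulative sum over the first few FALSE values is zero. However, you probably want your group identifiers to start at 1. That is why I added 1 to the cumsum result, but you can also do it as a separate step, as follows.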
df$group <- df$group + 1
head(df)
vid events deltas serial diff tag group
1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 38479 1 0 FALSE 1
2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 38488 2 9 FALSE 1
3 2a38ebc2-dd97-43c8-9726-59c247854df5 mouseup 38492 3 4 FALSE 1
4 2a38ebc2-dd97-43c8-9726-59c247854df5 click 38775 4 283 FALSE 1
5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 45595 5 6820 TRUE 2
6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 45602 6 7 FALSE 2
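As a check, the one-line version can be compared against the group column the question asks for. The sketch below rebuilds only the columns it needs. The second data frame, multi, with its vid values "session-a" and "session-b" and its diff values, is invented to show one way the same idea might be applied separately per vid with ave(), which the original answer does not cover.

```r
# Columns needed from the question's data (other columns omitted)
df <- data.frame(
  serial = 1:20,
  diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5,
           3, 358, 4349, 6, 4, 312, 4211, 5, 4, 384)
)

# One-line version from the answer
df$group <- cumsum(df$diff > 500) + 1

# Group column requested in the question: 1,1,1,1,2,2,2,2,...,5,5,5,5
expected_group <- rep(1:5, each = 4)
matches_expected <- all(df$group == expected_group)
print(matches_expected)

# Hypothetical extension: restart the numbering for each vid.
# The vid labels and diff values here are made up.
multi <- data.frame(
  vid = rep(c("session-a", "session-b"), each = 4),
  diff = c(0, 9, 6820, 7, 0, 5, 3, 4293)
)
# ave() applies cumsum within each vid and returns the results
# in the original row order
multi$group <- ave(as.integer(multi$diff > 500), multi$vid, FUN = cumsum) + 1
print(multi$group)
```

The per-vid variant assumes the rows are already ordered in time within each vid, since cumsum depends on row order.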
Hope this helps.