Creating unique groups in sequential data that repeats through time
Something like this has been asked before, but not in any form I could find.
Thread about creating sequential IDs, with several additional links
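Creating identifiers in sequence is not hard, but my data include a time element that has thrown me for a loop. The following is a made-up data set, meant only to illustrate the problem in a tractable way: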
dput(walking_dat)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown",
"Uptown"), class = "factor"), street = structure(c(4L, 3L, 3L,
5L, 3L, 4L, 6L, 7L, 4L, 4L, 1L, 2L, 1L), .Label = c("12thAve",
"14thAve", "Dupont", "Hennepin", "Lyndale", "Marquette", "Nicolette"
), class = "factor"), sequence = c(1, 2, 3, 4, 5, 1, 2, 3, 4,
5, 1, 2, 3), visit = c(1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2)), .Names = c("neighborhood",
"street", "sequence", "visit"), row.names = c(NA, -13L), class = "data.frame")
neighborhood street sequence visit
1 Uptown Hennepin 1 1
2 Uptown Dupont 2 1
3 Uptown Dupont 3 1
4 Uptown Lyndale 4 1
5 Uptown Dupont 5 2
6 Downtown Hennepin 1 1
7 Downtown Marquette 2 1
8 Downtown Nicolette 3 1
9 Downtown Hennepin 4 2
10 Downtown Hennepin 5 2
11 Dinkytown 12thAve 1 1
12 Dinkytown 14thAve 2 1
13 Dinkytown 12thAve 3 2
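To picture it, all of the data come from three people walking east through three neighborhoods of Minneapolis. Each row represents a time at which their location was recorded. The first column is the neighborhood they walked through. The second column is the street (intersection) they were at at each time point. The third column is the order in which the records occurred.
I would like to create the visit column, which records consecutive time points on the same street in the same neighborhood as a single visit and any later return to that street as the next visit. How can I create this kind of sequential identifier?
I was thinking that the trick with ave() and FUN=seq_along might work, but I could not find a way to combine the factors that gets me where I want to be.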
Create a sequential number (counter) for rows within each group of a dataframe [duplicate]
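Update: Uwe's solution works, but it breaks if someone takes all of their measurements at a single intersection, which is what happened when I tried it on my real data. In that case, the final data.table does not have the original number of rows. See what happens here: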
dput(walking_dat_2)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown",
"Uptown"), class = "factor"), street2 = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve",
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1,
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit_2 = c(1, 1, 1, 1,
1, 1, 1, 1, 2, 2, 1, 1, 1)), .Names = c("neighborhood", "street2",
"sequence", "visit_2"), row.names = c(NA, -13L), class = "data.frame")
neighborhood street2 sequence visit_2
1 Uptown Hennepin 1 1
2 Uptown Hennepin 2 1
3 Uptown Hennepin 3 1
4 Uptown Hennepin 4 1
5 Uptown Hennepin 5 1
6 Downtown Hennepin 1 1
7 Downtown Marquette 2 1
8 Downtown Nicolette 3 1
9 Downtown Hennepin 4 2
10 Downtown Hennepin 5 2
11 Dinkytown 12thAve 1 1
12 Dinkytown 12thAve 2 1
13 Dinkytown 12thAve 3 1
In this case, running Uwe's solution returns only 6 rows.
library(data.table)
setDT(walking_dat)[, visit_2 := rleid(neighborhood, street2)][
, unique(.SD, by = "visit_2")][
, visit_2 := rowid(neighborhood, street2)][
walking_dat, on = .(neighborhood, street2, sequence), roll = TRUE, visit_2 := x.visit_2][]
neighborhood street2 sequence visit visit_2
1: Uptown Hennepin 1 1 1
2: Downtown Hennepin 1 2 1
3: Downtown Marquette 2 3 1
4: Downtown Nicolette 3 4 1
5: Downtown Hennepin 4 5 2
6: Dinkytown 12thAve 1 6 1
# Not required, but convenient:
walking_dat$combo <- paste(walking_dat$neighborhood, walking_dat$street)
# TRUE where a row starts a new run of the same neighborhood and street:
new_run <- c(TRUE, walking_dat$combo[-1] != head(walking_dat$combo, -1))
# Place holder:
walking_dat$visit <- NA
# Create it: count the runs of this combination that have started by row i
for (i in seq_len(nrow(walking_dat))) {
  walking_dat$visit[i] <- sum(walking_dat$combo[1:i][new_run[1:i]] == walking_dat$combo[i])
}
walking_dat
neighborhood street sequence visit combo
1 Uptown Hennepin 1 1 Uptown Hennepin
2 Uptown Dupont 2 1 Uptown Dupont
3 Uptown Dupont 3 1 Uptown Dupont
4 Uptown Lyndale 4 1 Uptown Lyndale
5 Uptown Dupont 5 2 Uptown Dupont
6 Downtown Hennepin 1 1 Downtown Hennepin
7 Downtown Marquette 2 1 Downtown Marquette
8 Downtown Nicolette 3 1 Downtown Nicolette
9 Downtown Hennepin 4 2 Downtown Hennepin
10 Downtown Hennepin 5 2 Downtown Hennepin
11 Dinkytown 12thAve 1 1 Dinkytown 12thAve
12 Dinkytown 14thAve 2 1 Dinkytown 14thAve
13 Dinkytown 12thAve 3 2 Dinkytown 12thAve
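The difficulty here is that consecutive records on the same street in the same neighborhood should count as one visit. This requires collapsing those rows into one, numbering the visits to each distinct neighborhood and street, and finally expanding the result back to the original number of rows.
Note that the visit column, which contains the expected result, is not overwritten. It is kept for comparison with the computed visit_new column.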
library(data.table)
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][
, visit_new := rowid(neighborhood, street)][
walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]
neighborhood street sequence visit visit_new
1: Uptown Hennepin 1 1 1
2: Uptown Dupont 2 1 1
3: Uptown Dupont 3 1 1
4: Uptown Lyndale 4 1 1
5: Uptown Dupont 5 2 2
6: Downtown Hennepin 1 1 1
7: Downtown Marquette 2 1 1
8: Downtown Nicolette 3 1 1
9: Downtown Hennepin 4 2 2
10: Downtown Hennepin 5 2 2
11: Dinkytown 12thAve 1 1 1
12: Dinkytown 14thAve 2 1 1
13: Dinkytown 12thAve 3 2 2
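The collapse, count, expand idea described above can also be sketched in base R, which ties back to the ave() and seq_along trick mentioned in the question. This is only an illustrative sketch, not the data.table answer itself: the names walking, key, runs, visit_per_run and visit_base are made up here, and the data values are copied from the question.

```r
# Toy copy of the question's data (character columns instead of factors)
walking <- data.frame(
  neighborhood = rep(c("Uptown", "Downtown", "Dinkytown"), times = c(5, 5, 3)),
  street = c("Hennepin", "Dupont", "Dupont", "Lyndale", "Dupont",
             "Hennepin", "Marquette", "Nicolette", "Hennepin", "Hennepin",
             "12thAve", "14thAve", "12thAve"),
  sequence = c(1:5, 1:5, 1:3),
  stringsAsFactors = FALSE
)

# One key per neighborhood/street combination
key <- paste(walking$neighborhood, walking$street)

# Collapse: rle() finds runs of consecutive identical keys
runs <- rle(key)

# Count: number the runs within each key (1st visit, 2nd visit, ...)
visit_per_run <- ave(seq_along(runs$values), runs$values, FUN = seq_along)

# Expand: repeat each run's visit number for every row of that run
walking$visit_base <- rep(visit_per_run, times = runs$lengths)
```

The sketch assumes the rows are already ordered by neighborhood and sequence, as they are in the question.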
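Step-by-step explanation
walking_dat is coerced to a data.table with setDT(). The rleid() function then assigns a new number each time the neighborhood or the street changes from one row to the next.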
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][]
neighborhood street sequence visit visit_new
1: Uptown Hennepin 1 1 1
2: Uptown Dupont 2 1 2
3: Uptown Dupont 3 1 2
4: Uptown Lyndale 4 1 3
5: Uptown Dupont 5 2 4
6: Downtown Hennepin 1 1 5
7: Downtown Marquette 2 1 6
8: Downtown Nicolette 3 1 7
9: Downtown Hennepin 4 2 8
10: Downtown Hennepin 5 2 8
11: Dinkytown 12thAve 1 1 9
12: Dinkytown 14thAve 2 1 10
13: Dinkytown 12thAve 3 2 11
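Note that rows 2 and 3, as well as rows 9 and 10, share the same number. These duplicates are removed in the next step, which creates a new, temporary data.table object: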
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][]
neighborhood street sequence visit visit_new
1: Uptown Hennepin 1 1 1
2: Uptown Dupont 2 1 2
3: Uptown Lyndale 4 1 3
4: Uptown Dupont 5 2 4
5: Downtown Hennepin 1 1 5
6: Downtown Marquette 2 1 6
7: Downtown Nicolette 3 1 7
8: Downtown Hennepin 4 2 8
9: Dinkytown 12thAve 1 1 9
10: Dinkytown 14thAve 2 1 10
11: Dinkytown 12thAve 3 2 11
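The first two steps (run ids, then one row per run) can be checked in isolation on a small table. This is a minimal sketch with made-up names (toy, run, first_rows) and hypothetical values, not the question's data.

```r
library(data.table)

# Hypothetical toy table, not the question's data
toy <- data.table(
  street   = c("Hennepin", "Dupont", "Dupont", "Lyndale", "Dupont"),
  sequence = 1:5
)

# Step 1: consecutive identical streets share one run id (1 2 2 3 4)
toy[, run := rleid(street)]

# Step 2: keep only the first row of each run; this returns a new object
# and leaves toy itself with all five rows
first_rows <- unique(toy, by = "run")
```

The second Dupont run (sequence 5) keeps its own id because rleid() looks only at consecutive rows, which is what lets a later return count as a separate visit.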
Now we can use the rowid() function to number the visits to each distinct neighborhood and street:
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][
, visit_new := rowid(neighborhood, street)][]
neighborhood street sequence visit visit_new
1: Uptown Hennepin 1 1 1
2: Uptown Dupont 2 1 1
3: Uptown Lyndale 4 1 1
4: Uptown Dupont 5 2 2
5: Downtown Hennepin 1 1 1
6: Downtown Marquette 2 1 1
7: Downtown Nicolette 3 1 1
8: Downtown Hennepin 4 2 2
9: Dinkytown 12thAve 1 1 1
10: Dinkytown 14thAve 2 1 1
11: Dinkytown 12thAve 3 2 2
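For reference, here is what rowid() does on its own, next to a base R counterpart. The vector streets and the result names are hypothetical and only serve as an illustration.

```r
library(data.table)

# Hypothetical vector: one entry per visit, in order of occurrence
streets <- c("Hennepin", "Dupont", "Lyndale", "Dupont")

# rowid() numbers the occurrences of each distinct value: 1 1 1 2
visit_no <- rowid(streets)

# Base R counterpart using the ave() / seq_along idiom
visit_no_base <- ave(seq_along(streets), streets, FUN = seq_along)
```

Because the table has already been reduced to one row per run at this point, counting occurrences is the same as counting visits.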
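Finally, the result has to be expanded to the original number of rows again. This is done by a rolling join of the temporary data.table with the original walking_dat (which includes all rows):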
setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][
, visit_new := rowid(neighborhood, street)][
walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]
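The rolling join in the last step can be seen in isolation in the following sketch. The tables visits and all_rows and their values are made up for illustration. With roll = TRUE, a row of all_rows that has no exact match on sequence takes the values of the most recent earlier row in visits.

```r
library(data.table)

# Collapsed table: one row per visit (hypothetical values)
visits <- data.table(
  street   = c("Dupont", "Dupont"),
  sequence = c(2, 5),
  visit    = c(1L, 2L)
)

# Full table: every recorded time point
all_rows <- data.table(street = "Dupont", sequence = c(2, 3, 5))

# sequence 3 has no exact match, so it rolls forward from sequence 2
expanded <- visits[all_rows, on = .(street, sequence), roll = TRUE]
```

The result has one row per row of all_rows, so the visit number from the collapsed table is carried across every time point of the same run.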
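It is perhaps worth noting that visit_new is reused to hold temporary data at the various stages before the final update.
New data set
The fixed code also works with the second data set provided by the OP: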
walking_dat_2 <-
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown",
"Uptown"), class = "factor"), street = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve",
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1,
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit = c(1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 1, 1, 1), visit_new = c(1L, 1L, 1L, 1L, 1L, 2L,
3L, 4L, 5L, 5L, 6L, 6L, 6L)), .Names = c("neighborhood", "street",
"sequence", "visit", "visit_new"), row.names = c(NA, -13L), class = "data.frame")
setDT(walking_dat_2)[, visit_new := rleid(neighborhood, street)][
, unique(.SD, by = "visit_new")][
, visit_new := rowid(neighborhood, street)][
walking_dat_2, on = .(neighborhood, street, sequence),
roll = TRUE, .SD]
neighborhood street sequence visit visit_new
1: Uptown Hennepin 1 1 1
2: Uptown Hennepin 2 1 1
3: Uptown Hennepin 3 1 1
4: Uptown Hennepin 4 1 1
5: Uptown Hennepin 5 1 1
6: Downtown Hennepin 1 1 1
7: Downtown Marquette 2 1 1
8: Downtown Nicolette 3 1 1
9: Downtown Hennepin 4 2 2
10: Downtown Hennepin 5 2 2
11: Dinkytown 12thAve 1 1 1
12: Dinkytown 12thAve 2 1 1
13: Dinkytown 12thAve 3 1 1
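As an alternative sketch (not the method used in the answer above), the visit number can also be computed without collapsing and joining back, by applying rleid() a second time within each neighborhood and street. The names dt, run and visit_alt are made up here, the data values are copied from the second data set, and the rows are assumed to be ordered by neighborhood and sequence.

```r
library(data.table)

# Values copied from the second data set in the question
dt <- data.table(
  neighborhood = rep(c("Uptown", "Downtown", "Dinkytown"), times = c(5, 5, 3)),
  street = c(rep("Hennepin", 5),
             "Hennepin", "Marquette", "Nicolette", "Hennepin", "Hennepin",
             rep("12thAve", 3)),
  sequence = c(1:5, 1:5, 1:3)
)

# Run id: increases whenever neighborhood or street changes from the previous row
dt[, run := rleid(neighborhood, street)]

# Within each neighborhood/street pair, number the distinct runs in order
dt[, visit_alt := rleid(run), by = .(neighborhood, street)]

# Drop the helper column
dt[, run := NULL]
```

Since both steps update dt by reference, the number of rows cannot change, which sidesteps the 6-row problem described in the update to the question.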