Perform an operation on each matching span of rows in R?
I have a data frame of observations in the following format (my actual data has many more columns, but these are the ones that matter, kept here for clarity):
head(sampleDF, 20)
Timestamp TimeIntoSession CorrelationGuid Position_x Position_z
1 11/22/2017 11:12:30 AM 1234.331 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.166 4.947
2 11/22/2017 11:12:30 AM 1234.397 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.155 4.902
3 11/22/2017 11:12:30 AM 1234.464 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.144 4.858
4 11/22/2017 11:12:30 AM 1234.547 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.133 4.810
5 11/22/2017 11:12:30 AM 1234.614 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.125 4.777
6 11/22/2017 11:12:30 AM 1234.697 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.118 4.741
7 11/22/2017 11:12:30 AM 1234.764 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.114 4.714
8 11/22/2017 11:12:30 AM 1234.847 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.114 4.683
9 11/22/2017 11:12:30 AM 1234.914 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.119 4.661
10 11/22/2017 11:12:30 AM 1234.997 714e8a89-91a5-415b-b102-6ed5c0cf9f44 5.128 4.639
11 11/22/2017 11:12:30 AM 327.341 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.105 4.099
12 11/22/2017 11:12:30 AM 327.480 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.104 4.100
13 11/22/2017 11:12:30 AM 327.557 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.103 4.100
14 11/22/2017 11:12:30 AM 327.640 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.103 4.099
15 11/22/2017 11:12:30 AM 327.723 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.103 4.099
16 11/22/2017 11:12:30 AM 327.807 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.103 4.098
17 11/22/2017 11:12:30 AM 327.890 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.102 4.097
18 11/22/2017 11:12:30 AM 327.957 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.101 4.096
19 11/22/2017 11:12:30 AM 328.040 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.099 4.095
20 11/22/2017 11:12:30 AM 328.123 22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6 3.096 4.094
For each row within a given CorrelationGuid, I want to find the difference between the Euclidean norm of the position defined by the X and Z values in the current row and the Euclidean norm of the previous row.
I can do this for the entire data frame like so:
library(dplyr)

norm_vec <- function(x, y) sqrt(x^2 + y^2)
sampleMag <- mutate(sampleDF,
                    sqMag = norm_vec(Position_x, Position_z) -
                            norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0)))
But that computes the difference across every row; I want to do it within each CorrelationGuid. That is, I don't want the first row of a new CorrelationGuid to look back at the last row of the previous CorrelationGuid when doing the calculation.
I can try it for a single CorrelationGuid like this:
sampleMag <- mutate(sampleDF,
                    sqMag = ifelse(CorrelationGuid == "714e8a89-91a5-415b-b102-6ed5c0cf9f44",
                                   norm_vec(Position_x, Position_z) -
                                   norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0)),
                                   NA))
But that isn't really what I want; I want to do this for every CorrelationGuid, and without all the NAs everywhere except one group.
I can easily generate a list of the unique CorrelationGuid values with unique() or distinct(), but what is the best way to run the logic above once for each unique CorrelationGuid?
I could find the first and last instance of each CorrelationGuid and iterate over that, but a for loop would be very slow here, especially on a large dataset.
apply seems like a good fit, but I'm not sure how best to use it here.
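For reference, the per-group logic described above can also be sketched in base R with split() and lapply(), avoiding an explicit for loop. This is only an illustrative sketch (it uses a small made-up frame in place of the real sampleDF, and a hypothetical helper per_group), not the accepted answer:

```r
norm_vec <- function(x, y) sqrt(x^2 + y^2)

# Toy stand-in for sampleDF, just to make the sketch runnable
sampleDF <- data.frame(
  CorrelationGuid = c("a", "a", "a", "b", "b"),
  Position_x = c(3, 3, 0, 5, 5),
  Position_z = c(4, 4, 1, 12, 12)
)

# Compute the lagged-norm difference within one group; c(0, head(n, -1))
# mimics lag(..., default = 0) restricted to that group
per_group <- function(df) {
  n <- norm_vec(df$Position_x, df$Position_z)
  df$sqMag <- n - c(0, head(n, -1))
  df
}

# Split by CorrelationGuid, apply per group, recombine
pieces <- split(sampleDF, sampleDF$CorrelationGuid)
result <- do.call(rbind, lapply(pieces, per_group))
```

Note that norm_vec is applied element-wise, so the norm of the lagged components equals the lagged norm, which is why the sketch can lag `n` directly.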
Thanks to all the discussion in the comments, I believe what you're looking for is what @jazzurro provided...
sampleDF <- sampleDF %>%
  group_by(CorrelationGuid) %>%
  mutate(sqMag = norm_vec(Position_x, Position_z) -
                 norm_vec(lag(Position_x, default = 0), lag(Position_z, default = 0)))
sampleDF
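One detail worth noting: with default = 0, the first row of each group is compared against the origin, so it gets the raw norm of its own position (the 7.15... and 5.14... values visible in the dput output below). If you would rather mark each group's first row as NA, you can drop the default so lag() returns NA there. A self-contained sketch, using a toy frame in place of the real sampleDF:

```r
library(dplyr)

norm_vec <- function(x, y) sqrt(x^2 + y^2)

# Toy stand-in for sampleDF
sampleDF <- data.frame(
  CorrelationGuid = c("a", "a", "b", "b"),
  Position_x = c(3, 0, 5, 8),
  Position_z = c(4, 1, 12, 15)
)

# Without a default, lag() yields NA on each group's first row,
# so the difference there is NA instead of the raw norm
sampleDF <- sampleDF %>%
  group_by(CorrelationGuid) %>%
  mutate(sqMag = norm_vec(Position_x, Position_z) -
                 norm_vec(lag(Position_x), lag(Position_z)))
```

Which behavior is right depends on whether "distance from the origin" is meaningful for the first sample of a session in your data.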
Use the following dput output to reproduce the data:
structure(list(CorrelationGuid = c("714e8a89-91a5-415b-b102-6ed5c0cf9f44",
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "714e8a89-91a5-415b-b102-6ed5c0cf9f44",
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "714e8a89-91a5-415b-b102-6ed5c0cf9f44",
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "714e8a89-91a5-415b-b102-6ed5c0cf9f44",
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "714e8a89-91a5-415b-b102-6ed5c0cf9f44",
"714e8a89-91a5-415b-b102-6ed5c0cf9f44", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6",
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6",
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6",
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6",
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6", "22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6",
"22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6"), Position_x = c(5.166,
5.155, 5.144, 5.133, 5.125, 5.118, 5.114, 5.114, 5.119, 5.128,
3.105, 3.104, 3.103, 3.103, 3.103, 3.103, 3.102, 3.101, 3.099,
3.096), Position_z = c(4.947, 4.902, 4.858, 4.81, 4.777, 4.741,
4.714, 4.683, 4.661, 4.639, 4.099, 4.1, 4.1, 4.099, 4.099, 4.098,
4.097, 4.096, 4.095, 4.094), sqMag = c(7.15264741197272, -0.0390246359183104,
-0.0382499943559047, -0.0409013019070033, -0.0283774181905558,
-0.0296332826672581, -0.0212618601391981, -0.0209732951217187,
-0.0111423510261082, -0.00812087217042912, 5.1422588421821, 0.000193491092365328,
-0.000603541254226236, -0.000797343143268847, 0, -0.000797272282198946,
-0.00140089984128977, -0.00140089254591658, -0.0020043998541075,
-0.00260744523674017)), .Names = c("CorrelationGuid", "Position_x",
"Position_z", "sqMag"), class = c("grouped_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -20L), vars = "CorrelationGuid", labels = structure(list(
CorrelationGuid = c("22f2f3bd-0750-4ccb-a5fc-e8f8a83d06f6",
"714e8a89-91a5-415b-b102-6ed5c0cf9f44")), class = "data.frame", row.names = c(NA,
-2L), vars = "CorrelationGuid", drop = TRUE, .Names = "CorrelationGuid"), indices = list(
10:19, 0:9), drop = TRUE, group_sizes = c(10L, 10L), biggest_group_size = 10L)