Difference between rows in R on dataframe grouped by column
I'm looking to get the difference in counts between versions, per app_name. My dataset looks like this: app_name, version_id, count, [difference]
Here is the dataset:
data = structure(list(app_name = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), version_id = c(1,
1.1, 2.3, 2, 3.1, 3.3, 4, 1.1, 2.4), count = c(600L, 620L, 620L,
200L, 200L, 250L, 250L, 15L, 36L)), .Names = c("app_name", "version_id",
"count"), class = "data.frame", row.names = c(NA, -9L))
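Given this data.frame, how can I get the lagged difference of count by app_name and version_id? The initial (first) version of each app would have a difference of zero, as there is nothing to compare it against.
Here is an example of the final result, with the 'diff' column added at the end: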
structure(list(app_name = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), version_id = c(1,
1.1, 2.3, 2, 3.1, 3.3, 4, 1.1, 2.4), count = c(600L, 620L, 620L,
200L, 200L, 250L, 250L, 15L, 36L), diff = c(0, 20, 0, 0, 0, 1.25,
0, 0, 2.4)), .Names = c("app_name", "version_id", "count", "diff"
), class = "data.frame", row.names = c(NA, -9L))
Try using dplyr and lag:
library(dplyr)
data %>% group_by(app_name) %>%
  mutate(diffvers = version_id - dplyr::lag(version_id, default = version_id[1]),
         diffcount = count - dplyr::lag(count, default = count[1]))
Source: local data frame [9 x 5]
Groups: app_name [3]

  app_name version_id count diffvers diffcount
    (fctr)      (dbl) (int)    (dbl)     (int)
1        a        1.0   600      0.0         0
2        a        1.1   620      0.1        20
3        a        2.3   620      1.2         0
4        b        2.0   200      0.0         0
5        b        3.1   200      1.1         0
6        b        3.3   250      0.2        50
7        b        4.0   250      0.7         0
8        c        1.1    15      0.0         0
9        c        2.4    36      1.3        21
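The same grouped difference can also be sketched in base R, without loading any package. This is only an illustrative alternative, not part of the answer above: it assumes the rows are already ordered by version_id within each app_name, and the helper name lag_diff is hypothetical.

```r
# Rebuild the example data from the question.
data <- data.frame(
  app_name   = factor(c("a", "a", "a", "b", "b", "b", "b", "c", "c")),
  version_id = c(1, 1.1, 2.3, 2, 3.1, 3.3, 4, 1.1, 2.4),
  count      = c(600L, 620L, 620L, 200L, 200L, 250L, 250L, 15L, 36L)
)

# Hypothetical helper: 0 for the first row of a group, then the
# successive differences for the remaining rows.
lag_diff <- function(x) c(0, diff(x))

# ave() applies the helper separately within each app_name group and
# returns the results in the original row order.
data$diffvers  <- ave(data$version_id, data$app_name, FUN = lag_diff)
data$diffcount <- ave(data$count, data$app_name, FUN = lag_diff)

data$diffcount
# 0 20 0 0 0 50 0 0 21
```

Because c(0, diff(x)) also works for a group with a single row (diff() then returns an empty vector), apps with only one version simply get a difference of 0.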
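We can use data.table. We convert the 'data.frame' to a 'data.table' (setDT(data)), group by 'app_name', loop (lapply(..)) over the columns specified in .SDcols, get the difference between the current element and its lag (shift has type='lag' by default), and assign (:=) the output to create the new columns.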
library(data.table)#v1.9.6
setDT(data)[, c('diffvers', 'diffcount') := lapply(.SD,
    function(x) x - shift(x, fill = x[1L])), by = app_name, .SDcols = 2:3]
data
#   app_name version_id count diffvers diffcount
#1:        a        1.0   600      0.0         0
#2:        a        1.1   620      0.1        20
#3:        a        2.3   620      1.2         0
#4:        b        2.0   200      0.0         0
#5:        b        3.1   200      1.1         0
#6:        b        3.3   250      0.2        50
#7:        b        4.0   250      0.7         0
#8:        c        1.1    15      0.0         0
#9:        c        2.4    36      1.3        21
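To make the x - shift(x, fill = x[1L]) step more concrete, here is a rough base R sketch of what a one-step lag with a fill value does inside a single group. The helper lag_one is hypothetical and only imitates the behaviour described above; it is not data.table's implementation.

```r
# Hypothetical helper imitating a one-step lag: the first position gets
# `fill`, every other position gets the previous element.
lag_one <- function(x, fill) c(fill, head(x, -1))

# Counts for app "b" from the question.
count_b <- c(200L, 200L, 250L, 250L)

# Filling with the first element makes the first difference zero.
lagged <- lag_one(count_b, fill = count_b[1L])
lagged
# 200 200 200 250

count_b - lagged
# 0 0 50 0
```

The result matches the diffcount values for rows 4 to 7 in the output above.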