Sum of a non-uniform subset

Question

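In my project I have a lot of data about a bus company. I split it into subsets by date so that I can see, from a bar plot, which bus lines (in the "Linha" column) are most in demand.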

For example, one subset:

data.date[[1]] is the subset of rows that have the date "2013-03-10".

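To do this, I am trying to sum all the values in the column "Catraca" (turnstile) into a vector, with one sum for each distinct "Linha" (bus line). And I am struggling.

This is the logic I used: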

linha.sum <- with(data.date[[1]], data.date[[1]] == linha.unique, sum(data.date[[1]]$Catraca))

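The output is a set of logical values, which is not what I want.

These pictures may help you understand the situation:

[Image: View(data.date[[1]])]

[Image: the values I want to sum are the "Catraca" values for each distinct "Linha"]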

Data sample:

data.dates <- list(read.table(text = "Linha     DSaida HSaida   DChegada HChegada Sentido Catraca Embarcado
                                          3 2016-01-01  04:05 2016-01-01    04:15       0       0         0
                                          3 2016-01-01  04:23 2016-01-01    23:57       0      37         0
                                          3 2016-01-01  04:05 2016-01-01    04:15       0       0         0
                                          3 2016-01-01  04:22 2016-01-01    23:58       0      83         0
                                          3 2016-01-01  04:04 2016-01-01    04:15       0       0         0
                                          3 2016-01-01  04:23 2016-01-01    23:58       0      43         0
                                          6 2016-01-01  03:49 2016-01-01    13:41       0      82         0
                                          6 2016-01-01  13:43 2016-01-01    23:09       0      98         0
                                          7 2016-01-01  03:54 2016-01-01    14:49       0      61         0
                                          7 2016-01-01  14:51 2016-01-01    23:10       0      46         0", header = TRUE))

Answer 1

Since data.dates appears to be a list of data.frames (possibly created by split()), the sum of a column within each group of each data set can be computed with lapply.
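For readers wondering how such a list might come about, here is a minimal sketch of split() applied to a flat table. The data frame `bus` and the result name `by.date` are hypothetical, made up for illustration; only the column names follow the sample in the question.

```r
# Hypothetical flat table (not the OP's data); column names follow the question.
bus <- data.frame(
  Linha   = c(3, 3, 6, 6),
  DSaida  = c("2016-01-01", "2016-01-02", "2016-01-01", "2016-01-02"),
  Catraca = c(37, 83, 82, 98)
)

# split() returns a list with one data.frame per distinct departure date.
by.date <- split(bus, bus$DSaida)

names(by.date)
# [1] "2016-01-01" "2016-01-02"
```

Each element of the list can then be handled separately, which is why lapply fits here.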

Here is some reproducible data:

data.dates <- list(data.frame(
  Linha = c(3,3,1201,1201), 
  Catraca = c(0,37,2,22)
))

With `dplyr`

library(dplyr)
lapply(data.dates, function(x) {
         x %>% group_by(Linha) %>% summarize(catSum = sum(Catraca))
})
# [[1]]
# # A tibble: 2 x 2
#    Linha         catSum
#    <dbl>          <dbl>
# 1     3             37
# 2  1201             24

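For each data.frame in the list (that is, for each date), this returns a summary table with one row per Linha and a column catSum holding the sum of Catraca for that group.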

With base `R`

Following @Sagar's comment, you can also use aggregate inside lapply:

lapply(data.dates, function(x) {
  aggregate(x$Catraca, by = list(Linha = x$Linha), FUN = sum)
})
# [[1]]
#   Linha  x
# 1     3 37
# 2  1201 24
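In the output above the summed column is named x. As a possible variation (not part of the original answer), the formula interface of aggregate keeps the original column name. A small sketch using the same reproducible data; the name `sums` is made up here:

```r
data.dates <- list(data.frame(
  Linha = c(3, 3, 1201, 1201),
  Catraca = c(0, 37, 2, 22)
))

# Formula interface: sum Catraca within each level of Linha.
sums <- lapply(data.dates, function(x) {
  aggregate(Catraca ~ Linha, data = x, FUN = sum)
})
sums
# [[1]]
#   Linha Catraca
# 1     3      37
# 2  1201      24
```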

Benchmark

In fact, microbenchmark() shows that the base solution is (usually) faster in this case. However, this was only tested with the small subset given in the OP.

# Unit: microseconds
#   expr      min       lq      mean    median        uq      max neval cld
#  dplyr 1803.553 1878.499 1994.4945 1918.8880 2016.8730 6495.747   100   b
#   base  481.535  513.818  543.4041  538.1365  560.4635  803.222   100  a
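The exact call behind these timings is not shown. A sketch of how such a comparison could be set up with the microbenchmark package is below; the labels dplyr and base and times = 100 are chosen to match the table above, the small reproducible data from this answer stands in for the OP's subset, and `res` is a made-up name. Timings will differ from machine to machine.

```r
library(dplyr)
library(microbenchmark)

data.dates <- list(data.frame(
  Linha = c(3, 3, 1201, 1201),
  Catraca = c(0, 37, 2, 22)
))

# Time both approaches on the same input, 100 runs each.
res <- microbenchmark(
  dplyr = lapply(data.dates, function(x) {
    x %>% group_by(Linha) %>% summarize(catSum = sum(Catraca))
  }),
  base = lapply(data.dates, function(x) {
    aggregate(x$Catraca, by = list(Linha = x$Linha), FUN = sum)
  }),
  times = 100
)
print(res)
```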

Answer 2

Your question asks for the sum of "Catraca" for each distinct "Linha".

aggregate(df$Catraca, by = list(Linha = df$Linha), FUN = sum)

will give you that, assuming df is one of your per-date data frames.
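Since the question asks for the sums as a vector, tapply may also be worth a look: it returns the sums named by Linha rather than a data frame. A minimal sketch, assuming df holds the Linha and Catraca columns of the sample in the question; the name linha.sum is borrowed from the question:

```r
# Linha and Catraca values copied from the data sample in the question.
df <- data.frame(
  Linha   = c(3, 3, 3, 3, 3, 3, 6, 6, 7, 7),
  Catraca = c(0, 37, 0, 83, 0, 43, 82, 98, 61, 46)
)

# One sum of Catraca per distinct Linha, named by the Linha value.
linha.sum <- tapply(df$Catraca, df$Linha, sum)
linha.sum
#   3   6   7 
# 163 180 107 
```

The result can be passed directly to barplot(linha.sum) to get one bar per bus line.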


Tags: r, function, subset, subset-sum
