Count new values per date per group

Suppose I have the following dataset:

Date      Group    Value
01-01-19  A        X
01-01-19  A        Y
01-01-19  A        Z
02-01-19  A        X
02-01-19  A        Y
02-01-19  A        Z
02-01-19  A        W
01-01-19  B        X
01-01-19  B        Y
01-01-19  B        Z
02-01-19  B        X
02-01-19  B        X
02-01-19  B        Z
02-01-19  B        V

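So there are two groups and two dates. For each group and each date, I want to see how many values are new, that is, values that did not appear in that group on an earlier date.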

The resulting data frame should look like this:

group    date      new_values
A        01-01-19  3 
A        02-01-19  1
B        01-01-19  3
B        02-01-19  1  

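So far I have only counted the number of values per group per date and taken the difference. But that does not account for values that have disappeared since the previous date. I don't know how to handle this. Maybe the data.table package can help.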

One possibility:

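library(dplyr)

df %>%
  # sort chronologically so the first row of each (Group, Value) is its earliest date
  arrange(as.Date(Date, "%d-%m-%y")) %>%
  group_by(Group, Value) %>%
  # running occurrence number of each value within its group
  mutate(New = row_number()) %>%
  group_by(Group, Date) %>%
  # a value counts as new on the date of its first occurrence
  summarise(New = sum(New == 1))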

Output:

# A tibble: 4 x 3
# Groups:   Group [2]
  Group Date       New
  <fct> <fct>    <int>
1 A     01-01-19     3
2 A     02-01-19     1
3 B     01-01-19     3
4 B     02-01-19     1

The above assumes your dates are in day-month-year format; if they are month-day-year instead, just change "%d-%m-%y" to "%m-%d-%y".
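To see what the format string changes, here is a small base R sketch that parses one of the date strings from the question under both assumptions. The variable names are illustrative only.

```r
# The same string parsed under the two format assumptions.
x <- "02-01-19"

dmy <- as.Date(x, format = "%d-%m-%y")  # read as day-month-year
mdy <- as.Date(x, format = "%m-%d-%y")  # read as month-day-year

format(dmy)  # "2019-01-02"
format(mdy)  # "2019-02-01"
```

For the sample data both readings happen to sort the rows the same way, but in general the wrong format can change the chronological order and therefore which occurrence of a value counts as the first one.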

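Using dplyr, we can first group_by Group and create a column (orig) that is TRUE the first time a value is seen within the group. Then we group_by Group and Date and count these original values. Note that this relies on the rows already being in date order within each group.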

library(dplyr)

df %>%
  group_by(Group) %>%
  mutate(orig = !duplicated(Value)) %>%
  group_by(Group, Date) %>%
  summarise(new_values = sum(orig))

#  Group     Date     new_values
#   <fct> <fct>         <int>
#1   A     01-01-19          3
#2   A     02-01-19          1
#3   B     01-01-19          3
#4   B     02-01-19          1
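The dependence on row order can be checked with a small base R sketch. It uses group A's values from the question as plain vectors (the names value and date are just for illustration) and compares the counts for sorted and reversed input.

```r
# Group A's rows from the question, in date order.
value <- c("X", "Y", "Z", "X", "Y", "Z", "W")
date  <- c(rep("01-01-19", 3), rep("02-01-19", 4))

# !duplicated() flags the first occurrence of each value in row order.
sorted_counts <- tapply(!duplicated(value), date, sum)
sorted_counts[["01-01-19"]]  # 3
sorted_counts[["02-01-19"]]  # 1

# Same rows in reverse order: the later date is now seen first,
# so all of its values are flagged as first occurrences.
reversed_counts <- tapply(!duplicated(rev(value)), rev(date), sum)
reversed_counts[["01-01-19"]]  # 0
reversed_counts[["02-01-19"]]  # 4
```

If the data might not be sorted, arranging by the parsed date before calling duplicated avoids the second result.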

The rowid function counts the occurrences of each combination of column values, starting from 1:

library(data.table)
setDT(DT)

DT[, new := rowid(Group, Value) == 1L]
DT[, .(n_new = sum(new)), by=.(Group, Date)]
#    Group     Date n_new
# 1:     A 01-01-19     3
# 2:     A 02-01-19     1
# 3:     B 01-01-19     3
# 4:     B 02-01-19     1
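As a rough illustration of what rowid returns, here is a sketch on two short hypothetical vectors (grp and val are made up for this example and are not the question's data).

```r
library(data.table)

# Hypothetical toy vectors.
grp <- c("A", "A", "A", "B", "B")
val <- c("X", "Y", "X", "X", "X")

# Running count of each (grp, val) combination, in row order.
occ <- rowid(grp, val)
occ        # 1 1 2 1 2

# A row is a first occurrence exactly when its running count is 1.
is_new <- occ == 1L
is_new     # TRUE TRUE FALSE TRUE FALSE
```

Like duplicated, rowid works in row order, so the same sorting assumption applies.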

library(data.table)

dt <- data.table(read.table(text="
01-01-19,A,X
01-01-19,A,Y
01-01-19,A,Z
02-01-19,A,X
02-01-19,A,Y
02-01-19,A,Z
02-01-19,A,W
01-01-19,B,X
01-01-19,B,Y
01-01-19,B,Z
02-01-19,B,X
02-01-19,B,X
02-01-19,B,Z
02-01-19,B,V
",sep=",",strip.white = TRUE))

setnames(dt,c("date","group","value"))

One solution is to flag the first occurrence of each value within a group, then sum those flags by group and date.

dt[, dup := !duplicated(value), by = .(group)][, sum(dup), by = .(group, date)]
##    group     date V1
## 1:     A 01-01-19  3
## 2:     A 02-01-19  1
## 3:     B 01-01-19  3
## 4:     B 02-01-19  1
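If the rows are not guaranteed to be in date order, a possible variant is to compute the earliest date of each (group, value) pair and count pairs per date. This is only a sketch under that assumption; dt2, d, first_seen and res are hypothetical names, and the toy rows are deliberately unsorted.

```r
library(data.table)

# Hypothetical unsorted toy data: the later date comes first.
dt2 <- data.table(
  date  = c("02-01-19", "02-01-19", "01-01-19", "01-01-19"),
  group = "A",
  value = c("X", "W", "X", "Y")
)

# Parse the dates (assumed day-month-year) so they compare chronologically.
dt2[, d := as.Date(date, format = "%d-%m-%y")]

# Earliest date on which each value appears in its group.
first_seen <- dt2[, .(d = min(d)), by = .(group, value)]

# Number of values whose first appearance falls on each date.
res <- first_seen[, .(new_values = .N), by = .(group, d)][order(group, d)]
res$new_values  # 2 1
```

One difference from the sum-of-flags approach: a date on which no value is new does not appear in res at all, rather than appearing with a count of 0.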