Count new values per date per group
Suppose I have the following dataset:
Date Group Value
01-01-19 A X
01-01-19 A Y
01-01-19 A Z
02-01-19 A X
02-01-19 A Y
02-01-19 A Z
02-01-19 A W
01-01-19 B X
01-01-19 B Y
01-01-19 B Z
02-01-19 B X
02-01-19 B X
02-01-19 B Z
02-01-19 B V
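So there are two groups and two dates. For each group and each date, I want to see how many values are new, i.e. have not appeared in that group on an earlier date.
The resulting data frame should look like this: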
group date new_values
A 01-01-19 3
A 02-01-19 1
B 01-01-19 3
B 02-01-19 1
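So far I have only counted the number of values per group per date and taken the difference. But that does not account for values that have disappeared since the previous date. I don't know how to do this properly. Maybe the data.table package can bring relief?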
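To make the pitfall concrete, here is a minimal base R sketch of one reading of the attempt described above (differencing the number of distinct values per date). The data frame name `df` and the helper names `distinct_counts` and `naive_new` are only illustrative.

```r
# Rebuild the example data from the question as a data frame named `df`.
df <- data.frame(
  Date  = rep(c("01-01-19", "02-01-19", "01-01-19", "02-01-19"), times = c(3, 4, 3, 4)),
  Group = rep(c("A", "B"), each = 7),
  Value = c("X", "Y", "Z", "X", "Y", "Z", "W",
            "X", "Y", "Z", "X", "X", "Z", "V"),
  stringsAsFactors = FALSE
)

# Naive approach: count distinct values per group and date, then difference.
distinct_counts <- tapply(df$Value, list(df$Group, df$Date),
                          function(v) length(unique(v)))
naive_new <- distinct_counts[, "02-01-19"] - distinct_counts[, "01-01-19"]
naive_new
# A B
# 1 0
```

Group B gets 0 here even though V is new on 02-01-19, because Y disappeared on the same date and the two changes cancel out.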
One possibility:
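library(dplyr)
df %>%
  # sort chronologically so the first row of each (Group, Value) pair is its earliest date
  arrange(as.Date(Date, format = "%d-%m-%y")) %>%
  group_by(Group, Value) %>%
  # number the occurrences of each value within its group
  mutate(New = row_number()) %>%
  group_by(Group, Date) %>%
  # a value is new on the date of its first occurrence
  summarise(New = sum(New == 1))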
Output:
# A tibble: 4 x 3
# Groups: Group [2]
Group Date New
<fct> <fct> <int>
1 A 01-01-19 3
2 A 02-01-19 1
3 B 01-01-19 3
4 B 02-01-19 1
The above assumes your dates are in day-month-year format; if that is not the case, simply change "%d-%m-%y" to "%m-%d-%y".
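As a quick illustration of why the format string matters, the same string parses to two different dates depending on which format is assumed. This is only a sketch; the variable names `as_dmy` and `as_mdy` are made up for the example.

```r
# The same string read as day-month-year versus month-day-year.
as_dmy <- as.Date("02-01-19", format = "%d-%m-%y")
as_mdy <- as.Date("02-01-19", format = "%m-%d-%y")

format(as_dmy)  # "2019-01-02" (2 January 2019)
format(as_mdy)  # "2019-02-01" (1 February 2019)
```

With the wrong format the rows can end up sorted in the wrong order, and values would then be flagged as new on the wrong date.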
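Using dplyr, we can first group_by Group and create a column (orig) that is TRUE the first time a value is seen within its group. Then we group_by Group and Date and count these original values. Note that duplicated() flags the first occurrence in row order, so this relies on the rows already being sorted by date.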
library(dplyr)
df %>%
  group_by(Group) %>%
  mutate(orig = !duplicated(Value)) %>%
  group_by(Group, Date) %>%
  summarise(new_values = sum(orig))
# Group Date new_values
# <fct> <fct> <int>
#1 A 01-01-19 3
#2 A 02-01-19 1
#3 B 01-01-19 3
#4 B 02-01-19 1
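If the rows are not guaranteed to be in date order, sorting first is probably the safer route. Below is a rough base R sketch of the same idea, assuming day-month-year dates; the reversed row order is only there to simulate unsorted input, and the name `result` is illustrative.

```r
# Example data from the question, with rows reversed to simulate unsorted input.
df <- data.frame(
  Date  = rep(c("01-01-19", "02-01-19", "01-01-19", "02-01-19"), times = c(3, 4, 3, 4)),
  Group = rep(c("A", "B"), each = 7),
  Value = c("X", "Y", "Z", "X", "Y", "Z", "W",
            "X", "Y", "Z", "X", "X", "Z", "V"),
  stringsAsFactors = FALSE
)
df <- df[nrow(df):1, ]

# Sort chronologically so "first occurrence" means "earliest date".
df <- df[order(as.Date(df$Date, format = "%d-%m-%y")), ]

# TRUE for the first row of each (Group, Value) pair, FALSE for repeats.
df$orig <- !duplicated(df[c("Group", "Value")])

# Count the first occurrences per group and date.
result <- aggregate(orig ~ Group + Date, data = df, FUN = sum)
result
```

On the example data this gives 3 new values on 01-01-19 and 1 on 02-01-19 for both groups. Without the sorting step, the reversed input would attribute all of group A's values to 02-01-19.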
The rowid function numbers the occurrences of each combination of column values, starting from 1:
library(data.table)
setDT(DT)
DT[, new := rowid(Group, Value) == 1L]
DT[, .(n_new = sum(new)), by=.(Group, Date)]
# Group Date n_new
# 1: A 01-01-19 3
# 2: A 02-01-19 1
# 3: B 01-01-19 3
# 4: B 02-01-19 1
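To see what rowid does on its own, here is a small sketch on two toy vectors (`grp` and `val` are made-up names, not columns from the question):

```r
library(data.table)

grp <- c("A", "A", "A", "B", "B")
val <- c("X", "Y", "X", "X", "X")

# Each (grp, val) combination gets its own running counter.
ids <- rowid(grp, val)
ids
# [1] 1 1 2 1 2
```

Comparing the result with 1L therefore picks out the first occurrence of each combination, which is what the `new` column above relies on.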
library(data.table)
dt <- data.table(read.table(text="
01-01-19,A,X
01-01-19,A,Y
01-01-19,A,Z
02-01-19,A,X
02-01-19,A,Y
02-01-19,A,Z
02-01-19,A,W
01-01-19,B,X
01-01-19,B,Y
01-01-19,B,Z
02-01-19,B,X
02-01-19,B,X
02-01-19,B,Z
02-01-19,B,V
",sep=",",strip.white = TRUE))
setnames(dt,c("date","group","value"))
One solution is to flag the first occurrence of each value within its group, then sum those flags by group and date.
## > dt[,dup:=!duplicated(value),.(group)][,sum(dup),.(group,date)]
## group date V1
## 1: A 01-01-19 3
## 2: A 02-01-19 1
## 3: B 01-01-19 3
## 4: B 02-01-19 1