Count unique values per month in R

I have a dataset containing records of dead birds reported by field observers.

Death.Date Observer Species Bird.ID
1 03/08/2021       DA      MF FC10682
2 15/08/2021       AG      MF FC10698
3 12/01/2022       DA      MF FC20957
4 09/02/2022       DA      MF FC10708

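I would like to generate from it a dataset with the number of unique Bird.ID values per month, so that I can produce a chart from it. ("Unique" because some people make mistakes and sometimes enter the same bird twice.)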

The output in this case would be:

Month Number of dead 
08/2021 2
01/2022 1
02/2022 1

My idea was to use the distinct function, but by month (knowing that the values are dates in dd/mm/yyyy format).

A possible solution, based on tidyverse, lubridate and zoo::as.yearmon:

library(tidyverse)
library(lubridate)
library(zoo)

df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Observer = c("DA", "AG", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)

df %>% 
  group_by(date = as.yearmon(dmy(Death.Date))) %>% 
  summarise(nDead = n_distinct(Bird.ID), .groups = "drop")

#> # A tibble: 3 x 2
#>   date      nDead
#>   <yearmon> <int>
#> 1 Aug 2021      2
#> 2 Jan 2022      1
#> 3 Feb 2022      1
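
The sample data above has no repeated bird, so the effect of n_distinct is not visible in that output. The following is a minimal sketch, assuming a hypothetical data frame df_dup in which FC10682 has been entered twice, and using lubridate::floor_date instead of zoo::as.yearmon (a substitution made here only to avoid the extra package):

```r
library(dplyr)
library(lubridate)

# Hypothetical data: bird FC10682 appears twice (same date, same ID)
df_dup <- data.frame(
  Death.Date = c("03/08/2021", "15/08/2021", "12/01/2022", "09/02/2022", "03/08/2021"),
  Bird.ID    = c("FC10682", "FC10698", "FC20957", "FC10708", "FC10682")
)

monthly <- df_dup %>%
  # Parse dd/mm/yyyy strings and round each date down to the first of its month
  mutate(month_start = floor_date(dmy(Death.Date), "month")) %>%
  group_by(month_start) %>%
  # n_distinct() counts each Bird.ID once per month, so the duplicate is ignored
  summarise(nDead = n_distinct(Bird.ID), .groups = "drop")

monthly
```

Because month_start is a real Date column, the rows come out in chronological order, which is convenient when the result is passed on to a plotting function.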

A data.table solution:

library(data.table)
library(lubridate)

# Reproducible example with a duplicated bird
deadbirds <- data.table::data.table(Death.Date = c("03/08/2021", "15/08/2021", "12/01/2022", "09/02/2022", "03/08/2021"),
                                    Observer   = c("DA", "AG", "DA", "DA", "DA"),
                                    Species    = c("MF", "MF", "MF" , "MF", "MF"),
                                    Bird.ID    = c("FC10682", "FC10698", "FC20957", "FC10708", "FC10682"))

# Clean dataset, option 1: delete all duplicated rows
deadbirds <- base::unique(deadbirds)

# Clean dataset, option 2: keep only the first row per bird (useful when duplicated records differ in columns that do not matter)
deadbirds <- deadbirds[
  j = .SD[1],
  by = c("Bird.ID")
]

# Death.Date as date
deadbirds <- deadbirds[
  j = Death.Date := lubridate::dmy(Death.Date)
]

# Create month.Death.Date
deadbirds <- deadbirds[
  j = month.Death.Date := base::paste0(lubridate::month(Death.Date),
                                       "/",
                                       lubridate::year(Death.Date))
]

# Count by month (returns one row per month instead of adding a column to every row)
deadbirds[
  j  = .(`Number of dead` = .N),
  by = month.Death.Date]
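
The steps above can probably be condensed. This is a minimal sketch, assuming a hypothetical table named birds with one repeated record, that parses the dates with base as.Date and counts distinct IDs with data.table::uniqueN, so no separate de-duplication pass is needed:

```r
library(data.table)

# Hypothetical data: bird FC10682 appears twice
birds <- data.table(
  Death.Date = c("03/08/2021", "15/08/2021", "12/01/2022", "09/02/2022", "03/08/2021"),
  Bird.ID    = c("FC10682", "FC10698", "FC20957", "FC10708", "FC10682")
)

# Convert the dd/mm/yyyy strings to Date in place
birds[, Death.Date := as.Date(Death.Date, format = "%d/%m/%Y")]

# One row per month; uniqueN() counts each Bird.ID only once within a month
monthly <- birds[
  ,
  .(`Number of dead` = uniqueN(Bird.ID)),
  by = .(Month = format(Death.Date, "%m/%Y"))
]

monthly
```

Groups are returned in order of first appearance, so the result follows the row order of the input rather than calendar order.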

  1. If your date column is of character type, first convert it to Date type with dmy
  2. Change the format to month/year
  3. group_by and summarise


library(dplyr)
library(lubridate) # in case your Date is in character format

df %>% 
  mutate(Death.Date = dmy(Death.Date)) %>% # you may not need this line
  mutate(Month = format(as.Date(Death.Date), "%m/%Y")) %>%
  group_by(Month) %>% 
  summarise(`Number of dead` = n_distinct(Bird.ID)) # n_distinct() so a bird entered twice is counted once
 Month   `Number of dead`
  <chr>              <int>
1 01/2022                1
2 02/2022                1
3 08/2021                2

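In base R you can use table (note that it counts rows, so drop duplicated records first, for example with unique(df), if a bird may have been entered twice):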

as.data.frame(table(format(as.Date(df$Death.Date,'%d/%m/%Y'), '%m/%Y')))

#       Var1 Freq
# 1 01/2022    1
# 2 02/2022    1
# 3 08/2021    2

Data:

df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Observer = c("DA", "AG", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)
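
In the output above the months are sorted as text, so 01/2022 comes before 08/2021. For a chart you may prefer calendar order. This is a minimal base R sketch under that assumption; the names d, month_levels and counts are introduced here only for illustration:

```r
df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Bird.ID    = c("FC10682", "FC10698", "FC20957", "FC10708")
)

# Parse the dd/mm/yyyy strings into Date objects
d <- as.Date(df$Death.Date, format = "%d/%m/%Y")

# Build the month labels in chronological order by sorting the dates first
month_levels <- unique(format(sort(d), "%m/%Y"))

# A factor with explicit levels makes table() keep that order
counts <- as.data.frame(
  table(Month = factor(format(d, "%m/%Y"), levels = month_levels))
)

counts
```

The Month column stays a factor with ordered levels, so plotting functions that respect factor levels should draw the months in calendar order.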

For completeness, this can also be done with aggregate, without any additional packages:

df <- data.frame(
  Death.Date = c("3/8/2021", "15/08/2021", "12/1/2022", "9/2/2022"),
  Observer = c("DA", "AG", "DA", "DA"),
  Species = c("MF", "MF", "MF", "MF"),
  Bird.ID = c("FC10682", "FC10698", "FC20957", "FC10708")
)


aggregate.data.frame(
  x = df["Bird.ID"],
  by = list(death_month = format(as.Date(df$Death.Date, "%d/%m/%Y"), "%m/%Y")),
  FUN = function(x) {length(unique(x))}
)

Notes

  • The anonymous function function(x) {length(unique(x))} provides the count of unique values
  • The format(as.Date(df$Death.Date, "%d/%m/%Y"), "%m/%Y") call ensures that a month/Year string is provided
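
To see the unique count at work, here is a small sketch that reuses the same aggregate call on a hypothetical data frame df_dup in which one bird has been entered twice:

```r
# Hypothetical data: bird FC10682 appears twice
df_dup <- data.frame(
  Death.Date = c("03/08/2021", "15/08/2021", "12/01/2022", "09/02/2022", "03/08/2021"),
  Bird.ID    = c("FC10682", "FC10698", "FC20957", "FC10708", "FC10682")
)

res <- aggregate(
  x   = df_dup["Bird.ID"],
  by  = list(death_month = format(as.Date(df_dup$Death.Date, format = "%d/%m/%Y"), "%m/%Y")),
  # Count each Bird.ID once per month
  FUN = function(x) length(unique(x))
)

res
```

August 2021 contains three rows but only two distinct IDs, so the Bird.ID column of res should report 2 for 08/2021.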