In R, how do I count the number of unique occurrences within a year when values are repeated?
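For each year in my data frame, I want to calculate the percentage of birds with face.data == "yes" out of the total number of birds observed that year. One problem is that I have multiple observations of the same bird within the same year.
Here is my dataset: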
df <- data.frame(
  bird.ID = c(001, 001, 001, 002, 002, 002, 006, 006, 007, 007, 007, 007),
  # dates must be quoted and converted; unquoted 2010-04-09 is evaluated as arithmetic
  date = as.Date(c("2010-04-09", "2013-04-14", "2013-09-14", "2013-05-08", "2013-06-08", "2013-08-08", "2013-04-08", "2013-06-08", "2014-06-08", "2016-06-08", "2017-06-08", "2017-08-08")),
  face.data = c("yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "no")
)
To get the number of "yes" values per year, I tried:
aggregate(face.data=="yes" ~ cut(date, "1 year"), data = df, sum)
However, this counts every row with "yes", even when several rows belong to the same bird.
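To make the over-counting concrete, here is a minimal base R sketch for 2013 only. It assumes the dates are stored as Date values (quoted strings passed through as.Date()); the names in_2013, yes_rows and yes_birds are illustrative only.

```r
# Sample data from the question, with dates as Date values (an assumption)
df <- data.frame(
  bird.ID = c(1, 1, 1, 2, 2, 2, 6, 6, 7, 7, 7, 7),
  date = as.Date(c("2010-04-09", "2013-04-14", "2013-09-14", "2013-05-08",
                   "2013-06-08", "2013-08-08", "2013-04-08", "2013-06-08",
                   "2014-06-08", "2016-06-08", "2017-06-08", "2017-08-08")),
  face.data = c("yes", "yes", "no", "yes", "yes", "no",
                "yes", "yes", "no", "yes", "yes", "no")
)

in_2013 <- format(df$date, "%Y") == "2013"

# Counting rows: every "yes" observation in 2013 is counted
yes_rows <- sum(df$face.data[in_2013] == "yes")

# Counting birds: each bird.ID with at least one "yes" in 2013 is counted once
yes_birds <- length(unique(df$bird.ID[in_2013 & df$face.data == "yes"]))

yes_rows   # 5
yes_birds  # 3
```

The gap between the two numbers comes from birds 2 and 6, which each have two "yes" observations in 2013.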
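Ideally, the final result would be a data frame with three columns: (i) the year (e.g. 2013); (ii) the total number of unique bird.IDs observed that year; (iii) the number of unique bird.IDs observed with face.data == "yes" that year.
Like this: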
  year number of bird.ID number of face.data
1 2013                10                   3
2 2014                15                   6
3 2015                20                   9
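Using data.table:
library(data.table)

dt <- data.table(df)
# De-duplicate (bird.ID, year, face.data) combinations first, then count per year
unique(dt[, .(bird.ID, year = year(date), face.data)])[
  , .(`number of bird.ID` = length(unique(bird.ID)),
      `number of face.data` = sum(face.data == "yes")),
  by = .(year)]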
   year number of bird.ID number of face.data
1: 2010                 1                   1
2: 2013                 3                   3
3: 2014                 1                   0
4: 2016                 1                   1
5: 2017                 1                   1
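The same de-duplicate-then-count idea can be sketched in base R, for readers without data.table. This is only an illustration under the assumption that date is a Date column; dedup, is_yes and yes_per_year are names I made up for the sketch.

```r
# Sample data from the question, with dates as Date values (an assumption)
df <- data.frame(
  bird.ID = c(1, 1, 1, 2, 2, 2, 6, 6, 7, 7, 7, 7),
  date = as.Date(c("2010-04-09", "2013-04-14", "2013-09-14", "2013-05-08",
                   "2013-06-08", "2013-08-08", "2013-04-08", "2013-06-08",
                   "2014-06-08", "2016-06-08", "2017-06-08", "2017-08-08")),
  face.data = c("yes", "yes", "no", "yes", "yes", "no",
                "yes", "yes", "no", "yes", "yes", "no")
)
df$year <- as.integer(format(df$date, "%Y"))

# Keep one row per (bird, year, face.data) combination, so a bird with
# several "yes" rows in the same year contributes a single "yes" row
dedup <- unique(df[, c("bird.ID", "year", "face.data")])

# After de-duplication, summing the "yes" rows per year counts birds, not rows
dedup$is_yes <- dedup$face.data == "yes"
yes_per_year <- aggregate(is_yes ~ year, data = dedup, FUN = sum)
yes_per_year
```

A bird seen with both "yes" and "no" in one year keeps two rows after unique(), but only its "yes" row adds to the sum, so it is still counted once.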
You can solve this quickly with a small function:
yes_prop <- function(x)
{
  number_of_bird.ID <- length(unique(x$bird.ID))  # number of unique bird.IDs
  number_of_face.data <- length(unique(x$bird.ID[x$face.data == "yes"]))  # number of unique bird.IDs with "yes"
  data.frame(number_of_bird.ID, number_of_face.data)
}
For a data frame with the dates simplified to years:
df <- data.frame(
  bird.ID = c(001, 001, 001, 002, 002, 002, 006, 006, 007, 007, 007, 007),
  date = c(2010, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2014, 2016, 2017, 2017),
  face.data = c("yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "no")
)
do.call(rbind, by(df, df$date, yes_prop))  # apply the function year by year
In any case, I'm sure other users can offer a smarter solution.
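Since the question ultimately asks for a percentage, here is a hedged base R sketch that goes one step further and divides the two counts. It reuses the simplified year-only data frame; by_year, res, n_birds, n_yes and pct_yes are hypothetical names chosen for this example.

```r
# Simplified data: the date column already holds the year
df <- data.frame(
  bird.ID = c(1, 1, 1, 2, 2, 2, 6, 6, 7, 7, 7, 7),
  date = c(2010, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2014, 2016, 2017, 2017),
  face.data = c("yes", "yes", "no", "yes", "yes", "no",
                "yes", "yes", "no", "yes", "yes", "no")
)

# One sub-data-frame per year
by_year <- split(df, df$date)

res <- data.frame(
  year = as.integer(names(by_year)),
  # unique birds observed in the year
  n_birds = unname(vapply(by_year, function(x) length(unique(x$bird.ID)), integer(1))),
  # unique birds with at least one "yes" in the year
  n_yes = unname(vapply(by_year, function(x) length(unique(x$bird.ID[x$face.data == "yes"])), integer(1)))
)

# Percentage of observed birds that have face data
res$pct_yes <- 100 * res$n_yes / res$n_birds
res
```

Years in which no bird has a "yes" (2014 here) come out as 0 rather than being dropped, because the counting is done inside every yearly group.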
A dplyr solution:
library(dplyr)
library(lubridate)

df %>%
  mutate(date = ymd(date),
         Year = year(date)) %>%
  group_by(Year) %>%
  summarise(total_birds = length(unique(bird.ID)),
            yes_birds = length(unique(bird.ID[face.data == 'yes'])))
Output:
# A tibble: 5 x 3
   Year total_birds yes_birds
  <dbl>       <int>     <int>
1  2010           1         1
2  2013           3         3
3  2014           1         0
4  2016           1         1
5  2017           1         1
Or with n_distinct():
df %>%
  mutate(date = ymd(date),
         Year = year(date)) %>%
  group_by(Year) %>%
  summarise(total_birds = n_distinct(bird.ID),
            yes_birds = n_distinct(bird.ID[face.data == 'yes']))
Compute the corresponding lengths in a by() approach.
First, some fresh sample data.
# bird.ID date face.data
# 1 4 2008-01-24 no
# 2 5 2008-05-25 no
# 3 4 2008-07-15 no
# 4 2 2008-08-13 yes
# 5 1 2008-09-15 no
# 6 2 2008-10-25 yes
# 7 1 2008-11-09 yes
# 8 2 2009-02-09 no
# 9 2 2009-04-25 yes
# 10 2 2009-05-18 yes
# 11 5 2009-09-12 no
# 12 4 2009-09-17 no
# 13 1 2009-12-27 yes
# 14 4 2010-04-15 no
# 15 1 2010-05-09 no
# 16 3 2010-07-10 yes
# 17 1 2010-08-02 no
# 18 1 2010-09-08 no
# 19 3 2010-09-10 yes
# 20 1 2010-09-23 no
by(dat, cut(dat$date, "1 year"), \(x)
   with(x, c(year = as.integer(strftime(date[[1]], '%Y')),
             `number of bird.ID` = length(unique(bird.ID)),
             `number of face.data` = length(unique(bird.ID[face.data == 'yes']))))) |>
  do.call(what = rbind) |> `rownames<-`(NULL) |> as.data.frame()
# year number of bird.ID number of face.data
# 1 2008 4 2
# 2 2009 4 2
# 3 2010 3 1
Data:
n <- 20
set.seed(42)
dat <- data.frame(bird.ID = sample(1:5, n, replace = TRUE),
                  date = sample(seq.Date(as.Date('2008-01-01'), as.Date('2011-01-01'), 'day'), n, replace = TRUE),
                  face.data = sample(c('yes', 'no'), n, replace = TRUE))