Compute cumulative occurrences of a value (string) until a new string appears, with consecutive years

I have data like this:

dataset <- data.frame(year = c(2001, 2002, 2003, 2005, 2006))
dataset$firm <- c("A", "A", "B","B","B" )

I want to count the number of consecutive years each firm has appeared in the dataset. The expected result looks like this:

dataset <- data.frame(year = c(2001, 2002, 2003, 2005, 2006))
dataset$firm <- c("A", "A", "B","B","B" )
dataset$tenure <- c(1,2,1,1,2)

How can I compute the tenure variable here? Many thanks.

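You can do this with the tidyverse. Within each firm, it checks whether each year is exactly one more than the previous row's year and takes the cumulative sum of that logical result, counting a firm's first row as 1. On the original five rows this matches the expected result. Note that a gap contributes 0 rather than restarting the count, so in the extended data below the 2010 row stays at 4 instead of going back to 1.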

library(dplyr)
library(tidyr)

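dataset %>%
  group_by(firm) %>%
  # 1 if this year directly follows the previous row's year, else 0 (NA on a firm's first row)
  mutate(tenure = (year - 1 == lag(year)) * 1,
         # a firm's first row counts as year 1
         tenure = replace_na(tenure, 1),
         # running total of the flags within each firm
         tenure = cumsum(tenure)) %>%
  ungroup()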
# A tibble: 9 × 3
   year firm  tenure
  <dbl> <chr>  <dbl>
1  2001 A          1
2  2002 A          2
3  2003 B          1
4  2005 B          1
5  2006 B          2
6  2007 B          3
7  2008 B          4
8  2010 B          4
9  2011 B          5

Extended data (used for the 9-row output above)

dataset <- structure(list(year = c(2001, 2002, 2003, 2005, 2006, 2007, 2008,
2010, 2011), firm = c("A", "A", "B", "B", "B", "B", "B", "B",
"B")), row.names = c(NA, -9L), class = "data.frame")
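
To see how the flag-and-cumsum step behaves, here is a minimal base R sketch that traces it by hand for firm B in the extended data. The names `prev`, `flag` and `tenure` are only for illustration, and `prev` is assumed to stand in for what `lag(year)` returns inside the group.

```r
# Years for firm B in the extended data
year <- c(2003, 2005, 2006, 2007, 2008, 2010, 2011)

# Previous row's year (NA for the first row), standing in for dplyr::lag(year)
prev <- c(NA, head(year, -1))

# 1 where the year directly follows the previous one, 0 after a gap, NA on the first row
flag <- (year - 1 == prev) * 1

# Count the first row as year 1
flag[is.na(flag)] <- 1

# Running total of the flags
tenure <- cumsum(flag)
tenure
# 1 1 2 3 4 4 5
```

The flags are 1 0 1 1 1 0 1, so the gap before 2010 adds nothing but does not restart the count either. If the count should restart at 1 after every gap, the runs have to be identified first and numbered separately.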

You can use:

library(dplyr)

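dataset %>%
  arrange(firm, year) %>%
  # temp is a run id that increases whenever the gap to the previous row exceeds one year
  group_by(firm, temp = cumsum(c(TRUE, diff(year) > 1))) %>%
  # number the rows within each firm/run combination
  mutate(tenure = row_number()) %>%
  ungroup() %>%
  select(-temp)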

#   year firm  tenure
#  <dbl> <chr>  <int>
#1  2001 A          1
#2  2002 A          2
#3  2003 B          1
#4  2005 B          1
#5  2006 B          2
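
If dplyr is not available, the same run-based idea can be sketched in base R. This is an illustrative sketch and not part of the answer above: `new_run` and `run_id` are names made up here, and it is run on the 9-row extended data so that the restart after the gap before 2010 is visible.

```r
# Extended data: firm B has gaps before 2005 and before 2010
dataset <- data.frame(
  year = c(2001, 2002, 2003, 2005, 2006, 2007, 2008, 2010, 2011),
  firm = c("A", "A", "B", "B", "B", "B", "B", "B", "B")
)

# Sort so that each firm's years are adjacent and increasing
dataset <- dataset[order(dataset$firm, dataset$year), ]
n <- nrow(dataset)

# A new run starts on the first row, when the firm changes,
# or when the year is not exactly one more than the previous row's year
new_run <- c(TRUE, (diff(dataset$year) != 1) | (dataset$firm[-1] != dataset$firm[-n]))
run_id <- cumsum(new_run)

# Number the rows within each run
dataset$tenure <- ave(seq_len(n), run_id, FUN = seq_along)
dataset$tenure
# 1 2 1 1 2 3 4 1 2
```

Here the count goes back to 1 at 2005 and again at 2010, which is the behaviour the question's expected output implies for a gap.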