Variable creation - Inferring age

I have a grouped data frame:

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000,3000,700,800,900,1000,20000,20000,30000,30000)
DF <- data.frame(Truck, OilChanged, Odometer)

# Truck OilChanged Odometer
# 1      A       True     1000
# 2      A     NewOil     1000
# 3      A      False     2000
# 4      A      False     3000
# 5      B      False      700
# 6      B      False      800
# 7      B      False      900
# 8      B      False     1000
# 9      C       True    20000
# 10     C     NewOil    20000
# 11     C       True    30000
# 12     C     NewOil    30000

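I am trying to infer the age of the oil (in kilometres) wherever possible. An inference can only be made after the oil has been changed. If the oil is never changed, its age remains a mystery (e.g. Truck B).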

Below is the desired result:

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000, 3000,700,800,900,1000,20000,20000,30000,30000)
OilAge <- c(NA,0,1000,2000,NA,NA,NA,NA,NA,0,10000,0)
Result <- data.frame(Truck, OilChanged, Odometer, OilAge)


# Truck OilChanged Odometer OilAge
# 1      A       True     1000     NA
# 2      A     NewOil     1000      0
# 3      A      False     2000   1000
# 4      A      False     3000   2000
# 5      B      False      700     NA
# 6      B      False      800     NA
# 7      B      False      900     NA
# 8      B      False     1000     NA
# 9      C       True    20000     NA
# 10     C     NewOil    20000      0
# 11     C       True    30000  10000
# 12     C     NewOil    30000      0

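Note: the odometer reading will always be identical between a True OilChanged row and the subsequent NewOil row, because the oil sample is taken directly before the oil change. Both rows must be retained, however, so that downstream calculations, such as rate-of-change formulas, work correctly.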

An NA in the OilAge column means the age is a mystery.

Let me know if this solution works for you.

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000,3000,700,800,900,1000,20000,20000,30000,30000)
DF <- data.frame(Truck, OilChanged, Odometer)

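library(dplyr)

DF %>%
  group_by(Truck) %>%
  # status = number of distinct OilChanged values for the truck; a truck with a
  # single distinct value (here Truck B) is treated as never having had an oil change
  mutate(status = length(unique(OilChanged)),
         OilAge = ifelse(OilChanged == "NewOil", 0,
                         # note: this branch simplifies to lag(Odometer), which equals the oil age for the sample data
                         ifelse(OilChanged == "False", Odometer - (Odometer - lag(Odometer)),
                                ifelse(OilChanged == "True", Odometer - lag(Odometer), NA)))) %>%
  mutate(OilAge = ifelse(status != 1, OilAge, NA)) %>%
  subset(select = c(Truck, OilChanged, Odometer, OilAge))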
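
For comparison, here is a minimal base R sketch that states the rule from the question directly: the oil age is the current odometer reading minus the reading at the most recent NewOil row for the same truck, and NA before the first NewOil row. The helper name `oil_age_one_truck` is hypothetical (it is not from the original post), and the sketch assumes the rows of each truck are already in chronological order.

```r
# Sample data from the question
Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000, 3000, 700, 800, 900, 1000, 20000, 20000, 30000, 30000)
DF <- data.frame(Truck, OilChanged, Odometer)

# Hypothetical helper: oil age for the rows of a single truck, in row order
oil_age_one_truck <- function(odometer, oil_changed) {
  is_new <- oil_changed == "NewOil"
  # Position of the most recent NewOil row at or before each row (0 if none yet)
  last_new <- cummax(ifelse(is_new, seq_along(odometer), 0L))
  age <- rep(NA_real_, length(odometer))
  known <- last_new > 0
  # Distance driven since the most recent NewOil reading
  age[known] <- odometer[known] - odometer[last_new[known]]
  age
}

# Apply the helper truck by truck, writing the results back in place
DF$OilAge <- NA_real_
for (idx in split(seq_len(nrow(DF)), DF$Truck)) {
  DF$OilAge[idx] <- oil_age_one_truck(DF$Odometer[idx], DF$OilChanged[idx])
}
DF
```

Because the age is anchored to the last NewOil reading instead of the previous row, this version does not depend on the readings being evenly spaced.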

Another approach:

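library(dplyr)
DF %>% group_by(Truck)  %>%
  # d counts the NewOil rows seen so far within each truck (0 = before any oil change)
  mutate(d = cumsum(OilChanged == 'NewOil')) %>%
  group_by(Truck, d) %>%
  # NA^TRUE is NA and NA^FALSE is 1, so groups with d == 0 become NA,
  # while the other groups accumulate odometer differences starting from 0
  mutate(OilAge = cumsum(c(0*NA^(as.logical(!(first(d)))), diff(NA^(as.logical(!d))*Odometer))))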

# A tibble: 12 x 5
# Groups:   Truck, d [6]
   Truck OilChanged Odometer     d OilAge
   <chr> <chr>         <dbl> <int>  <dbl>
 1 A     True           1000     0     NA
 2 A     NewOil         1000     1      0
 3 A     False          2000     1   1000
 4 A     False          3000     1   2000
 5 B     False           700     0     NA
 6 B     False           800     0     NA
 7 B     False           900     0     NA
 8 B     False          1000     0     NA
 9 C     True          20000     0     NA
10 C     NewOil        20000     1      0
11 C     True          30000     1  10000
12 C     NewOil        30000     2      0

d is a helper variable; you can deselect it once you understand what has been done.
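
The `NA^...` expressions in the second approach can look cryptic. The sketch below isolates the trick in base R: `NA^0` is 1 while `NA^1` is NA, so raising NA to a logical acts as a mask that either keeps a value or turns it into NA. The function name `age_in_group` is hypothetical and only mirrors what the `mutate()` call does inside one (Truck, d) group, where d is constant.

```r
# Raising NA to a power: the exponent 0 (or FALSE) gives 1, the exponent 1 (or TRUE) gives NA
na_pow_zero  <- NA^0
na_pow_one   <- NA^1
na_pow_false <- NA^FALSE
na_pow_true  <- NA^TRUE

# Hypothetical helper mirroring the mutate() step for a single (Truck, d) group
age_in_group <- function(odometer, d) {
  # Starting value: NA if the group precedes any NewOil row (d == 0), otherwise 0
  start <- 0 * NA^(d[1] == 0)
  # Odometer differences, masked to NA when d == 0
  steps <- diff(NA^(d == 0) * odometer)
  cumsum(c(start, steps))
}

# Truck C, d == 1: the NewOil row at 20000 followed by the True row at 30000
after_change <- age_in_group(c(20000, 30000), c(1, 1))

# Truck B, d == 0 throughout: no oil change was ever recorded
never_changed <- age_in_group(c(700, 800, 900, 1000), c(0, 0, 0, 0))
```

Here `after_change` holds the ages counted from the oil change, and `never_changed` is NA in every position because the mask propagates through `cumsum()`.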