Filling a column based on the value of another column in data.table

I have data as follows:

dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A", 
"Area B"), class = "factor"), type= c("cat_1", "cat_1", 
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", 
"cat_1", NA, "cat_2", NA, NA, "cat_2", "cat_2", "cat_2", "cat_2", 
"cat_2")), class = c("data.table", "data.frame"), row.names = c(NA, 
-20L))

    amount_of_categories municipality  type
 1:                    2       Area A cat_1
 2:                    2       Area A cat_1
 3:                    2       Area A cat_1
 4:                    2       Area A cat_1
 5:                    2       Area A cat_1
 6:                    2       Area A cat_1
 7:                    2       Area A cat_1
 8:                    2       Area A cat_1
 9:                    2       Area A cat_1
10:                    2       Area A cat_1
11:                    2       Area A cat_1
12:                    2       Area A  <NA>
13:                    2       Area A cat_2
14:                    1       Area B  <NA>
15:                    1       Area B  <NA>
16:                    1       Area B cat_2
17:                    1       Area B cat_2
18:                    1       Area B cat_2
19:                    1       Area B cat_2
20:                    1       Area B cat_2

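The idea is to create a new column type_estimation that replaces the NAs in the type column with the correct type. The correct type can only be established if the area has a single category (amount_of_categories == 1). So it should fill the last two NAs, but not the first one.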

I tried:

dat <- setDT(dat)[is.na(type) & amount_of_categories==1, type_estimation:= shift(type), by="municipality"]

But this does not work. What is the correct syntax here?

Desired result:

    amount_of_categories municipality  type  type_estimation
 1:                    2       Area A cat_1            cat_1
 2:                    2       Area A cat_1            cat_1
 3:                    2       Area A cat_1            cat_1
 4:                    2       Area A cat_1            cat_1
 5:                    2       Area A cat_1            cat_1
 6:                    2       Area A cat_1            cat_1
 7:                    2       Area A cat_1            cat_1
 8:                    2       Area A cat_1            cat_1
 9:                    2       Area A cat_1            cat_1
10:                    2       Area A cat_1            cat_1
11:                    2       Area A cat_1            cat_1
12:                    2       Area A  <NA>             <NA> 
13:                    2       Area A cat_2            cat_2
14:                    1       Area B  <NA>            cat_2
15:                    1       Area B  <NA>            cat_2
16:                    1       Area B cat_2            cat_2
17:                    1       Area B cat_2            cat_2
18:                    1       Area B cat_2            cat_2
19:                    1       Area B cat_2            cat_2
20:                    1       Area B cat_2            cat_2

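EDIT:

I tried to come up with a situation in which the solution provided by Waldi could cause a problem. After thinking about it for a while, I realised that it goes wrong when:

  1. dat[,estimation:=zoo::na.locf(type)] fills in the wrong type, because the last observation of Area A is carried forward into the first observations of Area B, and
  2. Area B has only one category, so [amount_of_categories!=1&is.na(type) ,estimation:=NA][] does not reset this value to NA.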

In this example data:

dat <- structure(list(amount_of_categories = c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), municipality = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Area A", 
"Area B"), class = "factor"), type= c("cat_1", "cat_1", 
"cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", "cat_1", 
"cat_1", NA, "cat_2", NA, NA, "cat_3", "cat_3", "cat_3", "cat_3", 
"cat_3")), class = c("data.table", "data.frame"), row.names = c(NA, 
-20L))

    amount_of_categories municipality  type estimation
 1:                    2       Area A cat_1      cat_1
 2:                    2       Area A cat_1      cat_1
 3:                    2       Area A cat_1      cat_1
 4:                    2       Area A cat_1      cat_1
 5:                    2       Area A cat_1      cat_1
 6:                    2       Area A cat_1      cat_1
 7:                    2       Area A cat_1      cat_1
 8:                    2       Area A cat_1      cat_1
 9:                    2       Area A cat_1      cat_1
10:                    2       Area A cat_1      cat_1
11:                    2       Area A cat_1      cat_1
12:                    2       Area A  <NA>       <NA>
13:                    2       Area A cat_2      cat_2
14:                    1       Area B  <NA>      cat_2
15:                    1       Area B  <NA>      cat_2
16:                    1       Area B cat_3      cat_3
17:                    1       Area B cat_3      cat_3
18:                    1       Area B cat_3      cat_3
19:                    1       Area B cat_3      cat_3
20:                    1       Area B cat_3      cat_3

As Waldi already pointed out, this problem cannot be solved by using:

dat[,estimation:=zoo::na.locf(type), by="municipality"][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

Any help with this problem would be much appreciated.

In two steps:

dat[,estimation:=zoo::na.locf(type)][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

    amount_of_categories municipality   type estimation
                   <int>       <fctr> <char>     <char>
 1:                    2       Area A  cat_1      cat_1
 2:                    2       Area A  cat_1      cat_1
 3:                    2       Area A  cat_1      cat_1
 4:                    2       Area A  cat_1      cat_1
 5:                    2       Area A  cat_1      cat_1
 6:                    2       Area A  cat_1      cat_1
 7:                    2       Area A  cat_1      cat_1
 8:                    2       Area A  cat_1      cat_1
 9:                    2       Area A  cat_1      cat_1
10:                    2       Area A  cat_1      cat_1
11:                    2       Area A  cat_1      cat_1
12:                    2       Area A   <NA>       <NA>
13:                    2       Area A  cat_2      cat_2
14:                    1       Area B   <NA>      cat_2
15:                    1       Area B   <NA>      cat_2
16:                    1       Area B  cat_2      cat_2
17:                    1       Area B  cat_2      cat_2
18:                    1       Area B  cat_2      cat_2
19:                    1       Area B  cat_2      cat_2
20:                    1       Area B  cat_2      cat_2
    amount_of_categories municipality   type estimation

Note that I used zoo::na.locf because data.table::nafill(type='locf') cannot handle character columns yet.
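If you would rather avoid the zoo dependency, one possible workaround is to run nafill() on integer positions and map them back to the strings. This is only a sketch: locf_chr is a hypothetical helper name, not something data.table provides.

```r
library(data.table)

# Hypothetical helper (not part of data.table): LOCF for a character vector
locf_chr <- function(x) {
  lev <- unique(x[!is.na(x)])        # distinct non-NA values
  idx <- match(x, lev)               # integer codes, NA stays NA
  lev[nafill(idx, type = "locf")]    # fill the codes, then map back to strings
}

x <- c(NA, "cat_1", NA, "cat_2", NA)
filled <- locf_chr(x)
print(filled)
```

Unlike the default zoo::na.locf call, this keeps leading NAs, so the result always has the same length as the input.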

An alternative using na.fill by municipality, after your edit (example 2):

dat[,estimation:=zoo::na.fill(type,fill=type[which.max(!is.na(type))]),by=municipality][amount_of_categories!=1&is.na(type) ,estimation:=NA][]

    amount_of_categories municipality   type estimation
                   <int>       <fctr> <char>     <char>
 1:                    2       Area A  cat_1      cat_1
 2:                    2       Area A  cat_1      cat_1
 3:                    2       Area A  cat_1      cat_1
 4:                    2       Area A  cat_1      cat_1
 5:                    2       Area A  cat_1      cat_1
 6:                    2       Area A  cat_1      cat_1
 7:                    2       Area A  cat_1      cat_1
 8:                    2       Area A  cat_1      cat_1
 9:                    2       Area A  cat_1      cat_1
10:                    2       Area A  cat_1      cat_1
11:                    2       Area A  cat_1      cat_1
12:                    2       Area A   <NA>       <NA>
13:                    2       Area A  cat_2      cat_2
14:                    1       Area B   <NA>      cat_3
15:                    1       Area B   <NA>      cat_3
16:                    1       Area B  cat_3      cat_3
17:                    1       Area B  cat_3      cat_3
18:                    1       Area B  cat_3      cat_3
19:                    1       Area B  cat_3      cat_3
20:                    1       Area B  cat_3      cat_3
    amount_of_categories municipality   type estimation
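
For reference, the fill value in that call is simply the first non-NA type within each municipality. A minimal illustration on a made-up vector (the vector and the variable names are assumptions for this example):

```r
# Made-up values for one municipality, with leading NAs as in Area B
type <- c(NA, NA, "cat_3", "cat_3")

not_missing <- !is.na(type)              # logical flag per element
first_pos   <- which.max(not_missing)    # position of the first TRUE
first_non_na <- type[first_pos]          # value used as `fill`

print(first_pos)
print(first_non_na)
```

If a group contains only NA, which.max() returns 1 and the fill value is itself NA.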

Would this approach, using unique() and a join, help in both cases?

unique(
  dat[amount_of_categories==1 & !is.na(type), .(municipality,type_estimation=type)]
)[dat, on=.(municipality)][is.na(type_estimation),type_estimation:=type][]

Output for example 1:

    municipality type_estimation amount_of_categories   type
          <fctr>          <char>                <int> <char>
 1:       Area A           cat_1                    2  cat_1
 2:       Area A           cat_1                    2  cat_1
 3:       Area A           cat_1                    2  cat_1
 4:       Area A           cat_1                    2  cat_1
 5:       Area A           cat_1                    2  cat_1
 6:       Area A           cat_1                    2  cat_1
 7:       Area A           cat_1                    2  cat_1
 8:       Area A           cat_1                    2  cat_1
 9:       Area A           cat_1                    2  cat_1
10:       Area A           cat_1                    2  cat_1
11:       Area A           cat_1                    2  cat_1
12:       Area A            <NA>                    2   <NA>
13:       Area A           cat_2                    2  cat_2
14:       Area B           cat_2                    1   <NA>
15:       Area B           cat_2                    1   <NA>
16:       Area B           cat_2                    1  cat_2
17:       Area B           cat_2                    1  cat_2
18:       Area B           cat_2                    1  cat_2
19:       Area B           cat_2                    1  cat_2
20:       Area B           cat_2                    1  cat_2

Output for example 2:

    municipality type_estimation amount_of_categories   type
          <fctr>          <char>                <int> <char>
 1:       Area A           cat_1                    2  cat_1
 2:       Area A           cat_1                    2  cat_1
 3:       Area A           cat_1                    2  cat_1
 4:       Area A           cat_1                    2  cat_1
 5:       Area A           cat_1                    2  cat_1
 6:       Area A           cat_1                    2  cat_1
 7:       Area A           cat_1                    2  cat_1
 8:       Area A           cat_1                    2  cat_1
 9:       Area A           cat_1                    2  cat_1
10:       Area A           cat_1                    2  cat_1
11:       Area A           cat_1                    2  cat_1
12:       Area A            <NA>                    2   <NA>
13:       Area A           cat_2                    2  cat_2
14:       Area B           cat_3                    1   <NA>
15:       Area B           cat_3                    1   <NA>
16:       Area B           cat_3                    1  cat_3
17:       Area B           cat_3                    1  cat_3
18:       Area B           cat_3                    1  cat_3
19:       Area B           cat_3                    1  cat_3
20:       Area B           cat_3                    1  cat_3
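
Broken into steps, the one-liner above can be read like this. The small table is made-up data with the question's column names, and lookup and res are names chosen for this sketch. The last line is optional and only restores the original column order, since the join puts the columns of the look-up first.

```r
library(data.table)

# Small stand-in for dat (assumed data, same column names as in the question)
dat <- data.table(
  amount_of_categories = c(2L, 2L, 2L, 1L, 1L),
  municipality = c("Area A", "Area A", "Area A", "Area B", "Area B"),
  type = c("cat_1", NA, "cat_2", NA, "cat_3")
)

# Step 1: one row per single-category municipality with its known type
lookup <- unique(
  dat[amount_of_categories == 1 & !is.na(type),
      .(municipality, type_estimation = type)]
)

# Step 2: join onto dat; municipalities without a match get NA
res <- lookup[dat, on = .(municipality)]

# Step 3: fall back to the observed type where the look-up had no match
res[is.na(type_estimation), type_estimation := type]

# Optional: put the original columns first again
setcolorder(res, names(dat))
print(res)
```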

Another approach is to create a look-up table for the relevant cases, which is then used in an update join:

library(data.table)
lut <- setDT(dat)[amount_of_categories == 1, first(na.omit(type)), by = municipality]
dat[, estimation := type][lut, on = .(municipality), estimation := V1][]

Result for example 1

    amount_of_categories municipality  type estimation
 1:                    2       Area A cat_1      cat_1
 2:                    2       Area A cat_1      cat_1
 3:                    2       Area A cat_1      cat_1
 4:                    2       Area A cat_1      cat_1
 5:                    2       Area A cat_1      cat_1
 6:                    2       Area A cat_1      cat_1
 7:                    2       Area A cat_1      cat_1
 8:                    2       Area A cat_1      cat_1
 9:                    2       Area A cat_1      cat_1
10:                    2       Area A cat_1      cat_1
11:                    2       Area A cat_1      cat_1
12:                    2       Area A  <NA>       <NA>
13:                    2       Area A cat_2      cat_2
14:                    1       Area B  <NA>      cat_2
15:                    1       Area B  <NA>      cat_2
16:                    1       Area B cat_2      cat_2
17:                    1       Area B cat_2      cat_2
18:                    1       Area B cat_2      cat_2
19:                    1       Area B cat_2      cat_2
20:                    1       Area B cat_2      cat_2

Result for example 2

    amount_of_categories municipality  type estimation
 1:                    2       Area A cat_1      cat_1
 2:                    2       Area A cat_1      cat_1
 3:                    2       Area A cat_1      cat_1
 4:                    2       Area A cat_1      cat_1
 5:                    2       Area A cat_1      cat_1
 6:                    2       Area A cat_1      cat_1
 7:                    2       Area A cat_1      cat_1
 8:                    2       Area A cat_1      cat_1
 9:                    2       Area A cat_1      cat_1
10:                    2       Area A cat_1      cat_1
11:                    2       Area A cat_1      cat_1
12:                    2       Area A  <NA>       <NA>
13:                    2       Area A cat_2      cat_2
14:                    1       Area B  <NA>      cat_3
15:                    1       Area B  <NA>      cat_3
16:                    1       Area B cat_3      cat_3
17:                    1       Area B cat_3      cat_3
18:                    1       Area B cat_3      cat_3
19:                    1       Area B cat_3      cat_3
20:                    1       Area B cat_3      cat_3

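Explanation

  1. For each municipality with only one category, the first non-NA element of type is picked for the look-up table lut.
  2. A new column estimation is created in dat as a full copy of type.
  3. In the update join, the entries in estimation are replaced by the value from lut, but only for matching municipality.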
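
The three steps above can be run separately to inspect the intermediate results. A small sketch on made-up data with the same column names as in the question:

```r
library(data.table)

# Made-up data: Area A has two categories, Area B has one
dat <- data.table(
  amount_of_categories = c(2L, 2L, 2L, 1L, 1L, 1L),
  municipality = c("Area A", "Area A", "Area A", "Area B", "Area B", "Area B"),
  type = c("cat_1", NA, "cat_2", NA, NA, "cat_3")
)

# Step 1: look-up table, one row per single-category municipality
# (the unnamed j expression ends up in a column called V1)
lut <- dat[amount_of_categories == 1, first(na.omit(type)), by = municipality]
print(lut)

# Step 2: start from a full copy of type
dat[, estimation := type]

# Step 3: update join, only rows of municipalities present in lut are changed
dat[lut, on = .(municipality), estimation := V1]
print(dat)
```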

This approach is somewhat similar to another answer, but differs in implementation details.

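N.B.: Updating column type directly

The OP has requested to create a separate column estimation. However, the column type can be updated directly, which simplifies the code: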

dat[lut, on = .(municipality), type := V1][]
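
Note that this overwrites type for every row of a matching municipality, which is harmless here because those municipalities have a single category by definition. If you prefer to touch only the missing values, a possible variant (not part of the original request) uses fcoalesce() from data.table, with the i. prefix referring to the column of lut. The data below is made up for the sketch:

```r
library(data.table)

# Made-up data with the question's column names
dat <- data.table(
  amount_of_categories = c(2L, 2L, 2L, 1L, 1L, 1L),
  municipality = c("Area A", "Area A", "Area A", "Area B", "Area B", "Area B"),
  type = c("cat_1", NA, "cat_2", NA, NA, "cat_3")
)

# Look-up table as before; the unnamed result column is called V1
lut <- dat[amount_of_categories == 1, first(na.omit(type)), by = municipality]

# Update join: keep existing values, take the look-up value only where type is NA
dat[lut, on = .(municipality), type := fcoalesce(type, i.V1)]
print(dat)
```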