Fill NA values from duplicate rows, then remove the duplicates

I have the following data:

library(data.table)
data <- data.table(address = c("AA", "BB", "AA", "CC", "DD", "EE", "DD"),
                   revenue = c(NA, 121, 22, 33, 44, 33, NA),
                   castord = c(21, 22, NA, 3, NA, 223, 33),
                   versaze = c(NA, 22, 124, 33, NA, 44, 43))
data

#    address revenue castord versaze
# 1:      AA      NA      21      NA
# 2:      BB     121      22      22
# 3:      AA      22      NA     124
# 4:      CC      33       3      33
# 5:      DD      44      NA      NA
# 6:      EE      33     223      44
# 7:      DD      NA      33      43

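The addresses AA and DD are duplicated in this data. As you can see in rows 1 and 5, their first occurrences have some NA values. I want to fill those NAs using the duplicate rows for the same address. If the duplicate row itself has an NA, that NA should not replace a value in the first occurrence. This would give me the following data: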

data <- data.table(address = c("AA", "BB", "AA", "CC", "DD", "EE", "DD"),
                   revenue = c(22, 121, 22, 33, 44, 33, NA),
                   castord = c(21, 22, NA, 3, 33, 223, 33),
                   versaze = c(124, 22, 124, 33, 43, 44, 43))

#    address revenue castord versaze
# 1:      AA      22      21     124
# 2:      BB     121      22      22
# 3:      AA      22      NA     124
# 4:      CC      33       3      33
# 5:      DD      44      33      43
# 6:      EE      33     223      44
# 7:      DD      NA      33      43

Then remove the duplicate rows:

data <- data.table(address = c("AA", "BB", "CC", "DD", "EE"),
                   revenue = c(22, 121, 33, 44, 33),
                   castord = c(21, 22, 3, 33, 223),
                   versaze = c(124, 22, 33, 43, 44))

#    address revenue castord versaze
# 1:      AA      22      21     124
# 2:      BB     121      22      22
# 3:      CC      33       3      33
# 4:      DD      44      33      43
# 5:      EE      33     223      44
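
For reference, the two steps described above (fill the first occurrence, then drop the later duplicates) can be sketched literally in data.table. This is only one possible reading of the requirement. The helper fill_first and the names value_cols, filled, and result are made up for this illustration and are not part of the original post.

```r
library(data.table)

data <- data.table(address = c("AA", "BB", "AA", "CC", "DD", "EE", "DD"),
                   revenue = c(NA, 121, 22, 33, 44, 33, NA),
                   castord = c(21, 22, NA, 3, NA, 223, 33),
                   versaze = c(NA, 22, 124, 33, NA, 44, 43))

# Every column except the grouping key (hypothetical helper name)
value_cols <- setdiff(names(data), "address")

# Hypothetical helper: if the first value of a group is NA, replace it
# with the first non-NA value of that group; later rows are left untouched
fill_first <- function(x) {
  if (is.na(x[1])) x[1] <- x[!is.na(x)][1]
  x
}

# Step 1: fill the first occurrence of each address (on a copy, so the
# original table is not modified by reference)
filled <- copy(data)
filled[, (value_cols) := lapply(.SD, fill_first), by = address, .SDcols = value_cols]

# Step 2: keep only the first row of each address
result <- unique(filled, by = "address")
result
```

unique() with by = "address" keeps the first row per address, so result should match the five-row table above, in order of first appearance.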

If you use dplyr::group_by and summarise, you can use na.omit to take the first value that is not NA, which handles the case where the first row is NA.

library(dplyr)

# For each address, keep the first non-NA value of every column
data <- data %>%
  group_by(address) %>%
  summarise(
    revenue = first(na.omit(revenue)),
    castord = first(na.omit(castord)),
    versaze = first(na.omit(versaze))
  )
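
If there are many value columns, listing each one by hand gets tedious. A possible variant, assuming dplyr 1.0.0 or later for across(), applies the same rule to every non-grouping column. The helper first_non_na is a name invented for this sketch, and it uses plain subsetting instead of first(na.omit(...)); the data is rebuilt as a plain data.frame so the block is self-contained.

```r
library(dplyr)

data <- data.frame(address = c("AA", "BB", "AA", "CC", "DD", "EE", "DD"),
                   revenue = c(NA, 121, 22, 33, 44, 33, NA),
                   castord = c(21, 22, NA, 3, NA, 223, 33),
                   versaze = c(NA, 22, 124, 33, NA, 44, 43),
                   stringsAsFactors = FALSE)

# Hypothetical helper: first non-NA element, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# everything() inside a grouped summarise() excludes the grouping column
result <- data %>%
  group_by(address) %>%
  summarise(across(everything(), first_non_na))
result
```

The output is a tibble with one row per address, ordered by address rather than by first appearance.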

Using data.table, group by address, loop over the columns, drop the NAs, and take the first value:

data[, lapply(.SD, function(x) na.omit(x)[1]), by = address]
#    address revenue castord versaze
# 1:      AA      22      21     124
# 2:      BB     121      22      22
# 3:      CC      33       3      33
# 4:      DD      44      33      43
# 5:      EE      33     223      44
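
One property of this approach that may be worth checking: when an address has no non-NA value at all in a column, na.omit(x)[1] indexes an empty vector and gives NA rather than an error. The small table toy below, including the address FF, is invented purely to illustrate that case and does not come from the question.

```r
library(data.table)

# Hypothetical data: FF has no non-NA revenue at all
toy <- data.table(address = c("AA", "AA", "FF", "FF"),
                  revenue = c(NA, 22, NA, NA),
                  castord = c(21, NA, 5, 6))

# Same expression as above: first non-NA value per column and address
res <- toy[, lapply(.SD, function(x) na.omit(x)[1]), by = address]
res
```

Here revenue for FF stays NA, while every other cell holds the first non-NA value of its group.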