Conditional sum on data.frame based on duplicates

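I have been trying to do a conditional sum on a data.frame that contains duplicates. I want to sum the rows that share the same permno and date, and put the result in a separate column, filling the remaining rows with NA or, preferably, 0.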

My dataset looks like this:

data.frame(crsp)

    permno     date    PAYDT DISTCD divamt FACPR FACSHR   PRC       RET
1   10022 19280929 19281001   1272   0.25     0      0 71.00  0.045208
2   10022 19280929 19281001   1232   1.00     0      0 71.00  0.045208
3   10022 19281031       NA     NA     NA    NA     NA 73.50  0.035211
4   10022 19281130       NA     NA     NA    NA     NA 72.50 -0.013605
5   10022 19281231 19290202   1232   1.00     0      0 68.00 -0.044828
6   10022 19281231 19290202   1272   0.25     0      0 68.00 -0.044828
7   10022 19290131       NA     NA     NA    NA     NA 73.75  0.084559
8   10022 19290228       NA     NA     NA    NA     NA 69.00 -0.064407
9   10022 19290328 19290401   1232   1.00     0      0 65.00 -0.039855
10  10022 19290328 19290401   1272   0.25     0      0 65.00 -0.039855
11  10022 19290430       NA     NA     NA    NA     NA 67.00  0.030769
12  10022 19290531       NA     NA     NA    NA     NA 64.75 -0.033582

First, I combined permno and date to make a unique pickup code:

crsp$permnodate = paste(as.character(crsp$permno),as.character(crsp$date),sep="") 

Second, I tried to sum the duplicates and put the result into a new data frame:

crsp_divsingl <- aggregate(crsp$divamt, by = list(permnodate = crsp$permnodate), FUN = sum, na.rm = TRUE)

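However, I cannot transfer this information back into the original data.frame (crsp) correctly, because the columns have different lengths, and neither cbind nor cbind.fill lets me match the rows up properly. Specifically, I want the sum of the divamt values to appear on one (the first) row of each unique permnodate, so that the new column has the same length as the rest of the data.frame. I have not had any success with merge or match either.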

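I have not tried writing a loop, and I have not managed to build a working if or ifelse function. Basically, this could be done in Excel with a VLOOKUP or INDEX/MATCH formula, but it is trickier in R than I first thought.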

Help is much appreciated.

Best regards

Truls

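You can use duplicated and merge to do this more easily. I wrote up an example. You will have to adapt it to your own purposes, but hopefully it puts you on the right track: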

# Creating a fake sample dataset.
set.seed(9)
permno <- 10022:10071 # Allowing 50 possible permno's. 
date <- 19280929:19280978 # Allow 50 possible dates.
value <- c(NA, 1:9) # Allowing NA or a 1 through 9 value.

# Creating fake data frame.
crsp <- data.frame(permno = sample(permno, 1000, TRUE), date = sample(date, 1000, TRUE), value = sample(value, 1000, TRUE))

# Defining a helper that uses duplicated() to flag both the duplicated rows and their first occurrences.
fullDup <- function(x) {

  bool <- duplicated(x) | duplicated(x, fromLast = TRUE)
  return(bool)

}

# Getting the duplicated rows. fullDup returns a logical vector flagging every row
# that shares its permno and date with another row, including the first occurrence.
crsp.dup <- crsp[fullDup(crsp[, c("permno", "date")]), ]

# Now aggregate.
crsp.dup[is.na(crsp.dup)] <- 0 # Converting NA values to 0 (the formula interface of aggregate() would otherwise drop those rows).
crsp.dup <- aggregate(value ~ permno + date, crsp.dup, sum)
names(crsp.dup)[3] <- "value.dup" # Changing the name of the value column.

# Now merge back in with the original dataset.
crsp <- merge(crsp, crsp.dup, by = c("permno", "date"), all.x = TRUE)
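
If the goal is exactly what the question describes (the total on the first row of each permno/date group and 0 everywhere else), a merge may not be needed at all. The following is only a minimal sketch of an alternative, not part of the code above: it uses base R's ave() and duplicated() on a trimmed-down copy of the data from the question, and the column name divsum is made up for illustration.

```r
# Small subset of the data from the question (columns trimmed for brevity).
crsp <- data.frame(
  permno = rep(10022, 6),
  date   = c(19280929, 19280929, 19281031, 19281130, 19281231, 19281231),
  divamt = c(0.25, 1.00, NA, NA, 1.00, 0.25)
)

# Sum divamt within each permno/date group. ave() returns a vector as long
# as its input, so the result lines up with the original rows directly.
# NA dividends are treated as 0 here (an assumption about the desired result).
group_sum <- ave(
  ifelse(is.na(crsp$divamt), 0, crsp$divamt),
  crsp$permno, crsp$date,
  FUN = sum
)

# Keep the total only on the first row of each permno/date group, 0 elsewhere.
# "divsum" is a hypothetical column name chosen for this sketch.
is_first <- !duplicated(crsp[, c("permno", "date")])
crsp$divsum <- ifelse(is_first, group_sum, 0)

print(crsp)
```

Because ave() keeps the original row order and length, this avoids the length mismatch that cbind ran into in the question. To have the total repeated on every row of a group instead, assign group_sum directly and skip the is_first step.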