Functions na.rv(T), na.omit, is.finite, etc. don't work for the mean of a column

Question

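I am trying to compute means over a large df, grouping the observations by Id and month, but none of the answers I found work as expected. Sometimes they empty my sample, which is not useful.

If the df is: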

permno               company        amihud   illiq  MonthYr
10026   J & J SNACK FOODS CORP  1.389026403 1.625   1990-01
10026   J & J SNACK FOODS CORP  1.028968686 NA      1990-01
10026   J & J SNACK FOODS CORP  NA          NA      1990-01
10026   J & J SNACK FOODS CORP  NA          NA      1990-01
10026   J & J SNACK FOODS CORP  Inf         NA      1990-01
10026   J & J SNACK FOODS CORP  Inf         NA      1990-02
10026   J & J SNACK FOODS CORP  0.891034483 NA      1990-02
10397   WERNER ENTERPRISES INC  0.443933917 NA      1990-01
10397   WERNER ENTERPRISES INC  0.255496848 NA      1990-01
10397   WERNER ENTERPRISES INC  0.891034483 NA      1990-02

structure(list(permno = c(10026L, 10026L, 10026L, 10026L, 10026L, 
10026L, 10397L, 10397L, 10397L, 10397L), date = structure(c(5L, 
6L, 1L, 2L, 3L, 4L, 7L, 8L, 9L, 10L), .Label = c("1/10/1990", 
"1/11/1990", "1/12/1990", "1/15/1990", "1/2/1990", "1/3/1990", 
"7/29/1998", "7/30/1998", "8/6/1998", "8/7/1998"), class = "factor"), 
    company = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L), .Label = c("J & J SNACK FOODS CORP", "WERNER ENTERPRISES INC"
    ), class = "factor"), price = c(11.75, 12.75, 13, 13, 12.375, 
    12.75, 12.25, 12.25, 10.75, 11.25), volume = c(36360L, 82710L, 
    22750L, 8574L, 40262L, 10150L, 25200L, 9000L, 333100L, 52200L
    ), amihud = c(1.389026403, 1.028968686, NA, Inf, Inf, 0.891034483, 
    0.255496848, NA, Inf, 0.891034483), illiq = c(1.625240831, 
    NA, NA, NA, NA, NA, NA, NA, NA, NA), MonthYr = structure(c(1L, 
    1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("1990-01", 
    "1990-02"), class = "factor")), .Names = c("permno", "date", 
"company", "price", "volume", "amihud", "illiq", "MonthYr"), class = "data.frame", row.names = c(NA, 
-10L))

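I want to compute the Amihud measure (a measure of financial illiquidity, and hence of risk). In short: I need the mean of the variable 'amihud' for each stock (permno) and each month, which I will call 'illiq'.

I tried: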

res <- smallcap %>%
        group_by(permno, MonthYr) %>%
        mean(amihud, na.rm=T) %>% 
        group_by(permno)

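I don't know how correct this is, but every attempt to omit or subset the NA and Inf values has failed.

Expected result, leaving aside whether the numbers in this example are correct, and without needing the amihud variable: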

permno                 company  illiq  MonthYr
 10026  J & J SNACK FOODS CORP   1.65  1990-01
 10026  J & J SNACK FOODS CORP   0.87  1990-02
 10397  WERNER ENTERPRISES INC   0.25  1990-01
 10397  WERNER ENTERPRISES INC   0.55  1990-02

Thanks for any hints.

Answer 1

You need to do the following:

#since you don't care about the Infs convert them to NAs
#so that they get removed at the mean function 
#since we have set na.rm=TRUE
df$amihud[df$amihud==Inf] <- NA

library(dplyr)
#you need to use summarise to calculate the means as below:
res <- df %>%
          select(permno, company, MonthYr, amihud) %>%
          group_by(permno, company, MonthYr) %>%
          summarise(illiq = mean(amihud, na.rm=TRUE))

Output:

> res
Source: local data frame [4 x 4]
Groups: permno, company

  permno                company MonthYr     illiq
1  10026 J & J SNACK FOODS CORP 1990-01 1.2089975
2  10026 J & J SNACK FOODS CORP 1990-02 0.8910345
3  10397 WERNER ENTERPRISES INC 1990-01 0.2554968
4  10397 WERNER ENTERPRISES INC 1990-02 0.8910345
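
As a side note on why the original attempt did not produce means: a pipe passes its left-hand side as the first argument, so `x %>% mean(amihud, na.rm=T)` is evaluated as `mean(x, amihud, na.rm=T)`, i.e. `mean()` receives the whole data frame instead of the column. The base R sketch below illustrates the difference; the three-row `df` here is a made-up stand-in, not the data from the question.

```r
# Stand-in data frame (hypothetical values, same column name as in the question)
df <- data.frame(amihud = c(1.5, NA, Inf))

# What the pipe effectively evaluates: mean() applied to the data frame itself.
# mean() cannot average a data frame, so it warns and returns NA.
whole_df <- suppressWarnings(mean(df, na.rm = TRUE))

# What was intended: mean() applied to the column vector,
# here restricted to its finite values
column_only <- mean(df$amihud[is.finite(df$amihud)])

print(whole_df)
print(column_only)
```

This is why the per-group mean has to be computed inside `summarise()`, where `amihud` refers to the column within each group.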

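P.S. The values in your expected output probably come from the full data set. In the sample, 10026 J & J SNACK FOODS CORP 1990-02 has only one usable value, so that value is also the mean: 0.8910345, not the 0.87 shown in your expected output.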
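
If you would rather not overwrite the Inf values in the data frame, one possible alternative is to filter inside the averaging function with `is.finite()`, which is FALSE for NA, NaN, Inf and -Inf alike. The following is only a base R sketch under that assumption: `finite_mean` is a helper name made up for illustration, and the data are the permno, MonthYr and amihud columns from the `dput` output in the question.

```r
# The three relevant columns, copied from the dput() in the question
df <- data.frame(
  permno  = c(10026L, 10026L, 10026L, 10026L, 10026L, 10026L,
              10397L, 10397L, 10397L, 10397L),
  MonthYr = c("1990-01", "1990-01", "1990-01", "1990-01", "1990-02", "1990-02",
              "1990-01", "1990-01", "1990-02", "1990-02"),
  amihud  = c(1.389026403, 1.028968686, NA, Inf, Inf, 0.891034483,
              0.255496848, NA, Inf, 0.891034483)
)

# Hypothetical helper: mean over finite values only.
# Returns NaN for a group that has no finite value at all.
finite_mean <- function(x) mean(x[is.finite(x)])

# na.action = na.pass keeps the NA rows so that the helper sees every
# observation; the default (na.omit) would drop them before grouping
res <- aggregate(amihud ~ permno + MonthYr, data = df,
                 FUN = finite_mean, na.action = na.pass)

# Rename the aggregated column and sort by stock, then month
names(res)[names(res) == "amihud"] <- "illiq"
res <- res[order(res$permno, res$MonthYr), ]
print(res)
```

On this sample it should give the same four means as the dplyr version above, while leaving the original column untouched.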

