Using dplyr to check that each firm has June history
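I have a dataset with 1m+ rows and need to check whether each firm (cusip) and year (fyear) has a June observation, where datadate is in YYYYMMDD format. I tried using substr() to extract the month and then ran a logical test: if it is true, the row is left alone; if not, the cusip should be dropped. However, this does not work and returns errors about a non-logical argument and the length of the condition. I have checked each piece outside of dplyr to make sure it works, and I do not run into any problems except inside dplyr. Any help would be greatly appreciated.
Reproducible code: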
tdata <- structure(list(cusip = c(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 2), fyear = c(1962L,
1963L, 1964L, 1965L, 1966L, 1967L, 1968L, 1969L, 1970L, 1971L,
1972L, 1973L, 1974L, 1975L, 1976L, 1977L, 1978L, 1979L, 1980L,
1981L, 1982L, 1983L, 1984L, 1985L, 1962L, 1963L, 1964L, 1965L,
1966L, 1967L, 1969L), datadate = c(19620631L, 19630631L, 19640631L,
19651231L, 19661231L, 19670631L, 19680631L, 19691231L, 19700631L,
19710631L, 19720631L, 19730631L, 19740631L, 19751231L, 19760631L,
19770631L, 19780631L, 19791231L, 19800631L, 19810631L, 19820631L,
19831231L, 19841231L, 19850631L, 19621231L, 19630631L, 19640631L,
19650631L, 19660631L, 19670631L, 19690631L)), .Names = c("cusip", "fyear",
"datadate"), row.names = c(NA, 31L), class = "data.frame")
tdata %>%
  group_by(cusip) %>%
  group_by(fyear) %>%
  arrange(desc(datadate)) %>%
  if(substr(datadate[1], 5,6) != 06) cusip <- NULL
Error:
Error in if (.) as.numeric(substr(datadate[1], 5, 6)) != 6 else cusip <- NULL :
argument is not interpretable as logical
In addition: Warning message:
In if (.) as.numeric(substr(datadate[1], 5, 6)) != 6 else cusip <- NULL :
the condition has length > 1 and only the first element will be used
Why not create a column for the month first? Something like:
library(dplyr)
tdata$month <- substr(tdata$datadate, 5, 6)
tdata %>%
  group_by(cusip, fyear) %>%
  mutate(has_June = month == "06")
Note that the month is a character string, so you need quotes when testing for equality.
All in one go:
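To illustrate why the quotes matter, here is a small sketch using two datadate values taken from the sample data. The variable names (is_june_chr, is_june_num, is_june_int) are made up for this example. The last line shows an integer-arithmetic alternative that avoids strings altogether; it is an assumption on my part that this would suit your data, not something taken from the question.

```r
# Two values from the sample data: one June, one December
datadate <- c(19620631L, 19651231L)

# substr() coerces the integers to character before extracting positions 5-6
month <- substr(datadate, 5, 6)

# Comparing against the string "06" behaves as intended
is_june_chr <- month == "06"

# Unquoted 06 is just the number 6; it is coerced to "6", which never equals "06"
is_june_num <- month == 06

# Alternative without strings: drop the day, then keep the last two digits
is_june_int <- (datadate %/% 100L) %% 100L == 6L
```

Here is_june_chr and is_june_int agree, while is_june_num is FALSE for every row, which is why the original test with != 06 would flag everything as non-June.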
tdata %>%
  group_by(cusip, fyear) %>%
  mutate(month = substr(datadate, 5, 6),
         has_June = month == "06")
You can then find the rows that are not in June by adding: %>% filter(month != "06")
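If the goal is to drop an entire firm whenever one of its years lacks a June observation, as the question describes, the row-level flag can be rolled up by group. The following is only a sketch: firms, firm_years, and complete_firms are hypothetical names, the small data frame is made-up sample data in the same shape as tdata, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical sample in the same shape as tdata (not the real data)
firms <- data.frame(
  cusip    = c(2, 2, 3, 3, 5),
  fyear    = c(1962L, 1965L, 1978L, 1980L, 1963L),
  datadate = c(19620631L, 19651231L, 19780631L, 19800631L, 19630631L)
)

# One row per firm-year: TRUE if any observation in that year falls in June
firm_years <- firms %>%
  mutate(month = substr(datadate, 5, 6)) %>%
  group_by(cusip, fyear) %>%
  summarise(has_June = any(month == "06"), .groups = "drop")

# Keep only firms for which every year has a June observation
complete_firms <- firm_years %>%
  group_by(cusip) %>%
  filter(all(has_June)) %>%
  ungroup()

# Original rows restricted to the firms that passed the check
kept_rows <- semi_join(firms, complete_firms, by = "cusip")
```

In this sample, cusip 2 is dropped because its 1965 observation is in December, while cusip 3 and 5 are kept. The semi_join() at the end is one way to get back the original rows for the surviving firms.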