Incorrect translation of group-filter-select with dtplyr
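A group-filter-select is easy to perform with dplyr. In the example below, we have data for several companies across different quarters of this year. I want to filter down to the first quarter of the companies that have no fourth-quarter data (in this example, the second company) and drop the quarter label.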
library(dplyr)

df <- data.frame(companyId = c(rep(1, 4),
                               rep(2, 3),
                               rep(3, 4)),
                 Quarter = c(1:4, 1:3, 1:4),
                 Year = 2019)
q <- 4  # the quarter that must be missing for a company to qualify
df %>%
  group_by(companyId) %>%
  filter(Quarter == 1 &
           !(q %in% Quarter)) %>%
  select(companyId, Year)
> # A tibble: 1 x 2
> # Groups: companyId [1]
> companyId Year
> <dbl> <dbl>
> 1 2 2019
However, doing the same thing with dtplyr returns an empty table:
library(data.table)
library(dtplyr)

dt <- lazy_dt(data.table(companyId = c(rep(1, 4),
                                       rep(2, 3),
                                       rep(3, 4)),
                         Quarter = c(1:4, 1:3, 1:4),
                         Year = 2019))
q <- 4
dt %>%
  group_by(companyId) %>%
  filter(Quarter == 1 &
           !(q %in% Quarter)) %>%
  select(companyId, Year)
> Source: local data table [?? x 2]
> Call: `_DT1`[Quarter == 1 & !(q %in% Quarter), .(companyId,
> Year)]
>
> # ... with 2 variables: companyId <dbl>, Year <dbl>
>
> # Use as.data.table()/as.data.frame()/as_tibble() to access results
The odd part is the translation that is displayed:
`_DT1`[Quarter == 1 & !(q %in% Quarter),
.(companyId, Year)]
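This is incorrect: the grouping has been dropped, so q %in% Quarter is evaluated over the whole table rather than per company. As described in dtplyr's own docs, the correct call needs to use a filtered .SD: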
`_DT1`[, .SD[Quarter == 1 & !(q %in% Quarter)],
by = .(companyId),
.SDcols = c("Year")]
(The by columns are included automatically, so .SDcols should omit them to avoid duplication.)
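To make the difference between the two translations concrete, here is a minimal sketch that runs both forms directly in data.table, without dtplyr. The object names DT, ungrouped and grouped are illustrative only, and the grouped call selects the columns in a second step instead of using .SDcols, which is a simplification on my part rather than what dtplyr generates.

```r
library(data.table)

# Same toy data as in the question, as a plain data.table
DT <- data.table(companyId = c(rep(1, 4), rep(2, 3), rep(3, 4)),
                 Quarter = c(1:4, 1:3, 1:4),
                 Year = 2019)
q <- 4

# Ungrouped form (what dtplyr displayed): q %in% Quarter is evaluated once
# over the whole Quarter column, so the condition is FALSE for every row.
ungrouped <- DT[Quarter == 1 & !(q %in% Quarter), .(companyId, Year)]

# Grouped form: the condition is evaluated per company inside .SD,
# then the wanted columns are picked in a second step.
grouped <- DT[, .SD[Quarter == 1 & !(q %in% Quarter)], by = .(companyId)][
  , .(companyId, Year)]

print(nrow(ungrouped))
print(grouped)
```

With this data, the ungrouped form should return no rows, while the grouped form should keep only the first-quarter row of company 2.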
Interestingly, if we omit select, the translation (and therefore the output) is correct:
dt %>%
group_by(
companyId
) %>%
filter(
Quarter == 1 &
!(q %in% Quarter)
)
> Source: local data table [?? x 3]
> Call: `_DT2`[, .SD[Quarter == 1 & !(q %in% Quarter)],
> keyby = .(companyId)]
>
> companyId Quarter Year
> <dbl> <int> <dbl>
> 1 2 1 2019
So, as a workaround, I can call as.data.table() before select. This works, but it raises an annoying warning:
dt %>%
  group_by(companyId) %>%
  filter(Quarter == 1 &
           !(q %in% Quarter)) %>%
  as.data.table() %>%
  select(companyId, Year)
> companyId Year
> 1: 2 2019
> Warning message:
> You are using a dplyr method on a raw data.table, which will call the data frame implementation,
> and is likely to be inefficient.
> *
> * To suppress this message, either generate a data.table translation with `lazy_dt()` or convert
> * to a data frame or tibble with `as.data.frame()`/`as_tibble()`.
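The warning itself points at one way to avoid it: convert to a tibble with as_tibble() rather than as.data.table(), so that select runs on an ordinary tibble. The sketch below assumes that evaluating the grouped filter on its own is translated correctly, as the output without select suggests; the ungroup() step and the name res are my additions for illustration.

```r
library(data.table)
library(dtplyr)
library(dplyr)

dt <- lazy_dt(data.table(companyId = c(rep(1, 4), rep(2, 3), rep(3, 4)),
                         Quarter = c(1:4, 1:3, 1:4),
                         Year = 2019))
q <- 4

res <- dt %>%
  group_by(companyId) %>%
  filter(Quarter == 1 & !(q %in% Quarter)) %>%
  as_tibble() %>%   # force evaluation of the grouped filter here
  ungroup() %>%     # drop the grouping so select() returns a plain tibble
  select(companyId, Year)

print(res)
```

The trade-off is that everything after as_tibble() runs through dplyr's data frame methods rather than data.table, which should be harmless once the filter has already reduced the data to a few rows.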
I find it hard to believe that this is the intended behavior, but I wanted to check here before raising it on the dtplyr GitHub tracker.
This is currently a bug in dtplyr. I have posted it to the package's GitHub.