Why is the same query returning different results on different R sessions using dplyr?
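While working on a project with a colleague that involves manipulating data frames with the tidyverse package dplyr, I noticed that some of our results differ even though we use the same code and the same data.
Session info from the two R sessions:
Desktop: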
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=Portuguese_Brazil.1252
[2] LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
[4] purrr_0.3.3 readr_1.3.1 tidyr_1.0.0
[7] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0
[10] sp_1.3-2
RStudio Cloud:
> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS
Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomNames_1.4-0.0 plotly_4.9.2.1 lubridate_1.7.9
[4] openintro_2.0.0 usdata_0.1.0 cherryblossom_0.1.0
[7] airports_0.1.0 leaflet_2.0.3 forcats_0.5.0
[10] stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4
[13] readr_1.3.1 tidyr_1.1.0 tibble_3.0.2
[16] ggplot2_3.3.2 tidyverse_1.3.0 shinydashboard_0.7.1
[19] shiny_1.5.0
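Because the two sessions differ in several package versions at once, it can help to list only the packages whose versions actually differ. The following is a minimal base-R sketch, assuming the versions are copied by hand from the two sessionInfo() outputs above; the vector names `desktop` and `cloud` are illustrative, and only a handful of the attached packages are included.

```r
# Versions copied by hand from the two sessionInfo() outputs (subset only).
desktop <- c(dplyr = "0.8.3", tidyr = "1.0.0", tibble = "2.1.3", stringr = "1.4.0")
cloud   <- c(dplyr = "1.0.0", tidyr = "1.1.0", tibble = "3.0.2", stringr = "1.4.0")

# Packages attached in both sessions.
shared <- intersect(names(desktop), names(cloud))

# Keep only the packages whose version strings are not identical.
differs <- shared[desktop[shared] != cloud[shared]]

print(data.frame(
  package = differs,
  desktop = unname(desktop[differs]),
  cloud   = unname(cloud[differs])
))
```

In this subset, dplyr, tidyr and tibble differ while stringr does not, which narrows down where to look for a behavior change.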
Reproducible example using iris:
library(tidyverse)
# let's say that each flower in the iris data frame has a name
iris$name <-c("Jackson","al-Jalali","Tamblyn","Beckham","Knipp","Chen","el-Hares","al-Shaheen","Boyd","Gurung","Demolli","el-Omer","Christensen","Ayele","Wilson","Arriaga","el-Vaziri","Aragon","Demoudt","Gray","Raiburn","al-Aziz","Phouthavong","John","Bortolutti","Ellis","Williams","Gonzalez","Valenzuela","Smith","el-Ishak","al-Tabet","Perez","Watson","el-Imam","Kerr","Morales-Gonzale","Bell","Haines","Gutierrez","SalcidoIbarra","Jimenez","al-Bari","Gosnell","Kocsis","Pratt","Tenorio","Merriweather","Damiana","al-Jafari","Edwards","Mujkic","Lam","Russell","Christy","el-Zahra","al-Lodi","Murry","Haro","Chu","Espinoza","Sahnd","Sands","el-Nagi","Dickerson","Carlton","Flood","Tran","Cruz","Yu","West","Franklin","Dupree","Delger","White","Olivero","Sem","al-Muhammed","Shafer","Senette","Hudson","Lattimer","Lyons","Grim","Grove","Truong","LynnGoin","el-Hassan","Cline","Adams","Watkins","Littlejohn","Gatzke","Vandyke","Yocum","Ng","Ortiz","Schwartz","Torres","Hernandez","Krien","Thyfault","al-Ansari","el-Shahin","el-Hashemi","Hereford","Navajo","Bickel","Saiganesh","Polson","Bates","Griffith","Krueger","Yang","AlAmin","Linthicum","Gallegos","Murphy","Johnson","Basurto","Rendon","el-Minhas","Khan","al-Ebrahim","Macgilvray","Farrell","Ricord","Lovato","Sanchez","Palmer","Turner","al-Fares","Ball","Ji","OrtizMorales","Fan","Isaac","Barger","Eddins","Fabrizio","Hedin","Brodsky","Leggett","Le","Guichard","al-Rahim","Benefiel","Sullivan","Milender","Smith")
#and that for some reason the same flower can appear more than once in the data frame
sample_index<-c(14,50,118,43,14,118,90,91,91,92,137,99,72,26,
7,137,78,81,43,103,117,76,143,32,109,7,137,74,
23,53,135,53,34,69,72,76,63,141,97,91,38,21,
41,90,60,16,116,94,6,86,86,39,118,50,34,4,
13,69,127,52,22,89,25,35,112,30,140,121,110,64,
142,67,122,79,85,136,51,74,106,98,74,127,17,46,
54,110,94,79,24,113,107,135,102,135,5,70,16,24,
32,21)
iris_big <- rbind(iris,iris[sample_index,])
I want to know how many unique flowers each species has, so I wrote the following query:
iris_big %>%
group_by(name,Species) %>%
count() %>%
ungroup() %>%
count(Species)
The problem is that it returns two different results: one on my desktop and another on my friend's machine (he is using RStudio Cloud).
My desktop:
# A tibble: 3 x 2
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
RStudio Cloud:
Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
# A tibble: 3 x 2
Species n
<fct> <int>
1 setosa 83
2 versicolor 80
3 virginica 87
I ended up working around the problem with the following query:
iris_big %>%
group_by(name,Species) %>%
count() %>%
ungroup() %>%
select(Species) %>%
group_by(Species) %>%
count()
# A tibble: 3 x 2
# Groups: Species [3]
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
But I would like to know why this happens.
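You are using sample, which draws from a discrete uniform distribution.
In R's PR#17494 (and the associated mailing-list thread), a non-uniformity problem in sampling was discussed and fixed. The fix took effect in R-3.6.
This is easy to demonstrate:
R-3.5.3, 64-bit (win10)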
set.seed(123) ; sample(5)
# [1] 2 4 5 3 1
R-3.6.1, 64-bit (win10)
set.seed(123) ; sample(5)
# [1] 3 2 5 4 1
R-4.0.2, 64-bit (win10)
set.seed(123) ; sample(5)
# [1] 3 2 5 4 1
In R-3.6 and later, you can go back to the pre-3.6 sampling:
RNGkind(sample.kind = "Rounding")
# Warning in RNGkind(sample.kind = "Rounding") :
# non-uniform 'Rounding' sampler used
set.seed(123) ; sample(5)
# [1] 2 4 5 3 1
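If the goal is to get the same draws on every machine running R-3.6 or later, one option is to state the sampler explicitly instead of relying on whatever the session currently uses. This is a minimal sketch under the assumption that all sessions run R >= 3.6, where set.seed() accepts a sample.kind argument; the variable names are illustrative.

```r
# In R >= 3.6, RNGkind() reports three settings: kind, normal.kind, sample.kind.
current_kinds <- RNGkind()
print(current_kinds)

# Fix both the seed and the sampler so the draw does not depend on a
# sampler setting left over from earlier code in the session.
set.seed(123, sample.kind = "Rejection")
first_draw <- sample(5)

set.seed(123, sample.kind = "Rejection")
second_draw <- sample(5)

# The two draws match because seed and sampler are identical.
print(identical(first_draw, second_draw))
```

Note that this does not make R-3.5 and R-3.6 agree; for that, the newer session has to opt into "Rounding" as shown above.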
I don't think you are getting what you think you are. Consider:
> unique(iris_big$Species)
[1] setosa versicolor virginica
Levels: setosa versicolor virginica
> sum(iris_big$Species == 'setosa')
[1] 83
> sum(iris_big$Species == 'versicolor')
[1] 80
What are you trying to reduce?
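The sums above count rows, not distinct flowers, which is why they come out as 83 and 80 rather than 50. The difference can be sketched in base R on a small made-up data frame (the `flowers` data and all names in it are hypothetical, chosen only to keep the example short):

```r
# Toy data: "Smith" is a different flower in each species, and two of the
# setosa flowers appear twice.
flowers <- data.frame(
  name    = c("Smith", "Smith", "Chen", "Chen", "Smith"),
  Species = c("setosa", "setosa", "setosa", "setosa", "virginica"),
  stringsAsFactors = FALSE
)

# Number of rows per species (duplicates included).
rows_per_species <- table(flowers$Species)

# Number of distinct (name, Species) pairs per species.
distinct_flowers    <- unique(flowers[c("name", "Species")])
flowers_per_species <- table(distinct_flowers$Species)

print(rows_per_species)
print(flowers_per_species)
```

In dplyr, `distinct(name, Species)` followed by `count(Species)` should express the same idea without ever creating an intermediate `n` column, though that variant was not run against the versions discussed here.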
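(First, I am submitting this as an alternative answer because my other answer, about the change in sample.int between R-3.5 and R-3.6, still seems relevant to the question "why is the same query returning different results on different R sessions". It is not the cause of this symptom, but it is an easy thing to suspect, since the first version of your question used sample. The real culprit here is an equally "major" version change in dplyr.)
You are experiencing a significant change in the behavior of dplyr::count.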
In dplyr-0.8.3, ?count says:
wt: (Optional) If omitted (and no variable named 'n' exists in
the data), will count the number of rows. If specified, will
perform a "weighted" tally by summing the (non-missing)
values of variable 'wt'. A column named 'n' (but not 'nn' or
'nnn') will be used as weighting variable by default in
'tally()', but not in 'count()'. This argument is
automatically quoted and later evaluated in the context of
the data frame. It supports unquoting. See
'vignette("programming")' for an introduction to these
concepts.
In dplyr-1.0.0:
wt: <'data-masking'> Frequency weights. Can be a variable (or
combination of variables) or 'NULL'. 'wt' is computed once
for each unique combination of the counted variables.
• If a variable, 'count()' will compute 'sum(wt)' for each
unique combination.
• If 'NULL', the default, the computation depends on
whether a column of frequency counts 'n' exists in the
data frame. If it exists, the counts are computed with
'sum(n)' for each unique combination. Otherwise, 'n()' is
used to compute the counts. Supply 'wt = n()' to force
this behaviour even if you have an 'n' column in the data
frame.
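The important part is that in 0.8.3 the documentation says a column named 'n' "will be used as weighting variable by default in 'tally()', but not in 'count()'". In 1.0.0 that wording is gone; instead, when a column 'n' exists, count() computes sum(n) by default. In your query the first count() creates exactly such an 'n' column, so under 1.0.0 the second count(Species) sums it and returns the total number of rows per species (83, 80, 87) rather than the number of distinct flowers (50 each). I reproduced your results using R-3.5.3/dplyr-0.8.3 and R-4.0.2/dplyr-1.0.0.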
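The two behaviors can be illustrated without dplyr at all. The sketch below uses base R on a small made-up intermediate result (the `per_flower` data frame is hypothetical and stands in for the output of the first count()); it mimics what each version computes in the second step and is not dplyr's actual implementation.

```r
# Stand-in for the output of the first count(): one row per flower,
# with n = how many times that flower appears in the data.
per_flower <- data.frame(
  Species = c("setosa", "setosa", "virginica"),
  n       = c(2L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Unweighted: number of rows per species, as described for count()
# in the dplyr-0.8.3 documentation.
unweighted <- tapply(per_flower$n, per_flower$Species, length)

# Weighted: sum of the existing n column per species, as described for
# the default in the dplyr-1.0.0 documentation when a column n exists.
weighted <- tapply(per_flower$n, per_flower$Species, sum)

print(unweighted)  # distinct flowers per species
print(weighted)    # total rows per species
```

The unweighted result corresponds to the desktop output and the weighted one to the RStudio Cloud output.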
There are two ways around it:
Use count(..., wt=n()):
R.version$version.string
# [1] "R version 3.5.3 (2019-03-11)"
iris_big %>%
group_by(name,Species) %>%
count() %>%
ungroup() %>%
count(Species, wt = n())
# # A tibble: 3 x 2
# Species n
# <fct> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
R.version$version.string
# [1] "R version 4.0.2 (2020-06-22)"
iris_big %>%
group_by(name,Species) %>%
count() %>%
ungroup() %>%
count(Species, wt = n())
# # A tibble: 3 x 2
# Species n
# <fct> <int>
# 1 setosa 50
# 2 versicolor 50
# 3 virginica 50
Switch to using tally within a grouping, as in
iris_big %>%
group_by(name,Species) %>%
count() %>%
group_by(Species) %>%
tally()
Or you can choose another option:
- Recognize that this is issue dplyr#5298, which is fixed in the yet-to-be-released dplyr-1.0.1 (I do not know the timeline). In the meantime, the RStudio Cloud user can opt for the GitHub version of dplyr to benefit from dplyr#5349, a PR that has already been merged. This should revert count's behavior to the pre-1.0.0 behavior (despite Hadley's opinion).