Why is the same query returning different results on different R sessions using dplyr?

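While working on a project with a colleague that involved manipulating data frames with the tidyverse package dplyr, I noticed that some of our results differed, even though we were using the same code and the same data.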

Session info from the two R sessions:

Desktop:

> sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252 
[2] LC_CTYPE=Portuguese_Brazil.1252   
[3] LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C                      
[5] LC_TIME=Portuguese_Brazil.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
 [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3    
 [4] purrr_0.3.3     readr_1.3.1     tidyr_1.0.0    
 [7] tibble_2.1.3    ggplot2_3.2.1   tidyverse_1.3.0
[10] sp_1.3-2      

RStudio Cloud:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] randomNames_1.4-0.0  plotly_4.9.2.1       lubridate_1.7.9     
 [4] openintro_2.0.0      usdata_0.1.0         cherryblossom_0.1.0 
 [7] airports_0.1.0       leaflet_2.0.3        forcats_0.5.0       
[10] stringr_1.4.0        dplyr_1.0.0          purrr_0.3.4         
[13] readr_1.3.1          tidyr_1.1.0          tibble_3.0.2        
[16] ggplot2_3.3.2        tidyverse_1.3.0      shinydashboard_0.7.1
[19] shiny_1.5.0         

A reproducible example using iris:


library(tidyverse)

# Let's say that each flower in the iris data frame had a name


iris$name <-c("Jackson","al-Jalali","Tamblyn","Beckham","Knipp","Chen","el-Hares","al-Shaheen","Boyd","Gurung","Demolli","el-Omer","Christensen","Ayele","Wilson","Arriaga","el-Vaziri","Aragon","Demoudt","Gray","Raiburn","al-Aziz","Phouthavong","John","Bortolutti","Ellis","Williams","Gonzalez","Valenzuela","Smith","el-Ishak","al-Tabet","Perez","Watson","el-Imam","Kerr","Morales-Gonzale","Bell","Haines","Gutierrez","SalcidoIbarra","Jimenez","al-Bari","Gosnell","Kocsis","Pratt","Tenorio","Merriweather","Damiana","al-Jafari","Edwards","Mujkic","Lam","Russell","Christy","el-Zahra","al-Lodi","Murry","Haro","Chu","Espinoza","Sahnd","Sands","el-Nagi","Dickerson","Carlton","Flood","Tran","Cruz","Yu","West","Franklin","Dupree","Delger","White","Olivero","Sem","al-Muhammed","Shafer","Senette","Hudson","Lattimer","Lyons","Grim","Grove","Truong","LynnGoin","el-Hassan","Cline","Adams","Watkins","Littlejohn","Gatzke","Vandyke","Yocum","Ng","Ortiz","Schwartz","Torres","Hernandez","Krien","Thyfault","al-Ansari","el-Shahin","el-Hashemi","Hereford","Navajo","Bickel","Saiganesh","Polson","Bates","Griffith","Krueger","Yang","AlAmin","Linthicum","Gallegos","Murphy","Johnson","Basurto","Rendon","el-Minhas","Khan","al-Ebrahim","Macgilvray","Farrell","Ricord","Lovato","Sanchez","Palmer","Turner","al-Fares","Ball","Ji","OrtizMorales","Fan","Isaac","Barger","Eddins","Fabrizio","Hedin","Brodsky","Leggett","Le","Guichard","al-Rahim","Benefiel","Sullivan","Milender","Smith")
  

# and that, for some reason, the same flower can appear more than once in the data frame
sample_index<-c(14,50,118,43,14,118,90,91,91,92,137,99,72,26,
7,137,78,81,43,103,117,76,143,32,109,7,137,74,
23,53,135,53,34,69,72,76,63,141,97,91,38,21,
41,90,60,16,116,94,6,86,86,39,118,50,34,4,
13,69,127,52,22,89,25,35,112,30,140,121,110,64,
142,67,122,79,85,136,51,74,106,98,74,127,17,46,
54,110,94,79,24,113,107,135,102,135,5,70,16,24,
32,21)

iris_big <- rbind(iris,iris[sample_index,])

I wanted to know how many unique flowers there are in each species, so I wrote the following query:

 
iris_big %>% 
  group_by(name,Species) %>% 
  count() %>% 
  ungroup() %>% 
  count(Species)

The problem is that it returns two different results, one on my desktop and another on my friend's machine (he is using RStudio Cloud).

My desktop:

# A tibble: 3 x 2
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

RStudio Cloud:


Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
# A tibble: 3 x 2
  Species        n
  <fct>      <int>
1 setosa        83
2 versicolor    80
3 virginica     87

I ended up working around the problem with the following query:

iris_big %>% 
  group_by(name,Species) %>% 
  count() %>% 
  ungroup() %>%
  select(Species) %>% 
  group_by(Species) %>% 
  count()

# A tibble: 3 x 2
# Groups:   Species [3]
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50

But I would like to know why this happens.

You are using sample, which draws from a discrete uniform distribution.

In R's PR#17494 (and associated mailing-list thread), a problem with non-uniform sampling was discussed and fixed. The fix took effect in R-3.6.

This can be demonstrated simply:

  • R-3.5.3, 64-bit (win10)

    set.seed(123) ; sample(5)
    # [1] 2 4 5 3 1
    
  • R-3.6.1, 64-bit (win10)

    set.seed(123) ; sample(5)
    # [1] 3 2 5 4 1
    
  • R-4.0.2, 64-bit (win10)

    set.seed(123) ; sample(5)
    # [1] 3 2 5 4 1
    

In R-3.6 and newer, you can revert to the pre-3.6 sampling:

RNGkind(sample.kind = "Rounding")
# Warning in RNGkind(sample.kind = "Rounding") :
#   non-uniform 'Rounding' sampler used
set.seed(123) ; sample(5)
# [1] 2 4 5 3 1
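
If you want to compare the two samplers inside a single session without leaving the session in "Rounding" mode, something like the following sketch may help. It assumes R >= 3.6 and the default "Mersenne-Twister" generator; the variable names are only illustrative.

```r
# Sketch: compare the two samplers under the same seed (requires R >= 3.6).
old_kind <- RNGkind()                      # c(kind, normal.kind, sample.kind)

RNGkind(sample.kind = "Rejection")         # the default sampler since R 3.6
set.seed(123)
draw_rejection <- sample(5)

# "Rounding" is the pre-3.6 sampler; R warns that it is non-uniform.
suppressWarnings(RNGkind(sample.kind = "Rounding"))
set.seed(123)
draw_rounding <- sample(5)

# Restore the sampler the session started with.
suppressWarnings(RNGkind(sample.kind = old_kind[3]))

# The two draws are permutations of 1:5 but differ for this seed,
# consistent with the outputs shown above.
identical(draw_rejection, draw_rounding)
```

Restoring the original sample.kind at the end keeps later calls to sample() in the session on the sampler they would normally use.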

I don't think you are getting what you think you are. Consider:

> unique(iris_big$Species)
[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica
> sum(iris_big$Species == 'setosa')
[1] 83
> sum(iris_big$Species == 'versicolor')
[1] 80

What are you trying to reduce?
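
To make the distinction concrete: 83 and 80 are row counts, while 50/50/50 is the number of distinct (name, Species) pairs. The base-R sketch below uses a small stand-in for iris_big; the generated names and the duplicated row indices are hypothetical, chosen only to keep the example short.

```r
# Stand-in for iris_big: hypothetical unique names, a few rows duplicated.
flowers <- iris
flowers$name <- paste0("flower_", seq_len(nrow(flowers)))   # hypothetical names
dup_index <- c(1, 1, 51, 101, 150)                          # hypothetical resample
flowers_big <- rbind(flowers, flowers[dup_index, ])

# Rows per species: a duplicated flower is counted every time it appears.
rows_per_species <- table(flowers_big$Species)

# Distinct flowers per species: drop repeated (name, Species) pairs first.
distinct_pairs <- unique(flowers_big[, c("name", "Species")])
flowers_per_species <- table(distinct_pairs$Species)

rows_per_species      # setosa 52, versicolor 51, virginica 52
flowers_per_species   # 50 for every species
```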

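(First, I am submitting this as an alternative answer because my other answer (about the change in sample.int between R-3.5 and R-3.6) still seems relevant to the question "why is the same query returning different results on different R sessions". It is not the cause of this symptom, but it easily could have been, since the first version of your question used sample. The real culprit here is an equally "major" version change in dplyr.)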

You are experiencing a breaking change in the behavior of dplyr::count.

In dplyr-0.8.3, ?count says:

      wt: (Optional) If omitted (and no variable named 'n' exists in
          the data), will count the number of rows. If specified, will
          perform a "weighted" tally by summing the (non-missing)
          values of variable 'wt'. A column named 'n' (but not 'nn' or
          'nnn') will be used as weighting variable by default in
          'tally()', but not in 'count()'. This argument is
          automatically quoted and later evaluated in the context of
          the data frame. It supports unquoting. See
          'vignette("programming")' for an introduction to these
          concepts.

In dplyr-1.0.0:

      wt: <'data-masking'> Frequency weights. Can be a variable (or
          combination of variables) or 'NULL'. 'wt' is computed once
          for each unique combination of the counted variables.

            • If a variable, 'count()' will compute 'sum(wt)' for each
              unique combination.

            • If 'NULL', the default, the computation depends on
              whether a column of frequency counts 'n' exists in the
              data frame. If it exists, the counts are computed with
              'sum(n)' for each unique combination. Otherwise, 'n()' is
              used to compute the counts. Supply 'wt = n()' to force
              this behaviour even if you have an 'n' column in the data
              frame.

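The important part is that in 0.8.3 it says a column named 'n' "will be used as weighting variable by default in 'tally()', but not in 'count()'". In 1.0.0, that wording is gone, and an existing 'n' column is summed by default. Your first count() creates exactly such an 'n' column, so the second count() adds up the row counts (83/80/87) instead of counting the distinct flowers (50/50/50). I reproduced your results using R-3.5.3/dplyr-0.8.3 and R-4.0.2/dplyr-1.0.0.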
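
The two behaviours can be pinned down on a toy data frame by passing wt explicitly, which sidesteps the version-dependent default. This is only a sketch: the data frame df and its columns are made up for illustration.

```r
library(dplyr)

# Toy data with a column that happens to be called `n`.
df <- data.frame(g = c("a", "a", "b"), n = c(2L, 3L, 5L),
                 stringsAsFactors = FALSE)

# Default weighting: per the help excerpts above, the result depends on the
# installed dplyr version, so no expected output is given here.
count(df, g)

# Explicit weight: sums the `n` column within each group (a = 5, b = 5).
weighted <- count(df, g, wt = n)

# Explicit row count: n() is the group size (a = 2, b = 1).
row_counts <- count(df, g, wt = n())
```

Spelling out wt in code that has to run on more than one machine makes the intent visible and independent of the default.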

There are two ways around it:

  1. Use count(..., wt = n()):

    R.version$version.string
    # [1] "R version 3.5.3 (2019-03-11)"
    iris_big %>%
      group_by(name,Species) %>%
      count() %>%
      ungroup() %>%
      count(Species, wt = n())
    # # A tibble: 3 x 2
    #   Species        n
    #   <fct>      <int>
    # 1 setosa        50
    # 2 versicolor    50
    # 3 virginica     50
    
    R.version$version.string
    # [1] "R version 4.0.2 (2020-06-22)"
    iris_big %>%
      group_by(name,Species) %>%
      count() %>%
      ungroup() %>%
      count(Species, wt = n())
    # # A tibble: 3 x 2
    #   Species        n
    #   <fct>      <int>
    # 1 setosa        50
    # 2 versicolor    50
    # 3 virginica     50
    
  2. Switch to using tally within a grouping, as in

    iris_big %>%
      group_by(name,Species) %>%
      count() %>%
      group_by(Species) %>%
      tally()
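
A further possibility, which is not from the answer itself, is to avoid creating the intermediate 'n' column at all, so there is nothing for count() to pick up as a weight. The sketch below assumes distinct() suits your real data; it uses a small stand-in for iris_big with hypothetical names and hypothetical duplicated rows.

```r
library(dplyr)

# Stand-in for iris_big: hypothetical unique names, a few rows duplicated.
flowers <- iris
flowers$name <- paste0("flower_", seq_len(nrow(flowers)))   # hypothetical names
flowers_big <- rbind(flowers, flowers[c(1, 1, 51, 101, 150), ])

unique_per_species <- flowers_big %>%
  distinct(name, Species) %>%   # one row per flower, no helper `n` column
  count(Species)                # plain row count: 50 per species

unique_per_species
```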
    

Or you can take another option:

  1. Recognize that this is issue dplyr#5298, which is fixed in the yet-to-be-released dplyr-1.0.1 (I do not know a timeline). With that, the RStudio Cloud user can opt for the GitHub version of dplyr to benefit from dplyr#5349, a PR that has already been merged. This should revert count's behavior back to the pre-1.0.0 behavior (despite Hadley's opinion