dplyr::filter function generates wrong results

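I am using the dplyr::filter function to filter data on 3 variables: Sex, Patient.Age, and Country.where.Event.occurred. The first code section below produces the correct result, and the second produces a wrong result. As far as I can tell, the two sections use the same expressions, so I am confused about why the results differ.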

> data
# A tibble: 1,360 × 3
   Sex    Patient.Age Country.where.Event.occurred
   <chr>  <chr>       <chr>                       
 1 Female 12 YR       US                          
 2 Female 16 YR       KW                          
 3 Female 16 YR       US                          
 4 Female 16 YR       US                          
 5 Female 16 YR       US                          
 6 Female 16 YR       US                          
 7 Female 17 YR       ES                          
 8 Female 17 YR       ES                          
 9 Female 17 YR       GB                          
10 Female 19 YR       CA                          
# … with 1,350 more rows

# unique combination of 3 variables
> key <- data %>% 
+   distinct(Sex, Patient.Age,Country.where.Event.occurred)
> key
# A tibble: 399 × 3
   Sex    Patient.Age Country.where.Event.occurred
   <chr>  <chr>       <chr>                       
 1 Female 12 YR       US                          
 2 Female 16 YR       KW                          
 3 Female 16 YR       US                          
 4 Female 17 YR       ES                          
 5 Female 17 YR       GB                          
 6 Female 19 YR       CA                          
 7 Female 19 YR       US                          
 8 Female 2 YR        US                          
 9 Female 26 YR       US                          
10 Female 28 YR       US                          
# … with 389 more rows

> data %>%
+   filter(Sex == key[3,]$Sex,
+          Patient.Age == key[3,]$Patient.Age,
+          Country.where.Event.occurred == key[3,]$Country.where.Event.occurred)
# A tibble: 4 × 3
  Sex    Patient.Age Country.where.Event.occurred
  <chr>  <chr>       <chr>                       
1 Female 16 YR       US                          
2 Female 16 YR       US                          
3 Female 16 YR       US                          
4 Female 16 YR       US 
> Sex <- key[3,]$Sex
> Sex
[1] "Female"
> Age <- key[3,]$Patient.Age
> Age
[1] "16 YR"
> Country <- key[3,]$Country.where.Event.occurred
> Country
[1] "US"
> data %>%
+   filter(Sex == Sex,
+          Patient.Age == Age,
+          Country.where.Event.occurred == Country)
# A tibble: 7 × 3
  Sex    Patient.Age Country.where.Event.occurred
  <chr>  <chr>       <chr>                       
1 Female 16 YR       US                          
2 Female 16 YR       US                          
3 Female 16 YR       US                          
4 Female 16 YR       US                          
5 Male   16 YR       US                          
6 Male   16 YR       US                          
7 Male   16 YR       US         

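The problem in the second example is probably the line filter(Sex == Sex....

Both the left-hand and the right-hand Sex are interpreted as the Sex variable in the dataset. A column always matches itself, so that condition is always TRUE.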

I assume you intended the right-hand side to be "Female" (judging from the pattern of your other two conditions).
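One way to get the intended comparison is sketched below. The small `data` tibble is a made-up stand-in for the asker's data, not the real 1,360 rows, and `target_sex` is a hypothetical name chosen for illustration. The sketch assumes a dplyr version that supports the `.env` pronoun (1.0.0 or later), which tells `filter()` to look a name up in the calling environment instead of in the data frame. Renaming the environment variable so that it does not collide with a column name should work as well.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data
data <- tibble(
  Sex = c("Female", "Female", "Male", "Male"),
  Patient.Age = c("16 YR", "17 YR", "16 YR", "16 YR"),
  Country.where.Event.occurred = c("US", "ES", "US", "US")
)

Sex <- "Female"
Age <- "16 YR"
Country <- "US"

# Sex == Sex compares the column with itself, so Male rows slip through
masked <- data %>%
  filter(Sex == Sex,
         Patient.Age == Age,
         Country.where.Event.occurred == Country)

# .env$Sex forces the lookup in the calling environment
fixed <- data %>%
  filter(Sex == .env$Sex,
         Patient.Age == .env$Age,
         Country.where.Event.occurred == .env$Country)

# Alternative: use a name that is not also a column name
target_sex <- "Female"
renamed <- data %>%
  filter(Sex == target_sex,
         Patient.Age == Age,
         Country.where.Event.occurred == Country)
```

With this stand-in data, `masked` keeps the Male rows that match on age and country, while `fixed` and `renamed` keep only the Female row.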


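For a deeper understanding of this, I recommend reading the Programming with dplyr vignette a few times. For me at least, there is a nugget or two every time I learn/relearn it. For your specific problem, the "Data masking" section is the relevant one.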

The key idea behind data masking is that it blurs the line between the two different meanings of the word “variable”:

  • env-variables are “programming” variables that live in an environment. They are usually created with <-.

  • data-variables are “statistical” variables that live in a data frame. They usually come from data files (e.g. .csv, .xls), or are created manipulating existing variables.

...

I think this blurring of the meaning of “variable” is a really nice feature...

Unfortunately, this benefit does not come for free...
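To make the two meanings of "variable" concrete, here is a minimal sketch in which the same name `x` exists both as a column and as an object in the environment. The tibble `df` and the column names `from_data`, `from_env`, and `explicit` are invented for illustration. It assumes the `.data` and `.env` pronouns available inside dplyr verbs.

```r
library(dplyr)

df <- tibble(x = 1:3)  # data-variable: a column named x
x <- 100               # env-variable: an object named x, created with <-

out <- df %>%
  mutate(
    from_data = x,        # a bare x resolves to the column first
    from_env  = .env$x,   # explicitly the environment object
    explicit  = .data$x   # explicitly the column
  )
```

Here a bare `x` picks up the column rather than the environment object, which is the same lookup rule that made `Sex == Sex` compare the column with itself.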