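dplyr::filter function generates wrong results
I am using the dplyr::filter function to filter data on 3 variables: Sex, Patient.Age, and Country.where.Event.occurred. The first code section below produces the correct result and the second produces a wrong one. As far as I can tell, the two sections use the same expressions, so I am confused about why the results differ.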
> data
# A tibble: 1,360 × 3
Sex Patient.Age Country.where.Event.occurred
<chr> <chr> <chr>
1 Female 12 YR US
2 Female 16 YR KW
3 Female 16 YR US
4 Female 16 YR US
5 Female 16 YR US
6 Female 16 YR US
7 Female 17 YR ES
8 Female 17 YR ES
9 Female 17 YR GB
10 Female 19 YR CA
# … with 1,350 more rows
# unique combination of 3 variables
> key <- data %>%
+ distinct(Sex, Patient.Age,Country.where.Event.occurred)
> key
# A tibble: 399 × 3
Sex Patient.Age Country.where.Event.occurred
<chr> <chr> <chr>
1 Female 12 YR US
2 Female 16 YR KW
3 Female 16 YR US
4 Female 17 YR ES
5 Female 17 YR GB
6 Female 19 YR CA
7 Female 19 YR US
8 Female 2 YR US
9 Female 26 YR US
10 Female 28 YR US
# … with 389 more rows
> data %>%
+ filter(Sex == key[3,]$Sex,
+ Patient.Age == key[3,]$Patient.Age,
+ Country.where.Event.occurred == key[3,]$Country.where.Event.occurred)
# A tibble: 4 × 3
Sex Patient.Age Country.where.Event.occurred
<chr> <chr> <chr>
1 Female 16 YR US
2 Female 16 YR US
3 Female 16 YR US
4 Female 16 YR US
> Sex <- key[3,]$Sex
> Sex
[1] "Female"
> Age <- key[3,]$Patient.Age
> Age
[1] "16 YR"
> Country <- key[3,]$Country.where.Event.occurred
> Country
[1] "US"
> data %>%
+ filter(Sex == Sex,
+ Patient.Age == Age,
+ Country.where.Event.occurred == Country)
# A tibble: 7 × 3
Sex Patient.Age Country.where.Event.occurred
<chr> <chr> <chr>
1 Female 16 YR US
2 Female 16 YR US
3 Female 16 YR US
4 Female 16 YR US
5 Male 16 YR US
6 Male 16 YR US
7 Male 16 YR US
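The problem in the second example is most likely the line filter(Sex == Sex...
Both the left-hand and the right-hand Sex are interpreted as the Sex variable in the dataset. A column always matches itself, so that condition is always true and filters nothing out. The other two conditions behave as expected because the data has no columns named Age or Country, so those names fall through to the variables you created with <-.
I think you intended the right-hand side to be "Female" (judging from your pattern and the other two variables).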
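One possible fix, sketched on a small made-up tibble (toy and its values are placeholders, not the asker's data), is to point the right-hand side at the environment explicitly with the .env pronoun that dplyr's data mask provides:

```r
library(dplyr)

# Toy data standing in for the asker's tibble (values invented for illustration)
toy <- tibble(
  Sex = c("Female", "Female", "Male", "Male"),
  Patient.Age = c("16 YR", "17 YR", "16 YR", "16 YR"),
  Country.where.Event.occurred = c("US", "US", "US", "GB")
)

Sex <- "Female"
Age <- "16 YR"
Country <- "US"

# Sex == Sex compares the column with itself, so it removes no rows;
# only the Age and Country conditions actually filter
masked <- toy %>%
  filter(Sex == Sex,
         Patient.Age == Age,
         Country.where.Event.occurred == Country)

# .env$Sex forces the lookup to use the variable created with <-
fixed <- toy %>%
  filter(Sex == .env$Sex,
         Patient.Age == .env$Age,
         Country.where.Event.occurred == .env$Country)
```

Here masked still contains a Male row, mirroring the 7-row result in the question, while fixed keeps only the Female rows.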
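To understand this in more depth, I suggest reading the Programming with dplyr vignette a few times. At least for me, there is a nugget or two every time I learn or relearn it. For your specific problem, the "Data masking" section is the relevant one: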
The key idea behind data masking is that it blurs the line between the two different meanings of the word “variable”:
env-variables are “programming” variables that live in an environment. They are usually created with <-.
data-variables are “statistical” variables that live in a data frame. They usually come from data files (e.g. .csv, .xls), or are created manipulating existing variables.
...
I think this blurring of the meaning of “variable” is a really nice feature...
Unfortunately, this benefit does not come for free...
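To make the two kinds of variable concrete, here is a minimal sketch of the lookup order, again on invented data. The names toy and target_age are hypothetical; the point is only that a data-variable wins when a name exists in both places, and that either a non-clashing name or the !! injection operator sidesteps the clash.

```r
library(dplyr)

# Invented example data
toy <- tibble(
  Sex = c("Female", "Male", "Male"),
  Patient.Age = c("16 YR", "16 YR", "17 YR")
)

Sex <- "Male"          # env-variable that shares its name with a column
target_age <- "16 YR"  # env-variable with no matching column name

# The column masks the env-variable: every row compares equal to itself
self_compare <- toy %>% mutate(same = Sex == Sex)

# A name that is not a column is looked up in the environment as usual
by_name <- toy %>% filter(Patient.Age == target_age)

# !! evaluates Sex in the environment before the filter runs on the data
by_inject <- toy %>% filter(Sex == !!Sex)
```

Giving env-variables names that cannot collide with column names (target_age rather than Age, say) is probably the least surprising option in interactive code.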