Select entire group if one selection has the given value
   datetime   label                                option_title           option_value lead difference
1  2016-07-22 GE                                   3 - Commercial Review             3    2         -1
2  2017-02-20 GE                                   2 - Solution Review               2    1         -1
3  2017-02-20 GE                                   1 - Opportunity Review            1    2          1
4  2017-04-18 GE                                   2 - Solution Review               2    3          1
5  2017-04-19 GE                                   3 - Commercial Review             3    4          1
6  2017-04-19 GE                                   4 - Submit Proposal               4    5          1
7  2017-08-08 GE                                   5 - Proposal Awarded              5   NA         NA
8  2016-08-02 HSBC                                 5 - Proposal Awarded              5    6          1
9  2016-12-13 HSBC                                 6 - Delivery Phase 1              6    7          1
10 2017-08-07 HSBC                                 7 - Phase 1 Live                  7   NA         NA
11 2016-07-22 Lowes                                Pre-Qualification                 0   NA         NA
12 2016-08-02 Danske Bank                          6 - Delivery Phase 1              6   NA         NA
13 2016-07-22 AP Moller Maersk (IT Transformation) 3 - Commercial Review             3   NA         NA
14 2016-07-22 BHP Billiton - APJ                   Pre-Qualification                 0    2          2
15 2016-07-26 BHP Billiton - APJ                   2 - Solution Review               2    0         -2
16 2016-07-26 BHP Billiton - APJ                   Pre-Qualification                 0    2          2
From this data frame I want to create a new one that keeps only the labels that have a negative "difference" value. However, I want to select every row belonging to those labels, like this:
   datetime   label              option_title           option_value lead difference
1  2016-07-22 GE                 3 - Commercial Review             3    2         -1
2  2017-02-20 GE                 2 - Solution Review               2    1         -1
3  2017-02-20 GE                 1 - Opportunity Review            1    2          1
4  2017-04-18 GE                 2 - Solution Review               2    3          1
5  2017-04-19 GE                 3 - Commercial Review             3    4          1
6  2017-04-19 GE                 4 - Submit Proposal               4    5          1
7  2017-08-08 GE                 5 - Proposal Awarded              5   NA         NA
8  2016-07-22 BHP Billiton - APJ Pre-Qualification                 0    2          2
9  2016-07-26 BHP Billiton - APJ 2 - Solution Review               2    0         -2
10 2016-07-26 BHP Billiton - APJ Pre-Qualification                 0    2          2
I'm not sure how to do this in dplyr. Would SQL be better? (I haven't used the SQL packages in R much.)
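You can use an IN clause on a subselect:
select * from my_table
where label in (
  select label from my_table
  where difference < 0
)
or a join on the subselect (distinct stops a label with several negative rows from duplicating the joined rows):
select m.* from my_table m
INNER JOIN (
  select distinct label from my_table
  where difference < 0
) t on m.label = t.label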
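For readers who would rather stay in R, the join form can be mirrored with base R's merge(). This is only a sketch: the small my_table data frame and the neg helper are made up for illustration and are not part of the question's data.

```r
# Hypothetical sample data: label "a" has two negative rows, "c" has one.
my_table <- data.frame(
  label = c("a", "a", "a", "b", "c", "c"),
  difference = c(-1, -2, 3, 4, -5, NA),
  stringsAsFactors = FALSE
)

# Subselect: distinct labels with at least one negative difference.
# which() drops the NA comparison instead of letting it through as NA.
neg <- data.frame(
  label = unique(my_table$label[which(my_table$difference < 0)]),
  stringsAsFactors = FALSE
)

# Inner join on label: every row of a matching group is kept exactly once.
result <- merge(my_table, neg, by = "label")
print(result)
```

Because neg holds each label only once, the three "a" rows are not duplicated by the join, which is the same reason the SQL subselect should return distinct labels.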
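You can do this in R; you don't need an SQL package for it.
Sample data
difference <- c(1, -2, 3, -5)
labels <- c("a", "b", "c", "d")
df <- data.frame(difference, labels)
You can take a simple subset that selects the rows with a negative difference:
minus_df <- subset(df, difference < 0)
Finally, you create a list of labels (you could do this directly in the previous step, but it is better to check that the data is correct first):
m_labels <- minus_df$labels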
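The answer stops once the labels are collected. A possible last step, sketched here under the assumption that the goal is still every row of the affected groups, is to filter the original data frame with %in%. The extra fifth row is added only so that one label repeats; it is not in the answer's sample data.

```r
# Sample data from the answer, plus one hypothetical extra row so that
# label "b" appears twice (once negative, once positive).
difference <- c(1, -2, 3, -5, 4)
labels <- c("a", "b", "c", "d", "b")
df <- data.frame(difference, labels, stringsAsFactors = FALSE)

# Rows with a negative difference, and the labels they belong to.
minus_df <- subset(df, difference < 0)
m_labels <- unique(minus_df$labels)

# Keep every row whose label appears in m_labels, negative or not.
result <- df[df$labels %in% m_labels, ]
print(result)
```

Here the positive row for "b" is kept alongside its negative one, while "a" and "c" are dropped.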
Try the subset function:
# keeps only the rows whose difference is negative
df <- subset(df, sign(difference) == -1)
If your data frame is called df, then this should do the trick:
aux <- df$label[df$difference < 0]
df2 <- df[df$label %in% aux,]
aux contains all labels where df$difference < 0, so df2 contains all rows from df whose label is in aux. Of course, this can also be written as a single command:
df2 <- df[df$label %in% df$label[df$difference < 0],]
or
df <- df[df$label %in% df$label[df$difference < 0],]
Quick test:
> df
  label difference
1  test          2
2 test2          3
3 test2         -1
4 test3         -1
5 test4          4
6 test4          5
becomes the following df2:
> df2
  label difference
2 test2          3
3 test2         -1
4 test3         -1
As you can see, the row numbering is now off. This is fixed with row.names(df2) <- 1:NROW(df2):
> df2
  label difference
1 test2          3
2 test2         -1
3 test3         -1
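One caveat with real data such as the question's, where difference contains NA: indexing with a logical vector that holds NA puts NA entries into aux. A minimal sketch of one way to guard against that with which(), using made-up data; it also shows rownames(df2) <- NULL as an alternative way to renumber the rows:

```r
# Hypothetical data with a missing difference and a missing label.
df <- data.frame(
  label = c("test", "test2", "test2", "test3", NA),
  difference = c(2, 3, -1, NA, 5),
  stringsAsFactors = FALSE
)

# Plain logical indexing: the NA comparison produces an NA entry in aux_raw,
# which would then match the row whose label is NA.
aux_raw <- df$label[df$difference < 0]

# which() keeps only the positions where the comparison is TRUE.
aux <- df$label[which(df$difference < 0)]
df2 <- df[df$label %in% aux, ]

# Setting the row names to NULL resets them to 1, 2, ...
rownames(df2) <- NULL
print(df2)
```

With the question's data the labels themselves are never NA, so the stray NA in aux is harmless there; the guard only matters when labels can be missing too.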
Another possible approach is to use dplyr:
library(dplyr)
df %>% group_by(label) %>% filter(any(difference < 0))
#> # A tibble: 10 x 6
#> # Groups:   label [2]
#>    datetime   label              option_title           option_value  lead
#>    <date>     <chr>              <chr>                         <int> <int>
#>  1 2016-07-22 GE                 3 - Commercial Review             3     2
#>  2 2017-02-20 GE                 2 - Solution Review               2     1
#>  3 2017-02-20 GE                 1 - Opportunity Review            1     2
#>  4 2017-04-18 GE                 2 - Solution Review               2     3
#>  5 2017-04-19 GE                 3 - Commercial Review             3     4
#>  6 2017-04-19 GE                 4 - Submit Proposal               4     5
#>  7 2017-08-08 GE                 5 - Proposal Awarded              5    NA
#>  8 2016-07-22 BHP Billiton - APJ Pre-Qualification                 0     2
#>  9 2016-07-26 BHP Billiton - APJ 2 - Solution Review               2     0
#> 10 2016-07-26 BHP Billiton - APJ Pre-Qualification                 0     2
#> # ... with 1 more variables: difference <int>
Data
library(readr)
df <- read_csv("rowid, datetime, label, option_title, option_value, lead, difference
1, 2016-07-22, GE, 3 - Commercial Review, 3, 2, -1
2, 2017-02-20, GE, 2 - Solution Review, 2, 1, -1
3, 2017-02-20, GE, 1 - Opportunity Review, 1, 2, 1
4, 2017-04-18, GE, 2 - Solution Review, 2, 3, 1
5, 2017-04-19, GE, 3 - Commercial Review, 3, 4, 1
6, 2017-04-19, GE, 4 - Submit Proposal, 4, 5, 1
7, 2017-08-08, GE, 5 - Proposal Awarded, 5, NA, NA
8, 2016-08-02, HSBC, 5 - Proposal Awarded, 5, 6, 1
9, 2016-12-13, HSBC, 6 - Delivery Phase 1, 6, 7, 1
10, 2017-08-07, HSBC, 7 - Phase 1 Live, 7, NA, NA
11, 2016-07-22, Lowes, Pre-Qualification, 0, NA, NA
12, 2016-08-02, Danske Bank, 6 - Delivery Phase 1, 6, NA, NA
13, 2016-07-22, AP Moller Maersk (IT Transformation), 3 - Commercial Review, 3, NA, NA
14, 2016-07-22, BHP Billiton - APJ, Pre-Qualification, 0, 2, 2
15, 2016-07-26, BHP Billiton - APJ, 2 - Solution Review, 2, 0, -2
16, 2016-07-26, BHP Billiton - APJ, Pre-Qualification, 0, 2, 2")
df <- df[-1]
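The per-group test done by group_by() and filter(any(...)) can also be sketched in base R with ave(), for cases where dplyr is not available. The data frame below is a made-up cut-down of the question's data, and na.rm = TRUE is an assumption about how missing differences should be treated (they are ignored, so a group with only NA values is dropped).

```r
# Hypothetical cut-down data: only GE has a negative difference.
df <- data.frame(
  label = c("GE", "GE", "HSBC", "HSBC", "Lowes"),
  difference = c(-1, 1, 1, NA, NA),
  stringsAsFactors = FALSE
)

# For each label, is any difference negative? ave() recycles the single
# TRUE/FALSE per group back onto every row of that group.
keep <- ave(df$difference < 0, df$label,
            FUN = function(x) any(x, na.rm = TRUE))

df2 <- df[keep, ]
print(df2)
```

Unlike the dplyr pipeline, the result is an ordinary ungrouped data frame, so no ungroup() step is needed afterwards.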