Trying to improve my code - boolean operations on multiple columns - dataframe

Question

I'm working with data frames in R and trying to improve my coding skills. Here is the data frame I need to subset:

testdf<- data.frame(
  col1= c(
    paste("Ga", 1:3, sep = ''),
    paste("Gb", 1:3, sep = ''),
    paste("Gc", 1:3, sep = ''),
    paste("Gb", 1:3, sep = ''),
    paste("Ga", 1:3, sep = '')),
  col2 = c(
    paste("Gb", 4:6, sep = ''),
    paste("Ga", 1:3, sep = ''),
    paste("Ga", 1:3, sep = ''),
    paste("Gc", 1:3, sep = ''),
    paste("Ga", 4:6, sep = '')),
  stringsAsFactors = FALSE)

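Now, I only want to keep the comparisons between "Ga" and "Gb". As you can see, some rows compare "Ga" against "Gb" and others do the reverse. Either way, I want to keep them. Note that there are also within-group comparisons (i.e. Ga vs. Ga), which I want to discard. Also, in the real dataset the other groups (just "Gc" in this example) far outnumber the groups I want to keep.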

Here is my solution:

rbind(
  testdf[
    grepl(pattern = "Ga", x = testdf$col1) &
      grepl(pattern = "Gb", x = testdf$col2),],
  testdf[
    grepl(pattern = "Gb", x = testdf$col1) &
      grepl(pattern = "Ga", x = testdf$col2),])

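I'd like to know whether there is a more concise solution than performing two separate operations and then binding the results. It's no big deal, but I'm trying to clean up my act. I look forward to your feedback :)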

Answer 1

You can do this by subsetting the data frame once:

subset(testdf, grepl('Ga', col1) & grepl('Gb', col2) | 
               grepl('Gb', col1) & grepl('Ga', col2))


#  col1 col2
#1  Ga1  Gb4
#2  Ga2  Gb5
#3  Ga3  Gb6
#4  Gb1  Ga1
#5  Gb2  Ga2
#6  Gb3  Ga3
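The condition above relies on `&` binding more tightly than `|` in R, so it is evaluated as (A & B) | (C & D). As a small sketch, the snippet below rebuilds the example data and checks that adding explicit parentheses gives the same logical index; the variable names `implicit` and `explicit` are only illustrative.

```r
# Rebuild the example data from the question
testdf <- data.frame(
  col1 = c(paste0("Ga", 1:3), paste0("Gb", 1:3), paste0("Gc", 1:3),
           paste0("Gb", 1:3), paste0("Ga", 1:3)),
  col2 = c(paste0("Gb", 4:6), paste0("Ga", 1:3), paste0("Ga", 1:3),
           paste0("Gc", 1:3), paste0("Ga", 4:6)),
  stringsAsFactors = FALSE)

# Implicit grouping: relies on `&` having higher precedence than `|`
implicit <- with(testdf, grepl("Ga", col1) & grepl("Gb", col2) |
                         grepl("Gb", col1) & grepl("Ga", col2))

# Same condition with the grouping spelled out
explicit <- with(testdf, (grepl("Ga", col1) & grepl("Gb", col2)) |
                         (grepl("Gb", col1) & grepl("Ga", col2)))

identical(implicit, explicit)  # TRUE
which(explicit)                # 1 2 3 4 5 6
```

Writing the parentheses out is optional, but it may make the intent easier to read when the condition grows.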

Without using subset:

testdf[with(testdf, grepl('Ga', col1) & grepl('Gb', col2) | 
                    grepl('Gb', col1) & grepl('Ga', col2)),]
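As a quick sanity check, the one-step index can be compared against the two-step `rbind` approach from the question. This is only a sketch on the example data; `two_step` and `one_step` are illustrative names. On this data both approaches return the same rows in the same order, but in general the single index keeps the original row order, whereas `rbind` puts all Ga/Gb rows before all Gb/Ga rows.

```r
# Rebuild the example data from the question
testdf <- data.frame(
  col1 = c(paste0("Ga", 1:3), paste0("Gb", 1:3), paste0("Gc", 1:3),
           paste0("Gb", 1:3), paste0("Ga", 1:3)),
  col2 = c(paste0("Gb", 4:6), paste0("Ga", 1:3), paste0("Ga", 1:3),
           paste0("Gc", 1:3), paste0("Ga", 4:6)),
  stringsAsFactors = FALSE)

# Two-step approach from the question: filter each direction, then bind
two_step <- rbind(
  testdf[grepl("Ga", testdf$col1) & grepl("Gb", testdf$col2), ],
  testdf[grepl("Gb", testdf$col1) & grepl("Ga", testdf$col2), ])

# One-step approach: a single logical index covering both directions
one_step <- testdf[with(testdf, grepl("Ga", col1) & grepl("Gb", col2) |
                                grepl("Gb", col1) & grepl("Ga", col2)), ]

# Compare the column contents of the two results
identical(two_step$col1, one_step$col1) &&
  identical(two_step$col2, one_step$col2)
```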

The same logic using dplyr and stringr:

library(dplyr)
library(stringr)

testdf %>%  
   filter(str_detect(col1, 'Ga') & str_detect(col2, 'Gb') | 
          str_detect(col1, 'Gb') & str_detect(col2, 'Ga'))
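If the same filter is needed for several group pairs, the logic could be wrapped in a small function. The sketch below is one possible way to do that in base R, not part of the original answer: `keep_between` is a hypothetical helper, and it assumes every label is a group prefix followed by trailing digits (as in `Ga1`, `Gb12`). It compares the extracted group names exactly instead of using `grepl`, which matches the pattern anywhere in the string.

```r
# Rebuild the example data from the question
testdf <- data.frame(
  col1 = c(paste0("Ga", 1:3), paste0("Gb", 1:3), paste0("Gc", 1:3),
           paste0("Gb", 1:3), paste0("Ga", 1:3)),
  col2 = c(paste0("Gb", 4:6), paste0("Ga", 1:3), paste0("Ga", 1:3),
           paste0("Gc", 1:3), paste0("Ga", 4:6)),
  stringsAsFactors = FALSE)

# Hypothetical helper: keep rows that compare group `a` with group `b`,
# in either order. Assumes labels look like "<group><digits>".
keep_between <- function(df, a, b, cols = c("col1", "col2")) {
  # Strip the trailing digits to recover the group name of each label
  g1 <- sub("[0-9]+$", "", df[[cols[1]]])
  g2 <- sub("[0-9]+$", "", df[[cols[2]]])
  df[(g1 == a & g2 == b) | (g1 == b & g2 == a), , drop = FALSE]
}

keep_between(testdf, "Ga", "Gb")
```

On the example data this returns the same six rows as the answers above. If the real labels do not follow the prefix-plus-digits pattern, the `sub()` call would need to be adapted.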

Tags: r, subset, boolean-expression, dataframe