Finding and matching reversed strings efficiently in R

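I have a large number of strings (~280,000), all of the form "ABC12D/XYZ34A". In my data, each of these strings has an identical but reversed duplicate entry, e.g. "XYZ34A/ABC12D" for the example above. So my data looks like this: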

1    "ABC12D/XYZ34A"
2    "TUR44F/SWP29R"
3    "PLL93S/WQQ22F"
4    "YNV77C/AAZ05S"
5    "SWP29R/TUR44F"
6    "AAZ05S/YNV77C"
7    "CLK86G/ERF74Q"
8    "XYZ34A/ABC12D"
9    "ERF74Q/CLK86G"
10   "WQQ22F/PLL93S"

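Row 1 matches row 8, row 2 matches row 5, and so on.

My goal is to: 1) for a given string, find the position of its reversed entry and keep that index, and then 2) replace the reversed entry with the non-reversed one: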

1    "ABC12D/XYZ34A" 8
2    "TUR44F/SWP29R" 5
3    "PLL93S/WQQ22F" 10
4    "YNV77C/AAZ05S" 6
5    "TUR44F/SWP29R" 0
6    "YNV77C/AAZ05S" 0
7    "CLK86G/ERF74Q" 9
8    "ABC12D/XYZ34A" 0
9    "CLK86G/ERF74Q" 0
10   "PLL93S/WQQ22F" 0

Currently, I do this with a loop, in the following way:

df <- data.frame(c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F", 
"YNV77C/AAZ05S", "SWP29R/TUR44F", "AAZ05S/YNV77C", "CLK86G/ERF74Q", 
"XYZ34A/ABC12D", "ERF74Q/CLK86G", "WQQ22F/PLL93S"), stringsAsFactors = 
FALSE)

colnames(df) <- "entries"
df

# Reverse function
reverse.entry <- function(string) {
  string.reversed <- paste(rev(strsplit(string, "/")[[1]]), collapse = '/')
  string.reversed
}

duplicate.flag <- list() 
duplicate.idx <- list() 

# Find and replace reversed entries
for (i in 1:dim(df)[[1]]) {
  # current entry
  string = df[i,]

  # reverse the current entry
  string.reversed <- reverse.entry(string)

  # if any other entry matches the reversed string get match index 
  if (grepl(string.reversed, df)) {

    print(sprintf("%d found a reversal", i))
    idx <- which(df == string.reversed)
    duplicate.flag[i] <- 1;
    duplicate.idx[i] <- idx;
    # replace reversed strings with original strings
    df[idx,] <- string
  } else {
    duplicate.flag[i] <- 0;
    duplicate.idx[i] <- 0;
  }

}

data.frame(df, unlist(duplicate.idx), unlist(duplicate.flag))

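However, this is slow and takes hours. Is there a better way to program this? I'm fairly new to R and to programming, so I'm not very good at vectorization and the like. Since every entry has a reversed entry, I could also just loop over 1:dim(df)[[1]] / 2. Would that save much time?

Thanks a lot!

You could do it like this...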

df$no <- seq_along(df$entries) #number the entries
df$rev <- gsub("(.+)/(.+)","\\2/\\1",df$entries) #calculate reverse entries (backreferences need a doubled backslash in R strings)
df$whererev <- match(df$rev, df$entries) #identify where reversed entries occur
df$whererev[df$whererev>df$no] <- NA #remove the first of each duplicated pair 
df$entries[!is.na(df$whererev)] <- df$rev[!is.na(df$whererev)] #replace duplicates

df
   no       entries           rev whererev
1   1 ABC12D/XYZ34A XYZ34A/ABC12D       NA
2   2 TUR44F/SWP29R SWP29R/TUR44F       NA
3   3 PLL93S/WQQ22F WQQ22F/PLL93S       NA
4   4 YNV77C/AAZ05S AAZ05S/YNV77C       NA
5   5 TUR44F/SWP29R TUR44F/SWP29R        2
6   6 YNV77C/AAZ05S YNV77C/AAZ05S        4
7   7 CLK86G/ERF74Q ERF74Q/CLK86G       NA
8   8 ABC12D/XYZ34A ABC12D/XYZ34A        1
9   9 CLK86G/ERF74Q CLK86G/ERF74Q        7
10 10 PLL93S/WQQ22F PLL93S/WQQ22F        3

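Note that I have flagged the second duplicate rather than the first, as this makes it easier (and probably much quicker) to replace the second duplicate, rather than having to look it up from the first. (The fourth line would have < rather than > if you wanted to recreate the flagging of the first of each duplicate pair.)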
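As a rough sketch of that variant, the code below flags the first entry of each pair with the position of its reversed partner (0 otherwise) and then overwrites the later entry, which is the layout asked for in the question. The names `res`, `rev_entries`, `partner` and `is_first` are made up for illustration; `duplicate.idx` is borrowed from the question's loop. It assumes every string contains exactly one "/".

```r
# Sketch: mark the FIRST entry of each reversed pair with the position of its
# partner (0 otherwise), then overwrite the later entry with the earlier one.
entries <- c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F",
             "YNV77C/AAZ05S", "SWP29R/TUR44F", "AAZ05S/YNV77C",
             "CLK86G/ERF74Q", "XYZ34A/ABC12D", "ERF74Q/CLK86G",
             "WQQ22F/PLL93S")
res <- data.frame(entries = entries, stringsAsFactors = FALSE)

# swap the two halves around the slash
rev_entries <- sub("^(.+)/(.+)$", "\\2/\\1", res$entries)

# position of each entry's reversed twin (NA if there is none)
partner <- match(rev_entries, res$entries)

# keep the index only where the twin appears later in the data
is_first <- !is.na(partner) & partner > seq_along(partner)
res$duplicate.idx <- ifelse(is_first, partner, 0L)

# replace each later (reversed) entry with its earlier original
res$entries[partner[is_first]] <- res$entries[is_first]

res
```

Everything here is a single vectorised pass (`sub`, `match`, one indexed assignment), so there is no per-row loop.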

Here is my solution:

require(data.table)
get_index <- function(string,values,current_index){
  string_present <- match(string,values)
  string_present[string_present<current_index] <- 0
  return(string_present)
}

mydata <- c("ABC12D/XYZ34A","TUR44F/SWP29R","PLL93S/WQQ22F","YNV77C/AAZ05S","SWP29R/TUR44F","AAZ05S/YNV77C","CLK86G/ERF74Q","XYZ34A/ABC12D","ERF74Q/CLK86G","WQQ22F/PLL93S")
mydf <- data.table(mystring = mydata,stringsAsFactors = FALSE)
mydf[,revmystring:=gsub("(.+)/(.+)","\\2/\\1",mystring)] # "/" needs no escaping; backreferences need a doubled backslash
mydf[,duplicate_index:=get_index(revmystring,mystring,.I)]

The result it gives is:

> mydf
         mystring   revmystring duplicate_index
 1: ABC12D/XYZ34A XYZ34A/ABC12D               8
 2: TUR44F/SWP29R SWP29R/TUR44F               5
 3: PLL93S/WQQ22F WQQ22F/PLL93S              10
 4: YNV77C/AAZ05S AAZ05S/YNV77C               6
 5: SWP29R/TUR44F TUR44F/SWP29R               0
 6: AAZ05S/YNV77C YNV77C/AAZ05S               0
 7: CLK86G/ERF74Q ERF74Q/CLK86G               9
 8: XYZ34A/ABC12D ABC12D/XYZ34A               0
 9: ERF74Q/CLK86G CLK86G/ERF74Q               0
10: WQQ22F/PLL93S PLL93S/WQQ22F               0

You could also implement this without data.table.

Here is a proposal using outer and gsub:
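For instance, a minimal base-R sketch of the same idea might look like the following. It keeps the variable names from the data.table version and uses `seq_along()` in place of `.I`; treat it as an illustration rather than a tested drop-in replacement.

```r
# Sketch: the get_index() idea with plain vectors, no data.table required.
mydata <- c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F", "YNV77C/AAZ05S",
            "SWP29R/TUR44F", "AAZ05S/YNV77C", "CLK86G/ERF74Q", "XYZ34A/ABC12D",
            "ERF74Q/CLK86G", "WQQ22F/PLL93S")

# reversed form of every string
revmystring <- sub("(.+)/(.+)", "\\2/\\1", mydata)

# where does each reversed string occur? 0 when it does not occur at all
duplicate_index <- match(revmystring, mydata, nomatch = 0L)

# zero out matches that point backwards (the second member of each pair)
duplicate_index[duplicate_index < seq_along(mydata)] <- 0L

data.frame(mystring = mydata, revmystring, duplicate_index,
           stringsAsFactors = FALSE)
```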

## Create a matrix of correspondence o between elements and reverses
o = outer(df[,1],df[,1],function(x,y) gsub("(.*)/(.*)","\\2/\\1",y)==x)
o[upper.tri(o)] = FALSE
## Identify the indices of correspondence
df$ind = unlist(apply(o,2,function(x) which(x==T)[1]))
df$ind[is.na(df$ind)] = 0
## Replace reverses by originals
df[,1][df$ind[df$ind!=0]] = df[,1][df$ind!=0]

This returns:

        V1        ind
1  ABC12D/XYZ34A   8
2  TUR44F/SWP29R   5
3  PLL93S/WQQ22F  10
4  YNV77C/AAZ05S   6
5  TUR44F/SWP29R   0
6  YNV77C/AAZ05S   0
7  CLK86G/ERF74Q   9
8  ABC12D/XYZ34A   0
9  CLK86G/ERF74Q   0
10 PLL93S/WQQ22F   0
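One thing that may be worth checking before trying this on the full data: `outer()` materialises an n-by-n matrix, so its size grows with the square of the number of strings. A back-of-the-envelope estimate, assuming the ~280,000 strings mentioned in the question and R's 4 bytes per logical value, could be sketched as:

```r
n <- 280000          # number of strings mentioned in the question
cells <- n^2         # outer() builds an n x n logical matrix
bytes <- cells * 4   # R stores each logical value in 4 bytes

cells                # 7.84e10 cells
bytes / 1e9          # about 313.6 GB for the matrix alone
```

So this approach is probably best kept for small inputs, while the `match()`-based versions above work on vectors of length n.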