Finding and matching reversed strings efficiently in R
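I have a large number of strings (~280,000), all of the form "ABC12D/XYZ34A". Every string in my data has a duplicate entry with the two halves swapped, e.g. "XYZ34A/ABC12D" for the example above. So my data looks like this: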
1 "ABC12D/XYZ34A"
2 "TUR44F/SWP29R"
3 "PLL93S/WQQ22F"
4 "YNV77C/AAZ05S"
5 "SWP29R/TUR44F"
6 "AAZ05S/YNV77C"
7 "CLK86G/ERF74Q"
8 "XYZ34A/ABC12D"
9 "ERF74Q/CLK86G"
10 "WQQ22F/PLL93S"
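Row 1 matches row 8, row 2 matches row 5, and so on.
My goal is to 1) for a given string, find the position of its reversed entry and keep that index, and then 2) replace the reversed entry with the non-reversed one: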
1 "ABC12D/XYZ34A" 8
2 "TUR44F/SWP29R" 5
3 "PLL93S/WQQ22F" 10
4 "YNV77C/AAZ05S" 6
5 "TUR44F/SWP29R" 0
6 "YNV77C/AAZ05S" 0
7 "CLK86G/ERF74Q" 9
8 "ABC12D/XYZ34A" 0
9 "CLK86G/ERF74Q" 0
10 "PLL93S/WQQ22F" 0
Currently, I do this with a loop, as follows:
df <- data.frame(c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F",
                   "YNV77C/AAZ05S", "SWP29R/TUR44F", "AAZ05S/YNV77C",
                   "CLK86G/ERF74Q", "XYZ34A/ABC12D", "ERF74Q/CLK86G",
                   "WQQ22F/PLL93S"),
                 stringsAsFactors = FALSE)
colnames(df) <- "entries"
df

# Reverse function
reverse.entry <- function(string) {
  string.reversed <- paste(rev(strsplit(string, "/")[[1]]), collapse = '/')
  string.reversed
}

duplicate.flag <- list()
duplicate.idx <- list()

# Find and replace reversed entries
for (i in 1:dim(df)[[1]]) {
  # current entry
  string = df[i,]
  # reverse the current entry
  string.reversed <- reverse.entry(string)
  # if any other entry matches the reversed string get match index
  if (grepl(string.reversed, df)) {
    print(sprintf("%d found a reversal", i))
    idx <- which(df == string.reversed)
    duplicate.flag[i] <- 1;
    duplicate.idx[i] <- idx;
    # replace reversed strings with original strings
    df[idx,] <- string
  } else {
    duplicate.flag[i] <- 0;
    duplicate.idx[i] <- 0;
  }
}

data.frame(df, unlist(duplicate.idx), unlist(duplicate.flag))
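However, this is very slow and takes several hours. Is there a better way to program it? I'm fairly new to R and to programming, so I'm not good at vectorisation and the like. Since every entry has a reversed counterpart, I could also loop over only 1:dim(df)[[1]] / 2. Would that save a lot of time?
Many thanks!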
You could do something like this...
df$no <- seq_along(df$entries)                      # number the entries
df$rev <- gsub("(.+)/(.+)", "\\2/\\1", df$entries)  # calculate reverse entries
df$whererev <- match(df$rev, df$entries)            # identify where reversed entries occur
df$whererev[df$whererev > df$no] <- NA              # remove the first of each duplicated pair
df$entries[!is.na(df$whererev)] <- df$rev[!is.na(df$whererev)]  # replace duplicates
df
no entries rev whererev
1 1 ABC12D/XYZ34A XYZ34A/ABC12D NA
2 2 TUR44F/SWP29R SWP29R/TUR44F NA
3 3 PLL93S/WQQ22F WQQ22F/PLL93S NA
4 4 YNV77C/AAZ05S AAZ05S/YNV77C NA
5 5 TUR44F/SWP29R TUR44F/SWP29R 2
6 6 YNV77C/AAZ05S YNV77C/AAZ05S 4
7 7 CLK86G/ERF74Q ERF74Q/CLK86G NA
8 8 ABC12D/XYZ34A ABC12D/XYZ34A 1
9 9 CLK86G/ERF74Q CLK86G/ERF74Q 7
10 10 PLL93S/WQQ22F PLL93S/WQQ22F 3
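Note that I have marked the second member of each duplicate pair rather than the first, because that makes it easier (and probably much faster) to replace the second one without having to look it up from the first. (Line 4 would have < instead of > if you wanted to recreate your marking of the first in each duplicate pair.)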
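The `<` variant mentioned above could look roughly like the sketch below. It is not part of the answer as posted: the `df2` and `second` names are made up for illustration, and it assumes every entry has exactly one reversed partner. It keeps the partner's index on the first member of each pair, writes 0 on the second, and overwrites the second member's text with the first's, which is the layout asked for in the question.

```r
entries <- c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F", "YNV77C/AAZ05S",
             "SWP29R/TUR44F", "AAZ05S/YNV77C", "CLK86G/ERF74Q", "XYZ34A/ABC12D",
             "ERF74Q/CLK86G", "WQQ22F/PLL93S")
df2 <- data.frame(entries = entries, stringsAsFactors = FALSE)  # df2 is a hypothetical name

df2$no <- seq_along(df2$entries)                       # number the entries
df2$rev <- gsub("(.+)/(.+)", "\\2/\\1", df2$entries)   # swap the two halves
df2$whererev <- match(df2$rev, df2$entries)            # position of the reversed partner

# Flipped comparison: blank out the index on the SECOND member of each pair
df2$whererev[which(df2$whererev < df2$no)] <- NA

# Overwrite each second member with the text of its first member
is_first <- !is.na(df2$whererev)
second <- df2$whererev[is_first]
df2$entries[second] <- df2$entries[is_first]

# Use 0 instead of NA to match the layout in the question
df2$whererev[is.na(df2$whererev)] <- 0
df2[, c("entries", "whererev")]
```

Because the replacement now has to go from the first member to the second, it needs one extra indexing step (`second`), which is the extra lookup the note above refers to.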
Here is my solution:
require(data.table)

# Position of each string in `values`; matches that occur earlier than the
# current row are set to 0
get_index <- function(string, values, current_index) {
  string_present <- match(string, values)
  string_present[string_present < current_index] <- 0
  return(string_present)
}

mydata <- c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F", "YNV77C/AAZ05S",
            "SWP29R/TUR44F", "AAZ05S/YNV77C", "CLK86G/ERF74Q", "XYZ34A/ABC12D",
            "ERF74Q/CLK86G", "WQQ22F/PLL93S")
mydf <- data.table(mystring = mydata, stringsAsFactors = FALSE)
mydf[, revmystring := gsub("(.+)/(.+)", "\\2/\\1", mystring)]
mydf[, duplicate_index := get_index(revmystring, mystring, .I)]
The result it gives is:
> mydf
mystring revmystring duplicate_index
1: ABC12D/XYZ34A XYZ34A/ABC12D 8
2: TUR44F/SWP29R SWP29R/TUR44F 5
3: PLL93S/WQQ22F WQQ22F/PLL93S 10
4: YNV77C/AAZ05S AAZ05S/YNV77C 6
5: SWP29R/TUR44F TUR44F/SWP29R 0
6: AAZ05S/YNV77C YNV77C/AAZ05S 0
7: CLK86G/ERF74Q ERF74Q/CLK86G 9
8: XYZ34A/ABC12D ABC12D/XYZ34A 0
9: ERF74Q/CLK86G CLK86G/ERF74Q 0
10: WQQ22F/PLL93S PLL93S/WQQ22F 0
You can also implement it without data.table.
Here is a proposal using outer and gsub:
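As a rough illustration of that remark, the same helper can be called on plain vectors, with `seq_along()` standing in for data.table's `.I`. This sketch also makes one change that is not in the answer: it passes `nomatch = 0L` to `match()` so that an entry with no reversed partner gets 0 rather than NA. The last entry, "LONE01A/LONE02B", is invented purely to exercise that case.

```r
# Same idea as get_index above, but entries without a partner become 0
get_index <- function(string, values, current_index) {
  string_present <- match(string, values, nomatch = 0L)
  string_present[string_present < current_index] <- 0L
  string_present
}

mystring <- c("ABC12D/XYZ34A", "TUR44F/SWP29R", "SWP29R/TUR44F",
              "XYZ34A/ABC12D", "LONE01A/LONE02B")  # last entry is made up, no partner

revmystring <- sub("^(.+)/(.+)$", "\\2/\\1", mystring)
duplicate_index <- get_index(revmystring, mystring, seq_along(mystring))

result <- data.frame(mystring, revmystring, duplicate_index,
                     stringsAsFactors = FALSE)
result
```

If every entry is guaranteed to have a partner, as in the question, the `nomatch` argument makes no difference.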
## Create a matrix of correspondence o between elements and reverses
o = outer(df[,1], df[,1], function(x, y) gsub("(.*)/(.*)", "\\2/\\1", y) == x)
o[upper.tri(o)] = FALSE
## Identify the indices of correspondence
df$ind = unlist(apply(o, 2, function(x) which(x)[1]))
df$ind[is.na(df$ind)] = 0
## Replace reverses by originals
df[,1][df$ind[df$ind!=0]] = df[,1][df$ind!=0]
This returns:
V1 ind
1 ABC12D/XYZ34A 8
2 TUR44F/SWP29R 5
3 PLL93S/WQQ22F 10
4 YNV77C/AAZ05S 6
5 TUR44F/SWP29R 0
6 YNV77C/AAZ05S 0
7 CLK86G/ERF74Q 9
8 ABC12D/XYZ34A 0
9 CLK86G/ERF74Q 0
10 PLL93S/WQQ22F 0
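One caveat with `outer`: it builds an n-by-n matrix, so for the ~280,000 entries in the question that is about 7.8e10 cells, which is unlikely to fit in memory. The sketch below is a different technique, not the one used above. It gives both orientations of a pair the same order-independent key and then uses `match` and `duplicated`, so memory grows linearly with n. The names `left`, `right`, `key`, `first` and `canonical` are made up for illustration, and it assumes each pair occurs exactly twice.

```r
entries <- c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F", "YNV77C/AAZ05S",
             "SWP29R/TUR44F", "AAZ05S/YNV77C", "CLK86G/ERF74Q", "XYZ34A/ABC12D",
             "ERF74Q/CLK86G", "WQQ22F/PLL93S")

# Split each entry into its two halves
left  <- sub("/.*$", "", entries)
right <- sub("^.*/", "", entries)

# Order-independent key: put the smaller half first, so "A/B" and "B/A" agree
key <- ifelse(left > right, paste(right, left, sep = "/"), entries)

first     <- match(key, key)    # position of the first occurrence of each key
is_second <- duplicated(key)    # TRUE for the second member of each pair

# For the first member of a pair, store where its partner sits; 0 otherwise
ind <- integer(length(entries))
ind[first[is_second]] <- which(is_second)

# Every entry takes the text of the first member of its pair
canonical <- entries[first]

data.frame(entries = canonical, ind = ind, stringsAsFactors = FALSE)
```

On the ten example strings this gives the same `ind` column and the same replaced entries as the table above.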