How do I find differing words in two strings, sentence-wise?
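I am comparing two similar texts. x1 is the model text and x2 is the text with errors (e.g. spelling, new characters, etc.). I am trying to remove the words that are found in both texts. Since my actual text is not in English, I cannot use a dictionary.
What I have tried is stepping through each character of x1 and, if it is the same character in x2, deleting it from x2 and moving on to the next character of x1.
Code that I have been working on: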
x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let’s do this as soon as possible."
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley’s do ythis as soon as possiblke."
library(tidyverse)
# Split each text at whitespace that follows a full stop, then lower-case
x1 <- str_split(x1, "(?<=\\.)\\s")
x1 <- lapply(x1, tolower)
x2 <- str_split(x2, "(?<=\\.)\\s")
x2 <- lapply(x2, tolower)
delete_a_from_b <- function(a, b) {
  a_as_list <- str_remove_all(a, "word") %>%
    str_split(boundary("character")) %>% unlist
  b_n <- nchar(b)
  b_as_list <- str_remove_all(b, "word") %>%
    str_split(boundary("character")) %>% unlist
  previous_j <- 1
  # For each character of a, blank out its first match in b at or after previous_j
  for (i in 1:length(a_as_list)) {
    if (previous_j > length(b_as_list))
      break
    for (j in previous_j:length(b_as_list)) {
      if (a_as_list[[i]] == b_as_list[[j]]) {
        b_as_list[[j]] <- ""
        previous_j <- j + 1
        break
      }
    }
  }
  print(paste0(b_as_list, collapse = ""))
  paste0(b_as_list, collapse = "")
}
x3 <- delete_a_from_b(x1, x2)
x3 <- strsplit(x3, "\\s")
Output:
x3
[[1]]
[1] "text" "this" "i" "i" "d?am" "clueless" "thius" "coing.\","
[9] "\"ley’s" "dythsssoon" "as" "possibk"
The result I want is: 'text' 'this' 'id' 'thius' 'ley’s' 'ythis' 'possiblke'
I think I've got it; is this what you need?
x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let’s do this as soon as possible."
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley’s do ythis as soon as possiblke."
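# Lower-case first so that "This" and "this" compare equal
x1 <- tolower(x1)
x2 <- tolower(x2)
# Split each text into words on single spaces
x1_w <- strsplit(paste(x1, collapse = " "), " ")[[1]]
x2_w <- strsplit(paste(x2, collapse = " "), " ")[[1]]
`%notin%` <- Negate(`%in%`)
# Words of x2 that do not occur anywhere in x1
x2_w[which(x2_w %notin% x1_w)]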
# same as:
setdiff(x2_w,x1_w)
# out:
#> x2_w[which(x2_w %notin% x1_w)]
#[1] "text" "id" "thius" "ley’s" "ythis" "possiblke."
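One caveat on the "same as" comment: the two expressions agree for this input, but setdiff() also removes duplicates from its result, whereas the %in% filter keeps them. A small sketch with made-up vectors (a and b are illustrative only, not taken from the question):

```r
# Illustrative vectors: "tset" is a misspelling that occurs twice in b
a <- c("this", "is", "a", "test")
b <- c("this", "tset", "is", "a", "tset")

kept_duplicates <- b[!(b %in% a)]  # every element of b that is not in a
unique_only <- setdiff(b, a)       # the same elements, with duplicates removed

kept_duplicates
unique_only
```

If the number of times a misspelling occurs matters, the %in% form is probably the one to use.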
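I think you want to compare the two strings x1 and x2 sentence by sentence; this was not very clear in the question. The previous solution does not take this into account.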
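To illustrate what a whole-text comparison misses: the extra "this" at the start of the second sentence of x2 is not reported, because "this" also occurs elsewhere in x1. A short sketch using the strings from the question (the variable name whole_text is my own):

```r
x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let’s do this as soon as possible."
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley’s do ythis as soon as possiblke."

# Compare all words of x2 against all words of x1, ignoring sentence boundaries
x1_w <- strsplit(tolower(x1), " ", fixed = TRUE)[[1]]
x2_w <- strsplit(tolower(x2), " ", fixed = TRUE)[[1]]
whole_text <- setdiff(x2_w, x1_w)

"this" %in% whole_text  # FALSE: the inserted "this" goes unnoticed
```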
Try this:
First, split both strings into sentences:
x1_sentences <- unlist(strsplit(tolower(x1), split = "[.?!] "))
x2_sentences <- unlist(strsplit(tolower(x2), split = "[.?!] "))
length(x1_sentences) == length(x2_sentences) # Make sure same number of resulting sentences
Then, for each sentence, split the two vectors again into words and show the word differences:
for (i in seq_along(x1_sentences)) {
  x1_vector <- unlist(strsplit(x1_sentences[i], split = "[ ]"))
  x2_vector <- unlist(strsplit(x2_sentences[i], split = "[ ]"))
  print(setdiff(x2_vector, x1_vector)) # The order here is important!
}
This gives (you can easily convert it into a new vector):
[1] "text"
[1] "this"
[1] "id"
[1] "thius"
[1] "ley’s" "ythis" "possiblke."
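The per-sentence differences can also be gathered into a single vector instead of being printed. This is a minimal sketch, assuming the same x1 and x2 as in the question; the helper name sentence_diff and the step that strips sentence-final punctuation (so that "possiblke." becomes "possiblke") are my own additions, not part of the answer above:

```r
x1 <- "This is a test. Weather is fine. What do I do? I am clueless this coding. Let’s do this as soon as possible."
x2 <- "This text is a test. This weather is fine. What id I do? I am clueless thius coding. Ley’s do ythis as soon as possiblke."

# Hypothetical helper: words in each sentence of b that are absent
# from the sentence at the same position in a.
sentence_diff <- function(a, b) {
  split_sentences <- function(x) unlist(strsplit(tolower(x), split = "[.?!] "))
  split_words <- function(s) {
    words <- unlist(strsplit(s, split = " ", fixed = TRUE))
    # Drop trailing sentence punctuation left on the last word
    sub("[.?!]+$", "", words)
  }
  a_sentences <- split_sentences(a)
  b_sentences <- split_sentences(b)
  # The positional comparison only makes sense if the counts agree
  stopifnot(length(a_sentences) == length(b_sentences))
  diffs <- Map(function(sa, sb) setdiff(split_words(sb), split_words(sa)),
               a_sentences, b_sentences)
  unlist(diffs, use.names = FALSE)
}

x3 <- sentence_diff(x1, x2)
x3
```

For the example strings this should return the seven words listed in the question as the desired result. Note that it relies on both texts having the same number of sentences in the same order; an inserted or deleted sentence would shift every comparison after it.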