R: Find similar sentences in texts
I've run into a problem and I am struggling to find a solution or an approach to solve it.
I have some model sentences, for example
model_sentences = data.frame("model_id" = c("model_id_1", "model_id_2"),
                             "model_text" = c("Company x had 3000 employees in 2016.",
                                              "Google makes 300 dollar in revenue in 2018."))
and some texts
data = data.frame("id" = c("id1", "id2"),
                  "text" = c("Company y is expected to employ 2000 employees in 2020. This is an increase of 10%. Some stupid sentences.",
                             "Amazon´s revenue is 400 dollar in 2020. That is twice as much as last year."))
I want to extract the sentences from those texts that are similar to the model sentences.
Something like this would be my desired result:
result = data.frame("id" = c("id1", "id2"),
                    "model_id" = c("model_id_1", "model_id_2"),
                    "sentence_from_data" = c("Company y is expected to employ 2000 employees in 2020.",
                                             "Amazon´s revenue is 400 dollar in 2020."),
                    "score" = c(0.5, 0.4))
Maybe some kind of 'similarity_score' could be calculated as well.
I split the texts into sentences with this function:
# requires the stringi package for stri_trim_both()
library(stringi)

split_by_sentence <- function(text) {
  # split after ., ! or ? when preceded by at least 4 alphanumeric characters
  result <- unlist(strsplit(text, "(?<=[[:alnum:]]{4}[?!.])\\s+", perl = TRUE))
  result <- stri_trim_both(result)
  result <- result[nchar(result) > 0]
  if (length(result) == 0)
    result <- ""
  return(result)
}
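For example, applied to the second text it gives one sentence per element (a quick check; the as.character() is only there in case the text column is stored as a factor):

sentences_per_text <- lapply(as.character(data$text), split_by_sentence)
sentences_per_text[[2]]
#> [1] "Amazon´s revenue is 400 dollar in 2020."
#> [2] "That is twice as much as last year."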
But I don't know how to compare each sentence to the model sentences. I'd appreciate any suggestions.
Have a look at the stringdist package.
Example:
library(stringdist)

mysent <- "This is a sentence"

# Jaccard distance between mysent and each model sentence
apply(model_sentences, 1, function(row) {
  stringdist(row['model_text'], mysent, method = "jaccard")
})
It returns the Jaccard distance from mysent to each model_text value. The smaller the value, the more similar the two sentences are under the chosen distance metric.
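To get all the way to the result table from the question, one possible way to combine split_by_sentence() with stringdist() is the rough sketch below. It assumes the data frames and the split_by_sentence() function defined above; the as.character() calls only guard against factor columns, and matches is just an intermediate name used here for illustration.

library(stringdist)
library(stringi)   # needed by split_by_sentence()

# score every sentence of every text against every model sentence
matches <- do.call(rbind, lapply(seq_len(nrow(data)), function(i) {
  sentences <- split_by_sentence(as.character(data$text[i]))
  do.call(rbind, lapply(seq_len(nrow(model_sentences)), function(j) {
    data.frame(
      id                 = as.character(data$id[i]),
      model_id           = as.character(model_sentences$model_id[j]),
      sentence_from_data = sentences,
      score              = stringdist(as.character(model_sentences$model_text[j]),
                                      sentences, method = "jaccard"),
      stringsAsFactors   = FALSE
    )
  }))
}))

# keep, for each text, the sentence/model pair with the smallest distance
result <- do.call(rbind, lapply(split(matches, matches$id),
                                function(d) d[which.min(d$score), ]))
result

Here the lowest Jaccard distance is treated as the similarity score; if you prefer a score where higher means more similar, 1 - score works as well, since the Jaccard distance lies between 0 and 1.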