将 "String matching to estimate similarity" 应用于数据框
Applying "String matching to estimate similarity" to data frame
String matching to estimate similarity
上面的代码正是我要找的,除了我似乎无法弄清楚如何比较数据框中列之间的字符串("correct" 答案和 "given" 答案)然后将 sim.per 的输出存储为同一数据框中的新列 ("similarity")。我试过了。例如,
df$similarity <- sim.per(df$answer, df$given)
df$similarity <- mapply(sim.per, df$answer, df$given)
后者在行为空时也会导致错误,这在我的数据集中是可以接受的,应该计算为 0。
Error in str2[[1]] : subscript out of bounds
预期输出应为:
answer given similarity
1 Best way to waste money Instrument to waste money and time 0.6
2 Roy travels to Africa He is in Africa 0.25
3 I go to work 0
如有任何帮助,我们将不胜感激!谢谢!
数据子集:
df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L, 7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money", "He ran out of money, so he had to stop playing poker", "I go to work", "Lets all be unique together until we realise we are all the same", "Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet", "She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do", "The stranger officiates the meal", "We have a lot of rain in June"), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L, 2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker", "Instrument to waste money and time", "It was raining in June", "People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it", "She did her very best to help him out", "Students know not to cheat", "The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA, -10L))
您可以执行此操作的一种方法是使用 for 循环并遍历数据框中的每一行,以使用其他线程中的函数计算相似性百分比。
df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L,
7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money",
"He ran out of money, so he had to stop playing poker", "I go to work",
"Lets all be unique together until we realise we are all the same",
"Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet",
"She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do",
"The stranger officiates the meal", "We have a lot of rain in June"
), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L,
2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker",
"Instrument to waste money and time", "It was raining in June",
"People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it",
"She did her very best to help him out", "Students know not to cheat",
"The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
format <- function(string1){ #removing the information from the string which presumably isn't important (punctuation, capital letters. then splitting all the words into separate strings)
lower <- tolower(string1)
no.punct <- gsub("[[:punct:]]", "", lower)
split <- strsplit(no.punct, split=" ")
return(split)
}
sim.per <- function(str1, str2, ...){#how similar is string 1 to string 2. NOTE: the order is important, ie. sim.per(b,c) is different from sim.per(c,b)
sim <- length(intersect(str1[[1]], str2[[1]]))#intersect function counts the common strings
total <- length(str1[[1]])
per <- sim/total
return(per)
}
df$similarity <- 0
for (i in seq_len(nrow(df))) {
if (!is.na(df$answer[i]) | !is.na(df$given[i])) {
df$similarity[i] <- sim.per(format(df$answer[i]), format(df$given[i]))
}
}
df
trial answer given similarity
1 1 The stranger officiates the meal The guests ate the meal 0.4000000
2 2 He ran out of money, so he had to stop playing poker He lost money because he had played poker 0.3333333
3 3 Lets all be unique together until we realise we are all the same People are unique until they try to fit in 0.2307692
4 4 She did her best to help him She did her very best to help him out 1.0000000
5 5 Best way to waste money Instrument to waste money and time 0.6000000
6 6 Roy travels to Africa He is in Africa Roy 0.5000000
7 7 I go to work 0.0000000
8 8 She borrowed the book from him many years ago and did not returned it yet She borrowed the book from the library and forgot to return it 0.4666667
9 9 Students did not cheat on the test, for it was not the right thing to do Students know not to cheat 0.2500000
10 10 We have a lot of rain in June It was raining in June 0.2500000
这是一个使用 tidyverse
语法的示例,可以避免手动循环并使事情变得更简洁并且可能更快。特别是,格式步骤是矢量化的,因此只有分数计算需要迭代。
library(tidyverse)
df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L, 7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money", "He ran out of money, so he had to stop playing poker", "I go to work", "Lets all be unique together until we realise we are all the same", "Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet", "She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do", "The stranger officiates the meal", "We have a lot of rain in June"), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L, 2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker", "Instrument to waste money and time", "It was raining in June", "People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it", "She did her very best to help him out", "Students know not to cheat", "The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA, -10L))
format_str <- function(string) {
string %>%
str_to_lower %>%
str_remove_all("[:punct:]") %>%
str_squish %>%
str_split(" ")
}
df %>%
mutate(
similarity = map2_dbl(
.x = format_str(answer),
.y = format_str(given),
.f = ~ length(intersect(.x, .y)) / length(.x)
)
) %>%
as_tibble
#> # A tibble: 10 x 4
#> trial answer given similarity
#> <int> <fct> <fct> <dbl>
#> 1 1 The stranger officiates the ~ The guests ate the meal 0.4
#> 2 2 He ran out of money, so he h~ He lost money because h~ 0.333
#> 3 3 Lets all be unique together ~ People are unique until~ 0.231
#> 4 4 She did her best to help him She did her very best t~ 1
#> 5 5 Best way to waste money Instrument to waste mon~ 0.6
#> 6 6 Roy travels to Africa He is in Africa Roy 0.5
#> 7 7 I go to work "" 0
#> 8 8 She borrowed the book from h~ She borrowed the book f~ 0.467
#> 9 9 Students did not cheat on th~ Students know not to ch~ 0.25
#> 10 10 We have a lot of rain in June It was raining in June 0.25
由 reprex package (v0.2.0) 创建于 2018-08-17。
String matching to estimate similarity
上面的代码正是我要找的,除了我似乎无法弄清楚如何比较数据框中列之间的字符串("correct" 答案和 "given" 答案)然后将 sim.per 的输出存储为同一数据框中的新列 ("similarity")。我试过了。例如,
df$similarity <- sim.per(df$answer, df$given)
df$similarity <- mapply(sim.per, df$answer, df$given)
后者在行为空时也会导致错误,这在我的数据集中是可以接受的,应该计算为 0。
Error in str2[[1]] : subscript out of bounds
预期输出应为:
answer given similarity
1 Best way to waste money Instrument to waste money and time 0.6
2 Roy travels to Africa He is in Africa 0.25
3 I go to work 0
如有任何帮助,我们将不胜感激!谢谢!
数据子集:
df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L, 7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money", "He ran out of money, so he had to stop playing poker", "I go to work", "Lets all be unique together until we realise we are all the same", "Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet", "She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do", "The stranger officiates the meal", "We have a lot of rain in June"), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L, 2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker", "Instrument to waste money and time", "It was raining in June", "People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it", "She did her very best to help him out", "Students know not to cheat", "The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA, -10L))
您可以执行此操作的一种方法是使用 for 循环并遍历数据框中的每一行,以使用其他线程中的函数计算相似性百分比。
df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L,
7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money",
"He ran out of money, so he had to stop playing poker", "I go to work",
"Lets all be unique together until we realise we are all the same",
"Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet",
"She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do",
"The stranger officiates the meal", "We have a lot of rain in June"
), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L,
2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker",
"Instrument to waste money and time", "It was raining in June",
"People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it",
"She did her very best to help him out", "Students know not to cheat",
"The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA,
-10L))
format <- function(string1){ #removing the information from the string which presumably isn't important (punctuation, capital letters. then splitting all the words into separate strings)
lower <- tolower(string1)
no.punct <- gsub("[[:punct:]]", "", lower)
split <- strsplit(no.punct, split=" ")
return(split)
}
sim.per <- function(str1, str2, ...){#how similar is string 1 to string 2. NOTE: the order is important, ie. sim.per(b,c) is different from sim.per(c,b)
sim <- length(intersect(str1[[1]], str2[[1]]))#intersect function counts the common strings
total <- length(str1[[1]])
per <- sim/total
return(per)
}
df$similarity <- 0
for (i in seq_len(nrow(df))) {
if (!is.na(df$answer[i]) | !is.na(df$given[i])) {
df$similarity[i] <- sim.per(format(df$answer[i]), format(df$given[i]))
}
}
df
trial answer given similarity
1 1 The stranger officiates the meal The guests ate the meal 0.4000000
2 2 He ran out of money, so he had to stop playing poker He lost money because he had played poker 0.3333333
3 3 Lets all be unique together until we realise we are all the same People are unique until they try to fit in 0.2307692
4 4 She did her best to help him She did her very best to help him out 1.0000000
5 5 Best way to waste money Instrument to waste money and time 0.6000000
6 6 Roy travels to Africa He is in Africa Roy 0.5000000
7 7 I go to work 0.0000000
8 8 She borrowed the book from him many years ago and did not returned it yet She borrowed the book from the library and forgot to return it 0.4666667
9 9 Students did not cheat on the test, for it was not the right thing to do Students know not to cheat 0.2500000
10 10 We have a lot of rain in June It was raining in June 0.2500000
这是一个使用 tidyverse
语法的示例,可以避免手动循环并使事情变得更简洁并且可能更快。特别是,格式步骤是矢量化的,因此只有分数计算需要迭代。
library(tidyverse)
df <- structure(list(trial = 1:10, answer = structure(c(9L, 2L, 4L, 7L, 1L, 5L, 3L, 6L, 8L, 10L), .Label = c("Best way to waste money", "He ran out of money, so he had to stop playing poker", "I go to work", "Lets all be unique together until we realise we are all the same", "Roy travels to Africa", "She borrowed the book from him many years ago and did not returned it yet", "She did her best to help him", "Students did not cheat on the test, for it was not the right thing to do", "The stranger officiates the meal", "We have a lot of rain in June"), class = "factor"), given = structure(c(10L, 3L, 6L, 8L, 4L, 2L, 1L, 7L, 9L, 5L), .Label = c("", "He is in Africa Roy", "He lost money because he had played poker", "Instrument to waste money and time", "It was raining in June", "People are unique until they try to fit in", "She borrowed the book from the library and forgot to return it", "She did her very best to help him out", "Students know not to cheat", "The guests ate the meal"), class = "factor")), class = "data.frame", row.names = c(NA, -10L))
format_str <- function(string) {
string %>%
str_to_lower %>%
str_remove_all("[:punct:]") %>%
str_squish %>%
str_split(" ")
}
df %>%
mutate(
similarity = map2_dbl(
.x = format_str(answer),
.y = format_str(given),
.f = ~ length(intersect(.x, .y)) / length(.x)
)
) %>%
as_tibble
#> # A tibble: 10 x 4
#> trial answer given similarity
#> <int> <fct> <fct> <dbl>
#> 1 1 The stranger officiates the ~ The guests ate the meal 0.4
#> 2 2 He ran out of money, so he h~ He lost money because h~ 0.333
#> 3 3 Lets all be unique together ~ People are unique until~ 0.231
#> 4 4 She did her best to help him She did her very best t~ 1
#> 5 5 Best way to waste money Instrument to waste mon~ 0.6
#> 6 6 Roy travels to Africa He is in Africa Roy 0.5
#> 7 7 I go to work "" 0
#> 8 8 She borrowed the book from h~ She borrowed the book f~ 0.467
#> 9 9 Students did not cheat on th~ Students know not to ch~ 0.25
#> 10 10 We have a lot of rain in June It was raining in June 0.25
由 reprex package (v0.2.0) 创建于 2018-08-17。