Finding matches for multiple words with stringdist
I have the test data below. I am trying to use stringdist to find (approximate) matches for a vector of words, because the actual database is large:
library(stringdist)
test_data <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000,
2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000,
2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001,
2001, 2002, 2002, 2002), Municipality = c("Some", "Anything",
"Nothing", "Someth.", "Anything", "Not", "Something", "Anything",
"None", "Some", "Anything", "Nothing", "Someth.", "Anything",
"Not", "Something", "Anything", "None", "Some", "Anything", "Nothing",
"Someth.", "Anything", "Not", "Something", "Anything", "None"
), `Other Values` = c(0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01)), row.names = c(NA, -27L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 27 x 4
Province Year Municipality `Other Values`
<dbl> <dbl> <chr> <dbl>
1 1 2000 Some 0.41
2 1 2000 Anything 0.42
3 1 2000 Nothing 0.34
4 1 2001 Someth. 0.47
5 1 2001 Anything 0.0600
6 1 2001 Not 0.8
7 1 2002 Something 0.14
8 1 2002 Anything 0.15
9 1 2002 None 0.01
10 2 2000 Some 0.41
# ... with 17 more rows
I tried running:
test_match_out <- amatch(c("Anything","Something"),test_data[,3],maxDist=2)
EDIT:
Following zx8754's comment, I also tried:
test_match_out <- amatch(c("Anything","Something"),test_data[[3]],maxDist=2)
and:
test_match_out <- amatch(c("Anything","Something"),test_data$Municipality,maxDist=2)
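I was under the impression that the amatch call above would return something like a vector of indices of the matches. Instead it just gives me a vector of two NA values. Am I misunderstanding what amatch does, or is the syntax wrong?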
I would like to get the values of the rows that amatch matches, together with the matched words.
Desired output:
test_data_2 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000,
2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000,
2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001,
2001, 2002, 2002, 2002), Municipality = c("Some", "Anything",
"Nothing", "Someth.", "Anything", "Not", "Something", "Anything",
"None", "Some", "Anything", "Nothing", "Someth.", "Anything",
"Not", "Something", "Anything", "None", "Some", "Anything", "Nothing",
"Someth.", "Anything", "Not", "Something", "Anything", "None"
), `Other Values` = c(0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001,
0.8, 0.14, 0.15, 0.01), `Matched Values` = c(NA, 0.42, NA, NA, 0.06000,
NA, 0.14, 0.15, NA, NA, 0.42, NA, NA, 0.0600000000000001,
NA, 0.14, 0.15, NA, NA, 0.42, NA, NA, 0.0600000000000001,
NA, 0.14, 0.15, NA), `Matched Words` = c(NA, "Anything", NA, NA, "Anything",
NA, "Something", "Anything", NA, NA, "Anything", NA, NA, "Anything",
NA, "Something", "Anything", NA, NA, "Anything", NA, NA, "Anything",
NA, "Something", "Anything", NA)), row.names = c(NA, -27L), class = c("tbl_df",
"tbl", "data.frame"))
Get the indices of the matches, then update all matching rows:
ix <- amatch(c("Anything","Something"), test_data[[ 3 ]], maxDist = 2)
# [1] 2 7
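A note on why the call above uses test_data[[ 3 ]]: on a tibble, test_data[, 3] keeps the result as a one-column tibble (a list), while [[ 3 ]] and $Municipality return the character vector itself. My understanding is that amatch coerces its table argument to character, so a list column collapses into a single deparsed string that matches nothing. The sketch below imitates that with an explicit as.character call; as_list and municipality are illustrative names, not part of the original data.

```r
library(stringdist)

# First rows of the Municipality column, as a plain character vector
municipality <- c("Some", "Anything", "Nothing", "Someth.",
                  "Anything", "Not", "Something")

# Stand-in for test_data[, 3]: a list holding the column (assumption:
# this mirrors what a one-column tibble looks like to as.character)
as_list <- list(Municipality = municipality)

# Coercing the list gives ONE long string, not seven words
collapsed <- as.character(as_list)
length(collapsed)

# Matching against the collapsed string finds nothing within maxDist = 2
from_list <- amatch(c("Anything", "Something"), collapsed, maxDist = 2)

# Matching against the real vector finds the first closest entry per word
from_vector <- amatch(c("Anything", "Something"), municipality, maxDist = 2)
```

Here from_list holds only NA values, while from_vector holds the positions of "Anything" and "Something" in the vector.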
ifelse(test_data$Municipality %in% test_data$Municipality[ ix ],
test_data$`Other Values`, NA)
# [1] NA 0.42 NA NA 0.06 NA 0.14 0.15 NA NA 0.42
# [12] NA NA 0.06 NA 0.14 0.15 NA NA 0.42 NA NA
# [23] 0.06 NA 0.14 0.15 NA
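To also get the matched word for each row, one option is to turn the lookup around and match every Municipality against the word list. This is only a sketch on the first nine rows of the data; words, hit, matched_words and matched_values are names made up for the illustration, and it relies on amatch's default method (optimal string alignment) with maxDist = 2.

```r
library(stringdist)

# Words to look for
words <- c("Anything", "Something")

# First nine rows (Province 1) of the test data
municipality <- c("Some", "Anything", "Nothing", "Someth.", "Anything",
                  "Not", "Something", "Anything", "None")
other_values <- c(0.41, 0.42, 0.34, 0.47, 0.06, 0.8, 0.14, 0.15, 0.01)

# For every row: index into `words` of the closest word within maxDist,
# or NA when no word is close enough
hit <- amatch(municipality, words, maxDist = 2)

# Indexing with NA yields NA, so unmatched rows stay NA
matched_words  <- words[hit]
matched_values <- ifelse(is.na(hit), NA, other_values)

data.frame(municipality, other_values, matched_values, matched_words)
```

On the full data the same two lines can be assigned to new columns of test_data, for example with test_data$`Matched Words` <- words[hit]. Raising maxDist to 3 would also pull in "Someth." and "Nothing", so the threshold is worth checking against real data.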