Automatically extracting strings with mismatched spellings from a column and replacing them in R

Question

I have a huge dataset with columns similar to the ones posted below:

NameofEmployee <- c("x", "y", "z", "a")
Region <- c("Pune", "Orissa", "Orisa", "Poone")

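As you can see, in the Region column the region "Pune" has been spelt in two different ways, i.e. "Pune" and "Poone".

Similarly, "Orissa" has been spelt as "Orissa" and "Orisa".

I have multiple regions that are actually the same but have been spelt in different ways. This causes problems when I analyse the data.

I want to be able to automatically get a list of these mismatched spellings with the help of R.
I would also like to automatically replace the wrong spellings with the correct ones.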

Answer 1

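I think you should use a phonetic code to determine which spellings are close to which.

The soundex algorithm is a good choice, and it is implemented in several R packages. I will use the package stringdist.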

library(stringdist)

Region <- c("Pune", "Orissa", "Orisa", "Poone")
phonetic(Region)
#[1] "P500" "O620" "O620" "P500"

As you can see, Region[1] and Region[4] have the same soundex code. The same is true of Region[2] and Region[3].
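
To go from matching codes to an actual list of mismatches and a replacement, one possible approach is to group the values by their soundex code and pick one spelling per group. The sketch below is only an illustration: the repeated entries in Region are added here so that a "most frequent" spelling exists, and treating the most frequent spelling as the correct one is an assumption that may not hold for your data.

```r
library(stringdist)

# Example data; the two repeated entries are hypothetical, added for illustration
Region <- c("Pune", "Orissa", "Orisa", "Poone", "Pune", "Orissa")

# Soundex code for each value (soundex is the default method of phonetic())
codes <- phonetic(Region)

# Distinct spellings sharing a code; keep only codes with more than one spelling
variants <- lapply(split(Region, codes), unique)
mismatches <- variants[lengths(variants) > 1]

# Assumption: the most frequent spelling in each group is the correct one
# (on a tie, which.max() simply returns the first candidate)
canonical <- vapply(split(Region, codes),
                    function(x) names(which.max(table(x))),
                    character(1))

# Replace every value by the chosen spelling for its code
Region_clean <- unname(canonical[codes])
```

Here mismatches is a list with one element per soundex code that has several spellings, so it can be reviewed by hand before the replacement is trusted. Soundex can also group names that are genuinely different, so a manual check of the groups is probably worthwhile.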

Answer 2

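Spelling mistakes are hard to spot, and they happen more often when dealing with names.

I suggest using a string distance to measure how close two words are. You can do this easily with tidystringdist, which lets you get all pairwise combinations from a vector and then run all the available string distance methods from stringdist: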

Region <- c("Pune", "Orissa", "Orisa", "Poone")

library(tidystringdist)
library(magrittr)

tidy_comb_all(Region) %>%
  tidy_stringdist()
#> # A tibble: 6 x 12
#>   V1     V2      osa    lv    dl hamming   lcs qgram cosine jaccard     jw
#> * <chr>  <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl>  <dbl>
#> 1 Pune   Oris…     6     6     6     Inf    10    10 1          1   1     
#> 2 Pune   Orisa     5     5     5     Inf     9     9 1          1   1     
#> 3 Pune   Poone     2     2     2     Inf     3     3 0.433      0.4 0.217 
#> 4 Orissa Orisa     1     1     1     Inf     1     1 0.0513     0   0.0556
#> 5 Orissa Poone     6     6     6     Inf    11    11 1          1   1     
#> 6 Orisa  Poone     5     5     5       5    10    10 1          1   1     
#> # ... with 1 more variable: soundex <dbl>

Created on 2018-07-24 by the reprex package (v0.2.0).

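As you can see, Pune and Poone have osa, lv, and dl distances of 2, while Orisa / Orissa have a distance of 1, which indicates that their spellings are very close.

Once you have identified these, you can do the replacement.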
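
The replacement step could look roughly like the sketch below. It uses base R's adist() (Levenshtein distance) instead of stringdist so that it runs without extra packages. The vector correct and the threshold max_dist are hypothetical: they stand in for a reference list of accepted spellings and a cut-off that you would have to choose yourself.

```r
Region <- c("Pune", "Orissa", "Orisa", "Poone")

# Hypothetical reference list of accepted spellings (an assumption, not derived from the data)
correct <- c("Pune", "Orissa")

# Levenshtein distance from every observed value (rows) to every accepted spelling (columns)
d <- adist(Region, correct)

# Index of the closest accepted spelling for each value, and the distance to it
nearest <- apply(d, 1, which.min)
nearest_dist <- d[cbind(seq_along(Region), nearest)]

# Assumed threshold: only replace when the closest spelling is at most 2 edits away
max_dist <- 2
Region_fixed <- ifelse(nearest_dist <= max_dist, correct[nearest], Region)

# List of the values that were changed, for review
changed <- Region != Region_fixed
mismatches <- data.frame(original = Region[changed],
                         replacement = Region_fixed[changed])
```

With the example data, "Orisa" is mapped to "Orissa" and "Poone" to "Pune". A threshold that is too generous will merge regions that are really different, so it is probably safer to inspect mismatches before overwriting the column.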
