Replace duplicates in matrix

Question

I have the following test code:

####TESTING HERE
test = tibble::tribble(
                          ~Name1,           ~Name2,          ~Name3,
                   "Paul Walker",    "Paule Walkr",   "Heiko Knaup",
                "Ferdinand Bass", "Ferdinand Base", "Michael Herre"
                )

library(stringdist)
output <- list()
for (row in 1:nrow(test)) 
{
  codephon = phonetic(test[row,], method = c("soundex"), useBytes = FALSE)
  output[[row]] <- codephon
}

#building the matrix with soundex input
phoneticmatrix = matrix(output)
soundexspalten=str_split_fixed(phoneticmatrix, ",", 3)
#> Error in str_split_fixed(phoneticmatrix, ",", 3): could not find function "str_split_fixed"
soundexmatrix0 = gsub('[()c"]', '', soundexspalten)
#> Error in gsub("[()c\"]", "", soundexspalten): object 'soundexspalten' not found
soundexmatrix1 = gsub("0000", "", soundexmatrix0)
#> Error in gsub("0000", "", soundexmatrix0): object 'soundexmatrix0' not found

Created on 2021-06-03 by the reprex package (v2.0.0)

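Now I would like to replace all duplicates in soundexmatrix1 with the string "DUPLICATE", so that the dimensions of the matrix stay the same and all duplicates are visible at a glance.

Is there a way to do this? Thanks for your help!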

Answer 1

To check for duplicates within each row (see Update), this should do what you want, and more concisely:

# Feel free to load the packages you're using.
# library(stringdist)
# library(tibble)

test <- tibble::tribble(
  ~Name1,           ~Name2,           ~Name3,
  "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
  "Ferdinand Bass", "Ferdinand Base", "Michael Herre"
)

# Get phonetic codes cleanly.
result <- as.matrix(apply(X = test, MARGIN = 2,
                          FUN = stringdist::phonetic, method = c("soundex"), useBytes = FALSE))

# Find all blank codes ("0000").
blanks <- result == "0000"

# # Find all duplicates, as compared across ENTIRE matrix; ignore blank codes.
# all_duplicates <- !blanks & duplicated(result, MARGIN = 0)

# Find duplicates, as compared within EACH ROW; ignore blank codes.
row_duplicates <- !blanks & t(apply(X = result, MARGIN = 1, FUN = duplicated))

# Replace blank codes ("0000") with blanks (""); and replace duplicates (found
# within rows) with "DUPLICATE".
result[blanks] <- ""
result[row_duplicates] <- "DUPLICATE"

# View result.
result

result should be the following matrix:

     Name1  Name2       Name3 
[1,] "P442" "DUPLICATE" "H225"
[2,] "F635" "DUPLICATE" "M246"
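
One detail in the code above that is easy to miss is the t() around apply(). The sketch below illustrates why it seems to be needed: apply() with MARGIN = 1 returns each row's result as a column, so the logical mask comes back transposed. The codes matrix here is hard-coded purely for illustration (an assumption, so that the example runs without stringdist); raw and row_dups are names introduced only for this sketch.

```r
# Hard-coded Soundex codes, so that no package is needed for this illustration.
codes <- matrix(
  c("P442", "P442", "H225",
    "F635", "F635", "M246"),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("Name1", "Name2", "Name3"))
)

# apply() over rows binds each row's result as a COLUMN, so this is 3 x 2.
raw <- apply(X = codes, MARGIN = 1, FUN = duplicated)
dim(raw)

# Transposing restores the 2 x 3 shape, so the mask lines up with `codes`.
row_dups <- t(raw)
dim(row_dups)

# Logical matrix indexing replaces only the flagged cells; dimensions are kept.
codes[row_dups] <- "DUPLICATE"
codes
```

Without the transpose, the mask would have the wrong shape and would flag the wrong cells when used as an index.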

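Update

At the poster's request, I changed the code to compare duplicates only within each row, rather than across the entire result matrix. Now a test dataset such as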

test <- tibble::tribble(
    ~Name1,           ~Name2,           ~Name3,
    "Paul Walker",    "Paule Walkr",    "Heiko Knaup",
    "Ferdinand Bass", "Ferdinand Base", "Michael Herre",
    "",               "01234 56789",    "Heiko Knaup"
# | ^^              | ^^^^^^^^^^^^^   | ^^^^^^^^^^^^^                   |
# | Coded as "0000" | Coded as "0000" | Duplicate in matrix, NOT in row |
)

will give a result like

     Name1  Name2       Name3 
[1,] "P442" "DUPLICATE" "H225"
[2,] "F635" "DUPLICATE" "M246"
[3,] ""     ""          "H225"
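
If this is needed in more than one place, the steps could be wrapped in a small function. The following is only a sketch: mark_row_duplicates, its blank and label arguments, and the hard-coded codes matrix are all hypothetical names and values introduced here, not part of stringdist or of the code above. It assumes a character matrix without NA values.

```r
# Hypothetical helper (not part of the original answer or of stringdist).
mark_row_duplicates <- function(codes, blank = "0000", label = "DUPLICATE") {
  stopifnot(is.matrix(codes), is.character(codes))
  blanks <- codes == blank
  # Row-wise duplicates; t() undoes the transposition introduced by apply().
  row_dups <- !blanks & t(apply(X = codes, MARGIN = 1, FUN = duplicated))
  codes[blanks] <- ""
  codes[row_dups] <- label
  codes
}

# Soundex codes hard-coded for illustration, mirroring the three-row example.
codes <- matrix(
  c("P442", "P442", "H225",
    "F635", "F635", "M246",
    "0000", "0000", "H225"),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("Name1", "Name2", "Name3"))
)

marked <- mark_row_duplicates(codes)
marked

# For contrast, a whole-matrix comparison would also flag the second "H225".
matrix_dups <- array(duplicated(as.vector(codes)), dim = dim(codes))
matrix_dups[3, 3]
```

The row-wise version leaves the "H225" in the third row untouched, whereas the whole-matrix mask marks it as a duplicate of the one in the first row.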
