Replace duplicates in matrix
I have the following test code:
####TESTING HERE
test = tibble::tribble(
~Name1, ~Name2, ~Name3,
"Paul Walker", "Paule Walkr", "Heiko Knaup",
"Ferdinand Bass", "Ferdinand Base", "Michael Herre"
)
library(stringdist)
output <- list()
for (row in 1:nrow(test))
{
codephon = phonetic(test[row,], method = c("soundex"), useBytes = FALSE)
output[[row]] <- codephon
}
#building the matrix with soundex input
phoneticmatrix = matrix(output)
soundexspalten=str_split_fixed(phoneticmatrix, ",", 3)
#> Error in str_split_fixed(phoneticmatrix, ",", 3): konnte Funktion "str_split_fixed" nicht finden
soundexmatrix0 = gsub('[()c"]', '', soundexspalten)
#> Error in gsub("[()c\"]", "", soundexspalten): Objekt 'soundexspalten' nicht gefunden
soundexmatrix1 = gsub("0000", "", soundexmatrix0)
#> Error in gsub("0000", "", soundexmatrix0): Objekt 'soundexmatrix0' nicht gefunden
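Created on 2021-06-03 by the reprex package (v2.0.0)
Now I want to replace all duplicates in soundexmatrix1 with the string "DUPLICATE", so that the dimensions of the matrix stay the same and all duplicates are immediately visible.
Is there a way to do this?
Thanks for your help!
To check for duplicates within each row (see Update), this should do what you want, and more concisely: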
# Feel free to load the packages you're using.
# library(stringdist)
# library(tibble)
test <- tibble::tribble(
~Name1, ~Name2, ~Name3,
"Paul Walker", "Paule Walkr", "Heiko Knaup",
"Ferdinand Bass", "Ferdinand Base", "Michael Herre"
)
# Get phonetic codes cleanly.
result <- as.matrix(apply(X = test, MARGIN = 2,
FUN = stringdist::phonetic, method = c("soundex"), useBytes = FALSE))
# Find all blank codes ("0000").
blanks <- result == "0000"
# # Find all duplicates, as compared across ENTIRE matrix; ignore blank codes.
# all_duplicates <- !blanks & duplicated(result, MARGIN = 0)
# Find duplicates, as compared within EACH ROW; ignore blank codes.
row_duplicates <- !blanks & t(apply(X = result, MARGIN = 1, FUN = duplicated))
# Replace blank codes ("0000") with blanks (""); and replace duplicates (found
# within rows) with "DUPLICATE".
result[blanks] <- ""
result[row_duplicates] <- "DUPLICATE"
# View result.
result
result should be the following matrix:
Name1 Name2 Name3
[1,] "P442" "DUPLICATE" "H225"
[2,] "F635" "DUPLICATE" "M246"
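The row-wise replacement above could also be wrapped in a small helper. This is only a minimal sketch and not part of the original answer: the function name mark_row_duplicates and its arguments are hypothetical, and the input matrix is hard-coded with the Soundex codes implied by the output above (rather than computed with stringdist) so that it runs in base R.

```r
# Hypothetical helper (not from the original answer): mark within-row
# duplicates in a character matrix while keeping its dimensions.
mark_row_duplicates <- function(codes, blank = "0000", label = "DUPLICATE") {
  stopifnot(is.matrix(codes), is.character(codes))
  # Cells holding the "empty" Soundex code.
  blanks <- codes == blank
  # apply() over rows returns one column per input row, so transpose back.
  # Wrapping in matrix() guards the one-column case, where apply() returns
  # a plain vector instead of a matrix.
  dup <- t(matrix(apply(codes, 1, duplicated), nrow = ncol(codes)))
  codes[blanks] <- ""
  codes[!blanks & dup] <- label
  codes
}

# Assumed input: the codes implied by the answer's output.
codes <- matrix(c("P442", "P442", "H225",
                  "F635", "F635", "M246"),
                nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("Name1", "Name2", "Name3")))
marked <- mark_row_duplicates(codes)
print(marked)
```

Because the replacement is done by logical indexing, the matrix keeps its dimensions and column names, which is the constraint stated in the question.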
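Update
In response to the poster, I changed the code to compare duplicates only within each row, rather than across the entire result matrix. Now, a test dataset like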
test <- tibble::tribble(
~Name1, ~Name2, ~Name3,
"Paul Walker", "Paule Walkr", "Heiko Knaup",
"Ferdinand Bass", "Ferdinand Base", "Michael Herre",
"", "01234 56789", "Heiko Knaup"
# | ^^ | ^^^^^^^^^^^^^ | ^^^^^^^^^^^^^ |
# | Coded as "0000" | Coded as "0000" | Duplicate in matrix, NOT in row |
)
will give a result like
Name1 Name2 Name3
[1,] "P442" "DUPLICATE" "H225"
[2,] "F635" "DUPLICATE" "M246"
[3,] "" "" "H225"
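To make the difference between the two comparison modes concrete, the following sketch contrasts the commented-out whole-matrix check with the row-wise check on the three-row example. It is an illustration under assumptions: the codes matrix is hard-coded from the outputs and annotations shown above instead of being computed with stringdist.

```r
# Assumed Soundex codes for the three-row test data (hard-coded for illustration).
codes <- matrix(c("P442", "P442", "H225",
                  "F635", "F635", "M246",
                  "0000", "0000", "H225"),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("Name1", "Name2", "Name3")))

blanks <- codes == "0000"

# Whole-matrix comparison: every cell is compared against all earlier cells
# (in column-major order), so the second "H225" counts as a duplicate.
all_duplicates <- !blanks & duplicated(codes, MARGIN = 0)

# Row-wise comparison: a cell is a duplicate only if the same code appears
# earlier in the same row.
row_duplicates <- !blanks & t(apply(codes, 1, duplicated))

print(all_duplicates)
print(row_duplicates)
```

With the whole-matrix check, the "H225" in row 3 is flagged because it already occurs in row 1; with the row-wise check it is left alone, which matches the result shown above.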