Make sure department names in one column are spelled like department names in another column

I have two datasets.

The first is the data I am currently working with, which needs to be changed (it contains misspellings). It looks like this:

df<-structure(list(username = c("hmaens", "pmgcann", "gsamse", "SCundan", 
"kflower1", "ahazra"), Department = c("Hematology Oncology2", 
"Pediatric Hematology Oncology", "Cancer Institute", 
"Hematology Oncology Cancer InstituteClinical Research Center", 
"Emergency Medicine Research", "Emergency Medicine Resaerch"), 
    `Access Control` = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes"
    ), `Organizational Unit` = structure(c(1L, 1L, 1L, 1L, 2L, 
    2L), .Label = c("Cancer Institute", "General Research"
    ), class = "factor"), ManagementGroup = c("Cancer Institute - Hematology Oncology", 
    "Cancer Institute - Pediatric Hematology Oncology", 
    "Cancer Institute - Cancer Institooote", "Cancer Institute - HematologyOncology Cancer Institute Clinical Research Center", 
    "General Research - Emergency Medicine Resaerch", "General Research - EmergencyMedicine Research"
    )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

The other data frame is a reference list of how the management groups should be spelled:

df2<-structure(list(Department = c("General Research - Cardiology ", 
"General Research - Dermatology Clinical Studies Unit ", "General Research - Infectious Diseases ", 
"General Research - Clinical Research Center ", "General Research - Nephrology ", 
"General Research - Pediatric Endocrinology; Metabolism ", "General Research - Pediatric Hematology\\Oncology ", 
"General Research - Radiation Therapy ", "Cancer Institute - Cancer Institute ", 
"Cancer Institute - Neurology - LCI ", "Cancer Institute - Neurosurgery - LCI ", 
"Cancer Institute - Pediatric Hematology/Oncology-LCI ", 
"Cancer Institute - Pediatric Hemophilia/Thrombosis Center - LCI ", 
"Cancer Institute - Radiation Therapy - LCI ", "General Research - Cardiology ", 
"General Research - Dermatology Clinical Studies Unit ", "General Research - Diagnostic Imaging ", 
"General Research - Emergency Medicine Research ", "General Research - Clinical Research Center ", 
"General Research - Nephrology ", "General Research - Neurology ", 
"Cancer Institute - Hematology/Oncology ", "Cancer Institute - Cancer Institute ", 
"Cancer Institute - Neurology - LCI ")), row.names = c(NA, 
-24L), class = c("tbl_df", "tbl", "data.frame"))

My question is: is there a way to automatically reference the second data frame to 'correct' the spelling and change the column in bulk?

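I realize I could fix the misspellings one at a time with the methods used in this answer, e.g. by writing individual lines of code that say "no, resAErch should be spelled 'research'"... but is there a way to have R look up the 'nearest' spelling in the second data frame and change the value to that spelling?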

In other words, is it possible to write R code that checks df$ManagementGroup, notices that "Cancer Institute - Cancer Institooote" is very similar to "Cancer Institute - Cancer Institute" in df2$Department, and then fixes the spelling?

Ideally it would also adopt the spelling and spacing from the second data frame, if that makes sense.

This answer using {fuzzyjoin} is relevant. Good luck!

library(fuzzyjoin)
library(dplyr)

df <- structure(list(
  username = c(
    "hmaens", "pmgcann", "gsamse", "SCundan",
    "kflower1", "ahazra"
  ),
  Department = c(
    "Hematology Oncology2",
    "Pediatric Hematology Oncology", "Cancer Institute",
    "Hematology Oncology Cancer InstituteClinical Research Center",
    "Emergency Medicine Research", "Emergency Medicine Resaerch"
  ),
  `Access Control` = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
  `Organizational Unit` = structure(c(
    1L, 1L, 1L, 1L, 2L,
    2L
  ), .Label = c("Cancer Institute", "General Research"), class = "factor"),
  ManagementGroup = c(
    "Cancer Institute - Hematology Oncology",
    "Cancer Institute - Pediatric Hematology Oncology",
    "Cancer Institute - Cancer Institooote", "Cancer Institute - HematologyOncology Cancer Institute Clinical Research Center",
    "General Research - Emergency Medicine Resaerch", "General Research - EmergencyMedicine Research"
  )
), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

df2 <- structure(list(ManagementGroup = c(
  "General Research - Cardiology ",
  "General Research - Dermatology Clinical Studies Unit ", "General Research - Infectious Diseases ",
  "General Research - Clinical Research Center ", "General Research - Nephrology ",
  "General Research - Pediatric Endocrinology; Metabolism ", "General Research - Pediatric Hematology\\Oncology ",
  "General Research - Radiation Therapy ", "Cancer Institute - Cancer Institute ",
  "Cancer Institute - Neurology - LCI ", "Cancer Institute - Neurosurgery - LCI ",
  "Cancer Institute - Pediatric Hematology/Oncology-LCI ",
  "Cancer Institute - Pediatric Hemophilia/Thrombosis Center - LCI ",
  "Cancer Institute - Radiation Therapy - LCI ", "General Research - Cardiology ",
  "General Research - Dermatology Clinical Studies Unit ", "General Research - Diagnostic Imaging ",
  "General Research - Emergency Medicine Research ", "General Research - Clinical Research Center ",
  "General Research - Nephrology ", "General Research - Neurology ",
  "Cancer Institute - Hematology/Oncology ", "Cancer Institute - Cancer Institute ",
  "Cancer Institute - Neurology - LCI "
)), row.names = c(
  NA,
  -24L
), class = c("tbl_df", "tbl", "data.frame"))


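# Fuzzy left join: pair every row of df with rows of df2 by string distance.
# Jaro-Winkler ("jw") distances lie between 0 and 1, so max_dist = 99 keeps
# every candidate pair; the closest one is selected afterwards.
final_df <- stringdist_join(df, df2,
  by = "ManagementGroup",
  mode = "left",
  ignore_case = FALSE,
  method = "jw",
  max_dist = 99,
  distance_col = "dist") %>%
  # For each original (possibly misspelled) value, keep the nearest reference.
  group_by(ManagementGroup.x) %>%
  slice_min(order_by = dist, n = 1) %>%
  # df2 contains repeated entries, so drop the duplicated matches.
  # ManagementGroup.y now holds the reference spelling for each row.
  distinct()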

Created on 2022-04-05 by the reprex package (v2.0.1)
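
For comparison, the same "nearest spelling" lookup can be sketched in base R without extra packages. This is only a minimal illustration, not the approach from the answer above: it swaps Jaro-Winkler for the generalized Levenshtein distance computed by `utils::adist()`, and the vectors `misspelled` and `reference` are small hypothetical stand-ins for `df$ManagementGroup` and `df2$Department`.

```r
# Hypothetical toy data mirroring a few values from the question.
misspelled <- c(
  "Cancer Institute - Cancer Institooote",
  "General Research - Emergency Medicine Resaerch",
  "General Research - EmergencyMedicine Research"
)
reference <- unique(c(
  "Cancer Institute - Cancer Institute ",
  "General Research - Emergency Medicine Research ",
  "General Research - Cardiology ",
  "Cancer Institute - Neurology - LCI "
))

# adist() returns a matrix of generalized Levenshtein distances:
# one row per misspelled value, one column per reference value.
d <- utils::adist(misspelled, reference)

# For each row, take the reference entry with the smallest distance
# (which.min() picks the first one if there is a tie).
nearest <- reference[apply(d, 1, which.min)]

# Show the original value next to the reference spelling chosen for it.
corrected <- data.frame(original = misspelled, corrected = nearest)
print(corrected)
```

The reference strings are used exactly as stored, including their trailing space, so the result carries over the spelling and spacing of the reference list. Because the closest entry is always returned, even for values that have no sensible counterpart, it is worth inspecting the distances in `d` before overwriting a real column.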