Make sure department names in one column are spelled like department names in another column
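I have two datasets.
The first is the data I am currently working with, which contains misspellings that need to be fixed. It looks like this: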
df<-structure(list(username = c("hmaens", "pmgcann", "gsamse", "SCundan",
"kflower1", "ahazra"), Department = c("Hematology Oncology2",
"Pediatric Hematology Oncology", "Cancer Institute",
"Hematology Oncology Cancer InstituteClinical Research Center",
"Emergency Medicine Research", "Emergency Medicine Resaerch"),
`Access Control` = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes"
), `Organizational Unit` = structure(c(1L, 1L, 1L, 1L, 2L,
2L), .Label = c("Cancer Institute", "General Research"
), class = "factor"), ManagementGroup = c("Cancer Institute - Hematology Oncology",
"Cancer Institute - Pediatric Hematology Oncology",
"Cancer Institute - Cancer Institooote", "Cancer Institute - HematologyOncology Cancer Institute Clinical Research Center",
"General Research - Emergency Medicine Resaerch", "General Research - EmergencyMedicine Research"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
The other data frame is a reference list of how the management groups should be spelled:
df2<-structure(list(Department = c("General Research - Cardiology ",
"General Research - Dermatology Clinical Studies Unit ", "General Research - Infectious Diseases ",
"General Research - Clinical Research Center ", "General Research - Nephrology ",
"General Research - Pediatric Endocrinology; Metabolism ", "General Research - Pediatric Hematology\\Oncology ",
"General Research - Radiation Therapy ", "Cancer Institute - Cancer Institute ",
"Cancer Institute - Neurology - LCI ", "Cancer Institute - Neurosurgery - LCI ",
"Cancer Institute - Pediatric Hematology/Oncology-LCI ",
"Cancer Institute - Pediatric Hemophilia/Thrombosis Center - LCI ",
"Cancer Institute - Radiation Therapy - LCI ", "General Research - Cardiology ",
"General Research - Dermatology Clinical Studies Unit ", "General Research - Diagnostic Imaging ",
"General Research - Emergency Medicine Research ", "General Research - Clinical Research Center ",
"General Research - Nephrology ", "General Research - Neurology ",
"Cancer Institute - Hematology/Oncology ", "Cancer Institute - Cancer Institute ",
"Cancer Institute - Neurology - LCI ")), row.names = c(NA,
-24L), class = c("tbl_df", "tbl", "data.frame"))
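My question is: is there a way to automatically reference the second data frame to 'correct' the spellings and change the column in bulk?
I realize I could fix the misspellings one at a time with the methods used in this answer, e.g. I could write individual lines of code saying "no, resAErch should be spelled 'research'"... but is there a way to have R look up the 'nearest' spelling in the second data frame and change the value to that spelling?
In other words, is it possible to write R code that examines df$ManagementGroup, notices that "Cancer Institute - Cancer Institooote" is very similar to "Cancer Institute - Cancer Institute" in df2$Department, and then fixes the spelling?
If that makes sense, ideally it would also adopt the spelling and spacing used in the second data frame.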
This answer using {fuzzyjoin} is relevant. Good luck!
library(fuzzyjoin)
library(dplyr)
df <- structure(list(
username = c(
"hmaens", "pmgcann", "gsamse", "SCundan",
"kflower1", "ahazra"
),
Department = c(
"Hematology Oncology2",
"Pediatric Hematology Oncology", "Cancer Institute",
"Hematology Oncology Cancer InstituteClinical Research Center",
"Emergency Medicine Research", "Emergency Medicine Resaerch"
),
`Access Control` = c("Yes", "Yes", "Yes", "Yes", "Yes", "Yes"),
`Organizational Unit` = structure(c(
1L, 1L, 1L, 1L, 2L,
2L
), .Label = c("Cancer Institute", "General Research"), class = "factor"),
ManagementGroup = c(
"Cancer Institute - Hematology Oncology",
"Cancer Institute - Pediatric Hematology Oncology",
"Cancer Institute - Cancer Institooote", "Cancer Institute - HematologyOncology Cancer Institute Clinical Research Center",
"General Research - Emergency Medicine Resaerch", "General Research - EmergencyMedicine Research"
)
), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
df2 <- structure(list(ManagementGroup = c(
"General Research - Cardiology ",
"General Research - Dermatology Clinical Studies Unit ", "General Research - Infectious Diseases ",
"General Research - Clinical Research Center ", "General Research - Nephrology ",
"General Research - Pediatric Endocrinology; Metabolism ", "General Research - Pediatric Hematology\\Oncology ",
"General Research - Radiation Therapy ", "Cancer Institute - Cancer Institute ",
"Cancer Institute - Neurology - LCI ", "Cancer Institute - Neurosurgery - LCI ",
"Cancer Institute - Pediatric Hematology/Oncology-LCI ",
"Cancer Institute - Pediatric Hemophilia/Thrombosis Center - LCI ",
"Cancer Institute - Radiation Therapy - LCI ", "General Research - Cardiology ",
"General Research - Dermatology Clinical Studies Unit ", "General Research - Diagnostic Imaging ",
"General Research - Emergency Medicine Research ", "General Research - Clinical Research Center ",
"General Research - Nephrology ", "General Research - Neurology ",
"Cancer Institute - Hematology/Oncology ", "Cancer Institute - Cancer Institute ",
"Cancer Institute - Neurology - LCI "
)), row.names = c(
NA,
-24L
), class = c("tbl_df", "tbl", "data.frame"))
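# Fuzzy left join on ManagementGroup using the Jaro-Winkler distance.
# That distance lies between 0 and 1, so max_dist = 99 keeps every candidate pair.
final_df <- stringdist_join(df, df2,
  by = "ManagementGroup",
  mode = "left",
  ignore_case = FALSE,
  method = "jw",
  max_dist = 99,
  distance_col = "dist"
) %>%
  # For each original value, keep only the closest reference spelling.
  group_by(ManagementGroup.x) %>%
  slice_min(order_by = dist, n = 1) %>%
  # df2 lists some groups more than once, so drop the repeated matches.
  distinct()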
Created on 2022-04-05 by the reprex package (v2.0.1)
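If installing {fuzzyjoin} is not an option, the same "pick the nearest reference spelling" idea can be sketched in base R. This is only an illustration and not the method used above: `nearest_spelling` is a hypothetical helper name, it swaps the Jaro-Winkler distance for the generalized Levenshtein distance from `utils::adist()`, and the two short vectors are a hand-picked subset of the values in the question.

```r
# Hypothetical helper (not part of the answer above): for each value in x,
# return the entry of `reference` with the smallest edit distance.
nearest_spelling <- function(x, reference) {
  reference <- unique(reference)        # the reference list repeats some groups
  d <- utils::adist(x, reference)       # matrix: rows follow x, columns follow reference
  reference[apply(d, 1, which.min)]     # first closest reference for each row
}

# A few misspelled values taken from df$ManagementGroup in the question.
misspelled <- c(
  "Cancer Institute - Cancer Institooote",
  "General Research - Emergency Medicine Resaerch",
  "General Research - EmergencyMedicine Research"
)

# A few reference spellings taken from df2, trailing spaces included.
reference <- c(
  "Cancer Institute - Cancer Institute ",
  "General Research - Emergency Medicine Research ",
  "General Research - Cardiology ",
  "Cancer Institute - Neurology - LCI "
)

corrected <- nearest_spelling(misspelled, reference)
print(corrected)
```

Applied to the full data this would look like `df$ManagementGroup <- nearest_spelling(df$ManagementGroup, df2$ManagementGroup)`. Note that a nearest match is always returned, even when nothing in the reference list is a good fit (for example the long "HematologyOncology Cancer Institute Clinical Research Center" entry), so it is worth inspecting the distances before overwriting the column. Wrapping the reference in `trimws()` would drop the trailing spaces if they are not wanted.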