How to merge two data frames with specific string match in columns in R?

I have two data frames, data1 and data2, which contain the following information:

dput(data1)

structure(list(ProfName = c("Hua (Christine) Xin", "Dereck Barr-Pulliam", 
"Lisa M. Blum", "Russell  Williamson", "William D. Stout", "Michael F. Wade", 
"Sheila A.  Johnston", "Julie Huang", "Alan Attaway", "Alan Levitan", 
"Benjamin P. Foster", "Carolyn M.  Callahan"), Title = c(" PhD", 
" PhD", " LLM", " PhD", " PhD", " CPA", " MS", " PhD", " PhD", 
" PhD", " PhD", " PhD"), Profession = c("Assistant Professor", 
"Assistant Professor", "Instructor", "Assistant Professor", "Associate Professor and Director", 
"Instructor", "Instructor", "Associate Professor", "Professor", 
"Professor", "Professor", "Brown-Forman Professor of Accountancy"
)), row.names = c(8L, 18L, 25L, 36L, 49L, 50L, 56L, 69L, 71L, 
82L, 88L, 89L), class = "data.frame")

data1 looks like this (image not available):

dput(data2)

structure(list(ProfName = c("Blandford, K     ", "Okafor, A     ", 
"Johnston, S     ", "Rolen, R     ", "Attaway, A     ", "Xin, H     ", 
"Huang, Y     ", "Stout, W     ", "Williamson, R     ", "Callahan, C     ", 
"Foster, B     ", "Blum, L     ", "Levitan, A     ", "Barr-Pulliam, D     ", 
"Wade, M     ")), row.names = c(NA, -15L), class = "data.frame")

data2 looks like this (image not available):

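I want to merge the two data frames, but the names are formatted differently. Only one specific string (the surname) matches between the ProfName columns of the two data frames. The data should be merged, and where a name has no matching information, the fields should be left empty. For rows with no information in the Title and Profession columns, the ProfNameNew column should keep the name as it is.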

I tried using merge, but it did not give the desired output.

merge(data1, data2, by="ProfName", all.x=TRUE, all.y = TRUE)
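A quick check on a two-row toy subset suggests why this fails: merge() matches keys by exact string equality, and the two ProfName columns share no identical strings. The vectors a and b below are illustrative stand-ins, not part of the original post.

```r
# Toy subset of the two ProfName columns (illustrative only)
a <- c("Hua (Christine) Xin", "Alan Attaway")
b <- c("Xin, H     ", "Attaway, A     ")

# merge() joins on exact string equality, so no key is shared
shared <- intersect(a, b)
length(shared)

# A full outer merge therefore just stacks the rows without matching any
m <- merge(data.frame(ProfName = a), data.frame(ProfName = b),
           by = "ProfName", all = TRUE)
nrow(m)
```

So the names have to be reduced to a common key, such as the surname, before merging.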

The output should look like this (image not available):

Does this work:

> library(dplyr)
> df %>% mutate(secName = trimws(gsub('(.*)\\s(.*)$', '\\2', ProfName))) %>% 
+   right_join(df1 %>% mutate(secName = trimws(gsub('(.*)(, .)', '\\1', ProfName))) %>% rename(new = ProfName)) %>% 
+   mutate(ProfName = coalesce(ProfName, new)) %>% 
+   select(-secName)
Joining, by = "secName"
               ProfName Title                            Profession                  new
1   Hua (Christine) Xin   PhD                   Assistant Professor          Xin, H     
2   Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam, D     
3          Lisa M. Blum   LLM                            Instructor         Blum, L     
4   Russell  Williamson   PhD                   Assistant Professor   Williamson, R     
5      William D. Stout   PhD      Associate Professor and Director        Stout, W     
6       Michael F. Wade   CPA                            Instructor         Wade, M     
7   Sheila A.  Johnston    MS                            Instructor     Johnston, S     
8           Julie Huang   PhD                   Associate Professor        Huang, Y     
9          Alan Attaway   PhD                             Professor      Attaway, A     
10         Alan Levitan   PhD                             Professor      Levitan, A     
11   Benjamin P. Foster   PhD                             Professor       Foster, B     
12 Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy     Callahan, C     
13    Blandford, K       <NA>                                  <NA>    Blandford, K     
14       Okafor, A       <NA>                                  <NA>       Okafor, A     
15        Rolen, R       <NA>                                  <NA>        Rolen, R     
> 

Data used:

> df
               ProfName Title                            Profession
8   Hua (Christine) Xin   PhD                   Assistant Professor
18  Dereck Barr-Pulliam   PhD                   Assistant Professor
25         Lisa M. Blum   LLM                            Instructor
36  Russell  Williamson   PhD                   Assistant Professor
49     William D. Stout   PhD      Associate Professor and Director
50      Michael F. Wade   CPA                            Instructor
56  Sheila A.  Johnston    MS                            Instructor
69          Julie Huang   PhD                   Associate Professor
71         Alan Attaway   PhD                             Professor
82         Alan Levitan   PhD                             Professor
88   Benjamin P. Foster   PhD                             Professor
89 Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy
> df1
               ProfName
1     Blandford, K     
2        Okafor, A     
3      Johnston, S     
4         Rolen, R     
5       Attaway, A     
6           Xin, H     
7         Huang, Y     
8         Stout, W     
9    Williamson, R     
10     Callahan, C     
11       Foster, B     
12         Blum, L     
13      Levitan, A     
14 Barr-Pulliam, D     
15         Wade, M     
> 

Here is a simple solution:

library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)

# Extract the surname: the last word in data1, the first word in data2
data1 %<>% mutate(lname = str_extract(ProfName, "[A-Za-z\\-]+$"))
data2 %<>% mutate(lname = str_extract(ProfName, "^[A-Za-z\\-]+"))

df <- merge(data1, data2, all.y = TRUE, by = "lname")

head(df)

#          lname           ProfName.x Title                            Profession           ProfName.y
# 1      Attaway         Alan Attaway   PhD                             Professor      Attaway, A     
# 2 Barr-Pulliam  Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam, D     
# 3    Blandford                 <NA>  <NA>                                  <NA>    Blandford, K     
# 4         Blum         Lisa M. Blum   LLM                            Instructor         Blum, L     
# 5     Callahan Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy     Callahan, C     
# 6       Foster   Benjamin P. Foster   PhD                             Professor       Foster, B
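The merged result still has separate ProfName.x and ProfName.y columns. One possible way to collapse them into the single ProfNameNew column the question asks for is sketched below in base R on a small toy subset. The frames d1 and d2 and the use of ifelse() are assumptions for illustration, not part of the answer above.

```r
# Toy subset of the question's data (illustrative only)
d1 <- data.frame(ProfName = c("Alan Attaway", "Dereck Barr-Pulliam"),
                 Title = c("PhD", "PhD"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(ProfName = c("Attaway, A     ", "Blandford, K     ",
                              "Barr-Pulliam, D     "),
                 stringsAsFactors = FALSE)

# Surname key: trailing word in d1, leading word in d2 (hyphens allowed).
# regmatches() drops non-matching elements; every toy name matches here.
d1$lname <- regmatches(d1$ProfName, regexpr("[A-Za-z-]+$", d1$ProfName))
d2$lname <- regmatches(d2$ProfName, regexpr("^[A-Za-z-]+", d2$ProfName))

# Keep every row of d2, as in the merge above
out <- merge(d1, d2, by = "lname", all.y = TRUE)

# Prefer the full name from d1; fall back to the trimmed d2 name
out$ProfNameNew <- ifelse(is.na(out$ProfName.x),
                          trimws(out$ProfName.y),
                          out$ProfName.x)
out[, c("ProfNameNew", "Title")]
```

Note that keying on the surname alone would conflate two people who share a surname, so this only suits data where surnames are unique, as they are in the sample.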