How to merge two data frames with specific string match in columns in R?
I have two data frames, data1 and data2, that contain the following information:
dput(data1)
structure(list(ProfName = c("Hua (Christine) Xin", "Dereck Barr-Pulliam",
"Lisa M. Blum", "Russell Williamson", "William D. Stout", "Michael F. Wade",
"Sheila A. Johnston", "Julie Huang", "Alan Attaway", "Alan Levitan",
"Benjamin P. Foster", "Carolyn M. Callahan"), Title = c(" PhD",
" PhD", " LLM", " PhD", " PhD", " CPA", " MS", " PhD", " PhD",
" PhD", " PhD", " PhD"), Profession = c("Assistant Professor",
"Assistant Professor", "Instructor", "Assistant Professor", "Associate Professor and Director",
"Instructor", "Instructor", "Associate Professor", "Professor",
"Professor", "Professor", "Brown-Forman Professor of Accountancy"
)), row.names = c(8L, 18L, 25L, 36L, 49L, 50L, 56L, 69L, 71L,
82L, 88L, 89L), class = "data.frame")
dput(data2)
structure(list(ProfName = c("Blandford, K ", "Okafor, A ",
"Johnston, S ", "Rolen, R ", "Attaway, A ", "Xin, H ",
"Huang, Y ", "Stout, W ", "Williamson, R ", "Callahan, C ",
"Foster, B ", "Blum, L ", "Levitan, A ", "Barr-Pulliam, D ",
"Wade, M ")), row.names = c(NA, -15L), class = "data.frame")
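I want to merge the two data frames, but the names are written differently. Only one part of the string (the last name) matches between the ProfName columns of the two data frames. The data should be merged, and where there is no information for a name, the fields should be empty. For names with no information in the Title and Profession columns, the ProfName and New columns should contain the same name.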
I tried using merge, but it did not give the desired output.
merge(data1, data2, by="ProfName", all.x=TRUE, all.y = TRUE)
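To illustrate why this merge cannot line the rows up, here is a minimal sketch on a few rows copied from the dput output above. The names d1, d2, shared and stacked are only illustrative and are not part of the original post.

```r
# A few rows copied from the question's data (illustrative subset)
d1 <- data.frame(
  ProfName = c("Hua (Christine) Xin", "Alan Attaway"),
  Title = c(" PhD", " PhD"),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  ProfName = c("Xin, H ", "Attaway, A ", "Rolen, R "),
  stringsAsFactors = FALSE
)

# merge() compares the key column by exact equality,
# and no full-name string occurs in both data frames
shared <- intersect(d1$ProfName, d2$ProfName)
length(shared)

# A full outer merge therefore keeps every row separately
# instead of pairing "Hua (Christine) Xin" with "Xin, H "
stacked <- merge(d1, d2, by = "ProfName", all = TRUE)
nrow(stacked)
```

Because nothing matches exactly, a join key that both spellings share (here the last name) has to be derived first.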
The output should look like this:
Does this work:
> library(dplyr)
> df %>% mutate(secName = trimws(gsub('(.*)\\s(.*)$', '\\2', ProfName))) %>%
+ right_join(df1 %>% mutate(secName = trimws(gsub('(.*)(, .)', '\\1', ProfName))) %>% rename(new = ProfName)) %>%
+ mutate(ProfName = coalesce(ProfName, new)) %>%
+ select(-secName)
Joining, by = "secName"
              ProfName Title                            Profession             new
 1 Hua (Christine) Xin   PhD                   Assistant Professor          Xin, H
 2 Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam, D
 3        Lisa M. Blum   LLM                            Instructor         Blum, L
 4  Russell Williamson   PhD                   Assistant Professor   Williamson, R
 5    William D. Stout   PhD      Associate Professor and Director        Stout, W
 6     Michael F. Wade   CPA                            Instructor         Wade, M
 7  Sheila A. Johnston    MS                            Instructor     Johnston, S
 8         Julie Huang   PhD                   Associate Professor        Huang, Y
 9        Alan Attaway   PhD                             Professor      Attaway, A
10        Alan Levitan   PhD                             Professor      Levitan, A
11  Benjamin P. Foster   PhD                             Professor       Foster, B
12 Carolyn M. Callahan   PhD Brown-Forman Professor of Accountancy     Callahan, C
13        Blandford, K  <NA>                                  <NA>    Blandford, K
14           Okafor, A  <NA>                                  <NA>       Okafor, A
15            Rolen, R  <NA>                                  <NA>        Rolen, R
>
Data used:
> df
              ProfName Title                            Profession
 8 Hua (Christine) Xin   PhD                   Assistant Professor
18 Dereck Barr-Pulliam   PhD                   Assistant Professor
25        Lisa M. Blum   LLM                            Instructor
36  Russell Williamson   PhD                   Assistant Professor
49    William D. Stout   PhD      Associate Professor and Director
50     Michael F. Wade   CPA                            Instructor
56  Sheila A. Johnston    MS                            Instructor
69         Julie Huang   PhD                   Associate Professor
71        Alan Attaway   PhD                             Professor
82        Alan Levitan   PhD                             Professor
88  Benjamin P. Foster   PhD                             Professor
89 Carolyn M. Callahan   PhD Brown-Forman Professor of Accountancy
> df1
ProfName
1 Blandford, K
2 Okafor, A
3 Johnston, S
4 Rolen, R
5 Attaway, A
6 Xin, H
7 Huang, Y
8 Stout, W
9 Williamson, R
10 Callahan, C
11 Foster, B
12 Blum, L
13 Levitan, A
14 Barr-Pulliam, D
15 Wade, M
>
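The key-extraction step in the pipeline above can also be checked on its own. The following is a minimal base R sketch, assuming the names are always "First ... Last" in one data frame and "Last, F" in the other; the vectors full_names and short_names and the variables key_full and key_short are illustrative only.

```r
# Sample names copied from the question (illustrative subset)
full_names  <- c("Hua (Christine) Xin", "Dereck Barr-Pulliam", "William D. Stout")
short_names <- c("Xin, H ", "Barr-Pulliam, D ", "Rolen, R ")

# "First Middle Last": drop everything up to the last whitespace
key_full <- sub("^.*\\s", "", trimws(full_names))

# "Last, F": drop everything from the first comma onward
key_short <- trimws(sub(",.*$", "", short_names))

key_full
key_short

# Duplicate last names would turn the join into a one-to-many match,
# so it is worth checking before joining (0 means no duplicates)
anyDuplicated(key_full)

# Position of each short name's key among the full names; NA means no match
match(key_short, key_full)
```

If two professors shared a last name, the last name alone would not be a safe key, and the first initial would have to be added to it.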
Here is a simple solution:
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)
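# Key for data1: the trailing run of letters and hyphens, i.e. the last name
data1 %<>% mutate(lname = str_extract(ProfName, "[A-Za-z\\-]+$"))
# Key for data2: the leading run of letters and hyphens, i.e. the text before the comma
data2 %<>% mutate(lname = str_extract(ProfName, "^[A-Za-z\\-]+"))
# Keep every row of data2; names without a match get NA in the data1 columns
df <- merge(data1, data2, all.y = TRUE, by = "lname")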
head(df)
#          lname          ProfName.x Title                            Profession      ProfName.y
# 1      Attaway        Alan Attaway   PhD                             Professor      Attaway, A
# 2 Barr-Pulliam Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam, D
# 3    Blandford                <NA>  <NA>                                  <NA>    Blandford, K
# 4         Blum        Lisa M. Blum   LLM                            Instructor         Blum, L
# 5     Callahan Carolyn M. Callahan   PhD Brown-Forman Professor of Accountancy     Callahan, C
# 6       Foster  Benjamin P. Foster   PhD                             Professor       Foster, B
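The merged result above still has ProfName.x and ProfName.y rather than the ProfName and New columns asked for in the question. One possible way to get that layout is sketched below in base R on a small subset of the data; the names d1, d2, merged and result are illustrative, and filling ProfName from the short name is my reading of the requirement rather than something this answer states.

```r
# Illustrative subset of the question's data
d1 <- data.frame(
  ProfName = c("Alan Attaway", "Lisa M. Blum"),
  Title = c(" PhD", " LLM"),
  Profession = c("Professor", "Instructor"),
  stringsAsFactors = FALSE
)
d2 <- data.frame(
  ProfName = c("Blandford, K ", "Attaway, A ", "Blum, L "),
  stringsAsFactors = FALSE
)

# Derive the shared last-name key in both data frames
d1$lname <- sub("^.*\\s", "", trimws(d1$ProfName))
d2$lname <- sub(",.*$", "", d2$ProfName)

# Keep every row of d2; unmatched rows get NA in the d1 columns
merged <- merge(d1, d2, by = "lname", all.y = TRUE)

# Build the requested layout: fall back to the short name
# when there is no full name for that person
result <- data.frame(
  ProfName = ifelse(is.na(merged$ProfName.x),
                    trimws(merged$ProfName.y),
                    merged$ProfName.x),
  Title = merged$Title,
  Profession = merged$Profession,
  New = trimws(merged$ProfName.y),
  stringsAsFactors = FALSE
)
result
```

Title and Profession stay NA for unmatched names; if literally empty strings are wanted, they can be replaced afterwards.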