Merge two data.frames using a column with the same strings but in a different order
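I am trying to merge two data.frames using a column that contains strings. The strings in both columns are names, and unfortunately they are not in the same order. In the example below, names in df_1 have the structure "name" + "midname" + "surname1" + "surname2", whereas in df_2 the structure is "surname1" + "surname2" + "name" + "midname".
I first tried a fuzzy merge on these names. However, it did not solve the problem, because completely different names still get a non-zero match. It is also important to define a cut-off point that determines when one name is completely different from another. I also expected a higher similarity between names that differ only in order (i.e., (name + midname) + (surname1 + surname2) in a different order).
Is there a better way to merge two data.frames using these names in a different order? Thanks in advance.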
# Structure: "name" + "midname" + "surname1" + "surname2"
df_1 <- read.table(header = TRUE, sep = "\t", text = "
name
Tetsurō Shoyo Hinata Kuroo
Kōtarō Tobio Kageyama Bokuto
Wakatoshi Daichi Sawamura Ushijima
Tōru Tsukishima Oikawa
Yūji Azumane Terushima
Kenma Kozume
")
# Structure: "surname1" + "surname2" + "name" + "midname"
df_2 <- read.table(header = TRUE, sep = "\t", text = "
name
Hinata Kuroo Tetsurō Shoyo
Kageyama Bokuto Kōtarō Tobio
Sawamura Ushijima Wakatoshi Daichi
Tsukishima Oikawa Tōru
Azumane Terushima Yūji
Kiyoomi Sakusa
")
library(fuzzyjoin)
library(dplyr)
stringdist_join(df_1, df_2,
                by = "name",
                mode = "inner",
                ignore_case = FALSE,
                method = "jw",
                max_dist = 99,
                distance_col = "dist") %>%
  group_by(name.x) %>%
  slice_min(order_by = dist, n = 1)
Result:
# A tibble: 6 x 3
# Groups: name.x [6]
name.x name.y dist
<chr> <chr> <dbl>
1 Kenma Kozume "Azumane Terushima Yuji " 0.416
2 Kotaro Tobio Kageyama Bokuto "Kageyama Bokuto Kotaro Tobio" 0.241
3 Tetsuro Shoyo Hinata Kuroo "Kageyama Bokuto Kotaro Tobio" 0.351
4 Toru Tsukishima Oikawa "Tsukishima Oikawa Toru " 0.302
5 Wakatoshi Daichi Sawamura Ushi~ "Sawamura Ushijima Wakatoshi D~ 0.366
6 Yuji Azumane Terushima "Azumane Terushima Yuji " 0.283
You can strsplit the individual names into words, sort them, and paste them back together. Then use match.
x <- sapply(strsplit(df_1$name, " +"), function(x) paste(sort(x), collapse = " "))
y <- sapply(strsplit(df_2$name, " +"), function(x) paste(sort(x), collapse = " "))
cbind(df_1$name, df_2$name[match(x, y)])
# [,1] [,2]
#[1,] "Tetsurō Shoyo Hinata Kuroo" "Hinata Kuroo Tetsurō Shoyo"
#[2,] "Kōtarō Tobio Kageyama Bokuto" "Kageyama Bokuto Kōtarō Tobio"
#[3,] "Wakatoshi Daichi Sawamura Ushijima" "Sawamura Ushijima Wakatoshi Daichi"
#[4,] "Tōru Tsukishima Oikawa" "Tsukishima Oikawa Tōru "
#[5,] "Yūji Azumane Terushima" "Azumane Terushima Yūji "
#[6,] "Kenma Kozume" NA
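To go from the lookup above to an actual merged data.frame, one option is to store the sorted-word string as a key column in both tables and merge on it. The following is only a sketch under a few assumptions: name_key is a hypothetical helper (not part of any package), the key column name is made up for illustration, and the data is a three-row subset of the sample in the question. It uses method = "radix" in sort so that the key does not depend on the session locale.

```r
# Hypothetical helper: build an order-insensitive key from each name.
# Splits on runs of spaces, sorts the words, and pastes them back together.
name_key <- function(x) {
  tokens <- strsplit(trimws(x), " +")
  vapply(tokens,
         function(tok) paste(sort(tok, method = "radix"), collapse = " "),
         character(1))
}

# Small subset of the question's sample data (assumption: one "name" column).
df_1 <- data.frame(name = c("Tetsurō Shoyo Hinata Kuroo",
                            "Tōru Tsukishima Oikawa",
                            "Kenma Kozume"),
                   stringsAsFactors = FALSE)
df_2 <- data.frame(name = c("Hinata Kuroo Tetsurō Shoyo",
                            "Tsukishima Oikawa Tōru ",
                            "Kiyoomi Sakusa"),
                   stringsAsFactors = FALSE)

# Add the key to both tables, then do a left join on it.
df_1$key <- name_key(df_1$name)
df_2$key <- name_key(df_2$name)
merged <- merge(df_1, df_2, by = "key", all.x = TRUE, suffixes = c(".x", ".y"))

merged[, c("name.x", "name.y")]
```

With all.x = TRUE every row of df_1 is kept, so a name with no counterpart in df_2 (here "Kenma Kozume") gets NA in name.y instead of being paired with its nearest fuzzy match. Note that this only matches names made of exactly the same words; misspellings or missing middle names would still need a fuzzy step.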