How do I lock the first digits of the 'by' column in a stringdist join?
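I am trying to merge two tables with stringdist_join. I built the 'by' variable as the concatenation of three variables, named as follows:
UAI : serial number
nom : last name
prenom : first name
The code below works well, but I would like the UAI part, which is always the first 8 characters of the variable UAInomprenom, to match exactly. How can I do that?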
stringdist_join(Ech_final_nom, BSA_affect_nom,
                by = "UAInomprenom",
                mode = "left",
                ignore_case = FALSE,
                method = "jw",
                max_dist = 0.1117,
                distance_col = "dist")
Thanks for your help!
I will use the following two datasets as an example:
df1 <- structure(list(V1 = c("abcNum1Num1Num1Num1", "abc1Num1Num1Num1Num",
"accArv", "accbrf"), V2 = c(1L, 4L, 5L, 2L)), class = "data.frame", row.names = c(NA,
-4L))
df2 <- structure(list(V1 = c("abcNun1Nun1Nun1Nun1", "abc1Nun1Nun1Nun1Nun",
"accArv", "accNun1Nun1Nun1Nun1"), V2 = c(2L, 5L, 4L, 1L)), class = "data.frame", row.names = c(NA,
-4L))
In both data frames, the variable V1 is the join-by field, and its first 3 characters must match exactly (in your case there are 8 such characters).
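To see why the exact part has to be split off rather than left inside the fuzzy key, here is a small sketch using the stringdist package (which fuzzyjoin relies on for its distances). The two keys are made up for illustration: they share nom and prenom but differ in the last character of the 8-character UAI.

```r
library(stringdist)

# Hypothetical keys: identical nom/prenom, but the 8-character UAI differs
# in its last character (A vs B).
a <- "0123456ADupontMarie"
b <- "0123456BDupontMarie"

d <- stringdist(a, b, method = "jw")
d            # Jaro distance computed on the whole key
d <= 0.1117  # TRUE: the pair is accepted although the UAI does not match
```

A single differing character in a long key barely moves the distance, so a threshold on the whole string cannot enforce an exact prefix.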
Now split column V1 so that its first 3 characters become a separate column:
library(fuzzyjoin)
library(tidyverse)
# Split V1 into V1A (first 3 characters, exact part) and V1B (the rest, fuzzy part)
df1 <- df1 %>%
  extract(V1, into = c("V1A", "V1B"), "(.{3})(.*)")
df2 <- df2 %>%
  extract(V1, into = c("V1A", "V1B"), "(.{3})(.*)")
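If the regular expression "(.{3})(.*)" is hard to read, the same split can be sketched in base R with substr(). The helper name split_prefix and the toy data frame are assumptions for illustration only; for the 8-character UAI in the question, n would be 8.

```r
# Hypothetical helper: split a key column into an exact-match prefix
# (first n characters) and a fuzzy-match remainder.
split_prefix <- function(df, col, n) {
  key <- df[[col]]
  df[[paste0(col, "A")]] <- substr(key, 1, n)               # exact part
  df[[paste0(col, "B")]] <- substr(key, n + 1, nchar(key))  # fuzzy part
  df[[col]] <- NULL                                         # drop the original column
  df
}

toy <- data.frame(V1 = c("abcNum1", "accArv"), V2 = c(1L, 5L),
                  stringsAsFactors = FALSE)
res <- split_prefix(toy, "V1", 3)
res
```

Unlike extract(), this appends the two new columns at the end of the data frame instead of placing them where V1 was.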
Finally, apply the fuzzy join and drop the rows where the three-character field differs between the two sides:
stringdist_join(df1, df2,
                by = "V1B",
                mode = "left",
                ignore_case = FALSE,
                method = "jw",
                max_dist = 0.5) %>%
  filter(V1A.x == V1A.y) %>%                      # keep only pairs whose prefixes match exactly
  unite("V1", c("V1A.x", "V1B.x"), sep = "") %>%  # rebuild the original key from df1
  select(V1, V2 = V2.x, V3 = V2.y)
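Applied to the question, the same idea might look like the sketch below. The two data frames are made-up stand-ins for Ech_final_nom and BSA_affect_nom, and the column names UAI and nomprenom as well as the helper split_uai are assumptions introduced here, not names from the original data.

```r
library(fuzzyjoin)
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for Ech_final_nom and BSA_affect_nom:
# an 8-character UAI followed by nom and prenom.
ech <- data.frame(UAInomprenom = c("0123456ADupontMarie", "0765432BMartinPaul"),
                  stringsAsFactors = FALSE)
bsa <- data.frame(UAInomprenom = c("0123456ADupondMarie", "0765432CMartinPaul"),
                  stringsAsFactors = FALSE)

# Split the key into the exact part (UAI) and the fuzzy part (nomprenom).
split_uai <- function(df) {
  tidyr::extract(df, UAInomprenom, into = c("UAI", "nomprenom"), "(.{8})(.*)")
}

res <- stringdist_join(split_uai(ech), split_uai(bsa),
                       by = "nomprenom",
                       mode = "left",
                       ignore_case = FALSE,
                       method = "jw",
                       max_dist = 0.1117,
                       distance_col = "dist") %>%
  filter(UAI.x == UAI.y)   # enforce the exact UAI match after the fuzzy join

res
```

Two caveats. The threshold 0.1117 was chosen for the full key; once the 8 identical characters are removed, the same typo yields a larger distance, so max_dist will probably need re-tuning. Also, filter() drops left rows that found no fuzzy match (their UAI.y is NA), so the result behaves like an inner join even though mode = "left".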