Approximate text matching and updating at the same time

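I have a data frame, df1, with a column University_name that contains university names; it has 500000 rows. I have another data frame, df2, with 2 columns, university_name and university_aliases, and 150 rows. I want to match each university alias in the university_aliases column against the university names present in university_name_new.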

Sample of df1$university_name

university of auckland
the university of auckland
university of warwick - warwick business school
unv of warwick
seneca college of applied arts and technology
seneca college
univ of auckland

Sample of df2

University_Alias                  University_Name_new

univ of auckland                  university of auckland
universiry of auckland            university of auckland
auckland university               university of auckland
university of auckland            university of auckland
warwick university                university of warwick
warwick univercity                university of warwick
university of warwick             university of warwick
seneca college                    seneca college
unv of warwick                    university of warwick

I expect output like this:

university of auckland
university of auckland
university of warwick
seneca college
seneca college

I am using the following code, but it does not work:

 df$university_name[ grepl(df$university_name,df2$university_alias)] <- df2$university_name_new

You can do this:

df2$University_Name_new[which(is.element(df2$University_Alias, df1$university_name))]
### which returns the following ####
[1] "university of auckland" "seneca college" 

For example, in the data you provided, "the university of auckland" is in df1$university_name but not in df2$University_Alias, which is why we get the following:

> which(is.element(df2$University_Alias, df1$university_name))
[1] 4 8

Indeed, of the values in df1$university_name, only "university of auckland" and "seneca college" appear in df2$University_Alias.
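The reverse check shows which names in df1 have no exact alias and therefore need some other kind of matching. The following is a minimal sketch, assuming the five-row df1 sample and the column names used in this answer; the data frames are rebuilt here only so the snippet runs on its own.

```r
# Sample data copied from the question (five df1 rows, nine df2 aliases)
df1 <- data.frame(university_name = c("university of auckland",
                                      "the university of auckland",
                                      "university of warwick - warwick business school",
                                      "seneca college of applied arts and technology",
                                      "seneca college"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  University_Alias = c("univ of auckland", "universiry of auckland",
                       "auckland university", "university of auckland",
                       "warwick university", "warwick univercity",
                       "university of warwick", "seneca college",
                       "unv of warwick"),
  stringsAsFactors = FALSE)

# Names in df1 that do not occur verbatim among the aliases
unmatched <- setdiff(df1$university_name, df2$University_Alias)
print(unmatched)
```

Any name listed in unmatched cannot be resolved by an exact comparison such as is.element or %in%.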

You can use sapply and str_extract to get the desired result.

# str_extract() comes from the stringr package
library(stringr)

# create sample data
df1 <- data.frame(university_name = c('university of auckland',
                                      'the university of auckland',
                                      'university of warwick - warwick business school',
                                      'seneca college of applied arts and technology',
                                      'seneca college'), stringsAsFactors = FALSE)

# these are values to match (from df2)
vals <- c('university of auckland','university of warwick','seneca college')

# get the output: for each name, keep the value from vals that occurs in it
df1$output <- sapply(df1$university_name, function(z) {
    # str_extract returns NA for every pattern that does not occur in z
    vals[!is.na(str_extract(string = z, pattern = vals))]
}, USE.NAMES = FALSE)

print(df1)

                                  university_name                 output
1                          university of auckland university of auckland
2                      the university of auckland university of auckland
3 university of warwick - warwick business school  university of warwick
4   seneca college of applied arts and technology         seneca college
5                                  seneca college         seneca college
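Substring matching will not catch misspelled names. For those, an edit-distance comparison is one possible complement. The sketch below is not part of the answer above: it swaps in base R's adist (Levenshtein distance) and picks the nearest alias for each name. The cut-off max_dist is an assumed value chosen for illustration and would need tuning on the real data.

```r
# Sample data copied from the question
df1 <- data.frame(university_name = c("university of auckland",
                                      "the university of auckland",
                                      "university of warwick - warwick business school",
                                      "unv of warwick",
                                      "seneca college"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  university_alias    = c("univ of auckland", "universiry of auckland",
                          "auckland university", "university of auckland",
                          "warwick university", "warwick univercity",
                          "university of warwick", "seneca college",
                          "unv of warwick"),
  university_name_new = c("university of auckland", "university of auckland",
                          "university of auckland", "university of auckland",
                          "university of warwick", "university of warwick",
                          "university of warwick", "seneca college",
                          "university of warwick"),
  stringsAsFactors = FALSE)

# Edit-distance matrix: one row per df1 name, one column per alias
d <- utils::adist(df1$university_name, df2$university_alias)

# Column index of the closest alias for each name, and its distance
nearest <- apply(d, 1, which.min)
best    <- d[cbind(seq_len(nrow(d)), nearest)]

# max_dist is an assumed threshold: names further away than this stay unmatched
max_dist <- 5
df1$output <- ifelse(best <= max_dist,
                     df2$university_name_new[nearest],
                     NA_character_)
print(df1)
```

An absolute threshold treats short and long names alike, so long names with a suffix (such as the Warwick Business School row) stay NA here and are better handled by the substring approach above. Note also that a 500000 by 150 distance matrix is sizeable, so processing df1 in chunks, or only its unique names, may be worth considering.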

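Update:

As I understand it, df2 already contains a one-to-one mapping from university_alias to university_name_new, so the problem comes down to checking whether each university_alias appears in df1; aliases that do not appear are removed.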

# check values for university_alias in university_name
maps2 <- as.character(df2$university_alias[which(df2$university_alias %in% df1$university_name)])

# remove unmatched rows from df2
df3 <- df2[df2$university_alias %in% maps2,]

print(df3)
            university_alias    university_name_new
1           univ of auckland university of auckland
4     university of auckland university of auckland
8             seneca college         seneca college
9             unv of warwick  university of warwick
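Once the mapping is reduced to the aliases that actually occur, one way to write the canonical names back into df1 is a lookup with match. This is a minimal sketch, assuming the column names from this update and a small df1 built from the question's sample; df3 is retyped from the printed result so the snippet runs on its own.

```r
# Sample names from the question
df1 <- data.frame(university_name = c("university of auckland",
                                      "the university of auckland",
                                      "unv of warwick",
                                      "seneca college"),
                  stringsAsFactors = FALSE)

# The reduced mapping, as printed above
df3 <- data.frame(
  university_alias    = c("univ of auckland", "university of auckland",
                          "seneca college", "unv of warwick"),
  university_name_new = c("university of auckland", "university of auckland",
                          "seneca college", "university of warwick"),
  stringsAsFactors = FALSE)

# Position of each df1 name in the alias column (NA when there is no exact match)
idx <- match(df1$university_name, df3$university_alias)

# Overwrite matched rows only; unmatched names are left as they were
hit <- !is.na(idx)
df1$university_name[hit] <- df3$university_name_new[idx[hit]]
print(df1)
```

Because match is vectorised, this avoids looping over the rows of df1, and rows without an exact alias (here "the university of auckland") keep their original value.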