Approximate text matching and updating at the same time
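I have a data frame, df1, with a column University_name that holds university names; it has 500,000 rows. I have another data frame, df2, with two columns, university_name and university_aliases, and 150 rows. I want to match each alias in the university_aliases column against the university names in university_name_new.
Sample of df1$university_name: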
university of auckland
the university of auckland
university of warwick - warwick business school
unv of warwick
seneca college of applied arts and technology
seneca college
univ of auckland
Sample of df2:
University_Alias          University_Name_new
univ of auckland          university of auckland
universiry of auckland    university of auckland
auckland university       university of auckland
university of auckland    university of auckland
warwick university        university of warwick
warwick univercity        university of warwick
university of warwick     university of warwick
seneca college            seneca college
unv of warwick            university of warwick
I am expecting output like this:
university of auckland
university of auckland
university of warwick
seneca college
seneca college
I am using the following code, but it does not work:
df$university_name[ grepl(df$university_name,df2$university_alias)] <- df2$university_name_new
You can do it this way:
df2$University_Name_new[which(is.element(df2$University_Alias, df1$university_name))]
### which returns the following ####
[1] "university of auckland" "seneca college"
Note that, in the data you provided, `the university of auckland` is in df1$university_name but not in df2$University_Alias, which is why we get the following:
> which(is.element(df2$University_Alias, df1$university_name))
[1] 4 8
Indeed, of the values in df1$university_name, only `university of auckland` and `seneca college` appear in df2$University_Alias.
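The lookup above only lists the matched names; it does not write them back into df1. A minimal sketch of the update step, using base R's `match()` on the sample data from the question. The output column name `university_name_clean` is hypothetical, and keeping the original value when no alias matches is an assumption, not something the question specifies.

```r
# Sample data taken from the question
df1 <- data.frame(
  university_name = c("university of auckland",
                      "the university of auckland",
                      "university of warwick - warwick business school",
                      "unv of warwick",
                      "seneca college of applied arts and technology",
                      "seneca college",
                      "univ of auckland"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  University_Alias = c("univ of auckland", "universiry of auckland",
                       "auckland university", "university of auckland",
                       "warwick university", "warwick univercity",
                       "university of warwick", "seneca college",
                       "unv of warwick"),
  University_Name_new = c(rep("university of auckland", 4),
                          rep("university of warwick", 3),
                          "seneca college", "university of warwick"),
  stringsAsFactors = FALSE
)

# Position of each df1 name in the alias column (NA when it is not an alias)
idx <- match(df1$university_name, df2$University_Alias)

# Replace matched names; keep the original where no alias matches (assumption)
df1$university_name_clean <- ifelse(is.na(idx),
                                    df1$university_name,
                                    df2$University_Name_new[idx])
```

Because `match()` is a single vectorised lookup, this should scale to the 500,000 rows mentioned in the question, but it only handles names that equal an alias exactly.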
You can use `sapply` and `str_extract` (from the stringr package) to get the desired result.
library(stringr)

# create sample data
df1 <- data.frame(university_name = c('university of auckland',
                                      'the university of auckland',
                                      'university of warwick - warwick business school',
                                      'seneca college of applied arts and technology',
                                      'seneca college'), stringsAsFactors = FALSE)
# these are the values to match (from df2)
vals <- c('university of auckland', 'university of warwick', 'seneca college')
# for each name, keep the value from vals that occurs inside it
df1$output <- sapply(df1$university_name, function(z) {
  vals[!is.na(str_extract(string = z, pattern = vals))]
}, USE.NAMES = FALSE)
print(df1)
                                  university_name                 output
1                          university of auckland university of auckland
2                      the university of auckland university of auckland
3 university of warwick - warwick business school  university of warwick
4   seneca college of applied arts and technology         seneca college
5                                  seneca college         seneca college
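One caveat with the `sapply` approach: when a name contains none of the values (for example `unv of warwick`), or more than one, the per-row result no longer has length one and `sapply` returns a list instead of a character vector. A minimal base R sketch that always returns one string per name; the helper `first_contained` is hypothetical, and taking the first hit and using `NA` for no hit are assumptions.

```r
vals <- c("university of auckland", "university of warwick", "seneca college")

# Return the first value of `vals` contained in `name`, or NA if none is.
# fixed = TRUE treats each value as a literal string, not a regex.
first_contained <- function(name, vals) {
  hit <- vapply(vals, function(v) grepl(v, name, fixed = TRUE), logical(1))
  if (any(hit)) vals[which(hit)[1]] else NA_character_
}

names_in <- c("the university of auckland",
              "university of warwick - warwick business school",
              "unv of warwick")

# vapply guarantees a character vector with one entry per input name
out <- vapply(names_in, first_contained, character(1), vals = vals,
              USE.NAMES = FALSE)
```

Here the third name gets `NA`, which makes the rows that still need an alias entry in df2 easy to find with `is.na(out)`.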
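Update:
As I understand it, df2 already holds a one-to-one mapping from university_alias to university_name_new, so the problem boils down to checking whether each university_alias appears in df1; if it does not, we remove that row from df2.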
# check values for university_alias in university_name
maps2 <- as.character(df2$university_alias[which(df2$university_alias %in% df1$university_name)])
# remove unmatched rows from df2
df3 <- df2[df2$university_alias %in% maps2,]
print(df3)
        university_alias    university_name_new
1       univ of auckland university of auckland
4 university of auckland university of auckland
8         seneca college         seneca college
9         unv of warwick   university of warwick
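Exact lookups like this miss names that are close to an alias but not identical, such as `the university of auckland`. Since the question asks for approximate matching, one possible extension (not part of the answer above) is to pick the nearest alias by edit distance with base R's `adist()`. This is only a sketch: the threshold `max_dist` is a hypothetical value that would need tuning on real data, and the vectors below restate the sample df2 from the question.

```r
aliases <- c("univ of auckland", "universiry of auckland",
             "auckland university", "university of auckland",
             "warwick university", "warwick univercity",
             "university of warwick", "seneca college", "unv of warwick")
canonical <- c(rep("university of auckland", 4),
               rep("university of warwick", 3),
               "seneca college", "university of warwick")

names_in <- c("the university of auckland",
              "unv of warwick",
              "university of warwick - warwick business school")

max_dist <- 4  # hypothetical cut-off: the largest edit distance we accept

# Edit-distance matrix: one row per input name, one column per alias
d <- utils::adist(names_in, aliases)

# Index of the closest alias for each name, and whether it is close enough
nearest <- apply(d, 1, which.min)
ok <- d[cbind(seq_along(names_in), nearest)] <= max_dist

# Canonical name of the closest alias, or NA when nothing is close enough
fuzzy <- ifelse(ok, canonical[nearest], NA_character_)
```

Long names with a suffix, like the business school entry, stay `NA` under this threshold and are better handled by a substring check. With 500,000 rows it would probably be worth computing the distances on `unique(df1$university_name)` and mapping the result back.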