Fuzzy match column headers for many data.frames with same number of columns?

Question

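I have ~6000 data.frames with the same number of columns, but with human-labelled headers, which means there are typos and the like, and occasionally an extra word (e.g. what should be address may appear as street_address).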

Note that all the column names are quite distinct (the closest are first_name and last_name); none of the other column names have any words in common.

Is there an established 'best practice' for matching column headers so that the data can be organised into a single data frame?

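My best guess so far is to simply match columns by the number of matching characters (e.g. street_address would probably match address correctly, since 7 characters match).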
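To make that guess concrete, here is a minimal sketch in Python using only the standard-library difflib module. The CANONICAL list and the helper names matching_chars and best_match are hypothetical, made up for illustration; the real canonical names would come from the data.

```python
from difflib import SequenceMatcher

# Hypothetical canonical header names; substitute the real ones.
CANONICAL = ["first_name", "last_name", "address"]

def matching_chars(a, b):
    """Count the characters inside the matching blocks SequenceMatcher finds."""
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    return sum(block.size for block in blocks)

def best_match(header, candidates=CANONICAL):
    """Return the candidate that shares the most characters with `header`."""
    return max(candidates, key=lambda name: matching_chars(header, name))

print(matching_chars("street_address", "address"))  # 7
print(best_match("street_address"))                 # address
```

A raw count can favour longer candidate names, so a length-normalised score such as SequenceMatcher(...).ratio() may be a safer thing to compare.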

Is there a better / more established / more reliable method?

Note: I can use R (preferably dplyr) or python (e.g. pandas) for this, as well as any other libraries in either language.

Answer 1

Here is one trick...

Sample data

df1 <- data.frame( first.name = c( "bobby", "carl" ),
                   last_name = c( "fisscher", "sagan") )
df2 <- data.frame( lst_name = c("ice","cream"),
                   frst_name = c("ben","jerry") )
df3 <- data.frame( first_nam = c("bert", "ernie"),
                   last_nam = c("elmo", "oscar"))


df1;df2;df3 

# first.name last_name
# 1      bobby  fisscher
# 2       carl     sagan

# lst_name frst_name
# 1      ice       ben
# 2    cream     jerry

# first_nam last_nam
# 1      bert     elmo
# 2     ernie    oscar

Code

library( stringdist )
library( data.table )

#add all data.frames to a list
L <- list(df1,df2,df3)

#reorder the df's, based on the stringdistance from 
#  the columnnames of df_n with those of df1
data.table::rbindlist(
  lapply( L, function(x) {
    #get stringdistance matrix of colnames
    temp <- stringdistmatrix( names(df1), names(x), useNames = TRUE )
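    #get the colname of x that matches the one of df1 closest
    colOrder <- colnames(x)[apply(temp,1,which.min)]
    #fail loudly if two names of df1 picked the same column of x
    stopifnot( !anyDuplicated(colOrder) )
    #reorder x accordingly
    x[, colOrder, drop = FALSE ]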
  }),
  #rowbind, ignoring the columnnames, the order is all that matters
  use.names = FALSE )

#    first.name last_name
# 1:      bobby  fisscher
# 2:       carl     sagan
# 3:        ben       ice
# 4:      jerry     cream
# 5:       bert      elmo
# 6:      ernie     oscar
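
Since the question also allows Python, the same idea can be sketched with the standard library only. This is an illustration under stated assumptions, not a port of the R code above: it swaps stringdist's distance for difflib's similarity ratio, models each table as a plain dict of column lists instead of a pandas DataFrame, and assigns headers greedily one-to-one so that no column is picked twice. The names similarity, map_headers and stack_tables are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Length-normalised similarity in [0, 1]; 1.0 means identical."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def map_headers(reference, headers):
    """Greedily pair each reference name with one distinct header."""
    if len(reference) != len(headers):
        raise ValueError("tables must have the same number of columns")
    # Score every (reference, header) pair, best scores first.
    pairs = sorted(
        ((similarity(r, h), r, h) for r in reference for h in headers),
        reverse=True,
    )
    mapping, used = {}, set()
    for _score, r, h in pairs:
        if r not in mapping and h not in used:
            mapping[r] = h
            used.add(h)
    return mapping

def stack_tables(tables):
    """Row-bind the tables under the headers of the first table."""
    reference = list(tables[0])
    combined = {name: [] for name in reference}
    for table in tables:
        mapping = map_headers(reference, list(table))
        for name in reference:
            combined[name].extend(table[mapping[name]])
    return combined

# Same sample data as above, as dicts of column lists.
df1 = {"first.name": ["bobby", "carl"], "last_name": ["fisscher", "sagan"]}
df2 = {"lst_name": ["ice", "cream"], "frst_name": ["ben", "jerry"]}
df3 = {"first_nam": ["bert", "ernie"], "last_nam": ["elmo", "oscar"]}

combined = stack_tables([df1, df2, df3])
print(combined)
```

With ~6000 tables it may be worth logging the lowest score in each mapping and reviewing those tables by hand, since a greedy assignment will always return some pairing even when the headers are a poor fit.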
