Fuzzy match column headers for many data.frames with same number of columns?
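I have ~6000 data.frames with the same number of columns, but with human-labelled headers. That means there are typos and the like, and occasionally an extra word (e.g. what should be address may appear as street_address).
Note that all the column names are quite distinct (the closest pair is first_name and last_name); none of the other column names share any words.
Is there an established 'best practice' for matching column headers so that everything can be organised into a single data frame?
My best guess so far is to simply match columns by the number of matching characters (e.g. street_address would probably match address correctly, since 7 characters match).
Is there a better / more established / more robust approach?
Note: I can use R (preferably dplyr) or Python (e.g. pandas) for this, as well as any other libraries in either language.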
Here is one trick...
Sample data
df1 <- data.frame( first.name = c( "bobby", "carl" ),
last_name = c( "fisscher", "sagan") )
df2 <- data.frame( lst_name = c("ice","cream"),
frst_name = c("ben","jerry") )
df3 <- data.frame( first_nam = c("bert", "ernie"),
last_nam = c("elmo", "oscar"))
df1;df2;df3
# first.name last_name
# 1 bobby fisscher
# 2 carl sagan
# lst_name frst_name
# 1 ice ben
# 2 cream jerry
# first_nam last_nam
# 1 bert elmo
# 2 ernie oscar
Code
library( stringdist )
library( data.table )
# add all data.frames to a list
L <- list(df1, df2, df3)
# reorder the columns of each data.frame, based on the string distance
# between its column names and those of df1
data.table::rbindlist(
  lapply(L, function(x) {
    # string distance matrix: rows = names(df1), columns = names(x)
    temp <- stringdistmatrix(names(df1), names(x), useNames = TRUE)
    # for each column name of df1, the closest column name of x
    colOrder <- colnames(x)[apply(temp, 1, which.min)]
    # reorder x accordingly
    x[, colOrder]
  }),
  # row-bind by position, ignoring the column names; only the order matters
  use.names = FALSE)
# first.name last_name
# 1: bobby fisscher
# 2: carl sagan
# 3: ben ice
# 4: jerry cream
# 5: bert elmo
# 6: ernie oscar
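One caveat with picking the row-wise minimum: nothing stops two reference names from selecting the same column when headers are messier than in the sample. A minimal sketch of a guard against that is below, assuming a greedy one-to-one assignment is good enough for your data. The helpers match_columns() and harmonise() are hypothetical names made up for this illustration, not part of stringdist, and the sample frames are repeated only to keep the block self-contained.

```r
library(stringdist)

# Hypothetical helper (not a stringdist function): greedy one-to-one matching.
# Repeatedly takes the smallest remaining distance and retires that
# reference name and that candidate name, so no column is used twice.
match_columns <- function(ref, cand, method = "osa") {
  stopifnot(length(ref) == length(cand))
  d <- stringdistmatrix(ref, cand, method = method)
  out <- character(length(ref))
  for (k in seq_along(ref)) {
    # position of the smallest remaining distance (first one on ties)
    idx <- which(d == min(d, na.rm = TRUE), arr.ind = TRUE)[1, ]
    out[idx["row"]] <- cand[idx["col"]]
    # retire the matched row and column
    d[idx["row"], ] <- NA
    d[, idx["col"]] <- NA
  }
  setNames(out, ref)
}

# Hypothetical helper: reorder the columns of x to follow ref, then rename them
harmonise <- function(x, ref) {
  m <- match_columns(ref, names(x))
  setNames(x[, unname(m), drop = FALSE], ref)
}

# same sample data as above
df1 <- data.frame(first.name = c("bobby", "carl"),
                  last_name  = c("fisscher", "sagan"))
df2 <- data.frame(lst_name   = c("ice", "cream"),
                  frst_name  = c("ben", "jerry"))
df3 <- data.frame(first_nam  = c("bert", "ernie"),
                  last_nam   = c("elmo", "oscar"))

ref <- names(df1)
combined <- do.call(rbind, lapply(list(df1, df2, df3), harmonise, ref = ref))
combined
```

The greedy pass is not guaranteed to find the globally best assignment; with many similar headers an optimal assignment solver may be worth a look. The method argument is passed straight to stringdistmatrix(), so other distances such as "jw" can be tried for headers with extra words like street_address.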