Fuzzy compare and aggregate similar records within a single-column data frame
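The current requirement is to aggregate a single column and provide a count for each distinct row. I have run into a few problems that I need help with:
- Many rows are similar but not identical, because they carry extra information such as parameters or error codes.
- The data being processed is unpredictable, so the aggregation needs some flexibility in how it matches.
- There is no way to know where the differences will occur or what pattern the strings follow.
- There is no way to know in advance which messages or values make some rows differ.
So I would like to do the following: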
TextData
Message : @p_id is not valid
Message : @p_id is not valid
Message : @p_id is not valid
Message : @p_id is not valid
Message : ID record does not exist: @p_Id=11933
Message : ID record does not exist: @p_Id=21944
Message : ID record does not exist: @p_Id=31933
Message : ID record does not exist: @p_Id=41931
Message : ID record does not exist: @p_Id=51993
The duplicate key value is (129).
The duplicate key value is (129).
The duplicate key value is (135).
Match and count them into something like this:
Count TextData Values
4 Message : @p_id is not valid
5 Message : ID record does not exist: @p_Id= 11933,21944,31933,41931,51993
3 The duplicate key value is (). 129,135
If that is not possible, then at least get to this:
Count TextData
4 Message : @p_id is not valid
5 Message : ID record does not exist: @p_Id=
3 The duplicate key value is ().
I have searched for hours for a solution to something similar, but have not found an example that works or fits my situation.
data.table solution
library( data.table )
library( stringr )
#read data
dt <- fread(
"Message : @p_id is not valid
Message : @p_id is not valid
Message : @p_id is not valid
Message : @p_id is not valid
Message : ID record does not exist: @p_Id=11933
Message : ID record does not exist: @p_Id=21944
Message : ID record does not exist: @p_Id=31933
Message : ID record does not exist: @p_Id=41931
Message : ID record does not exist: @p_Id=51993
The duplicate key value is (129).
The duplicate key value is (129).
The duplicate key value is (135).", header = FALSE, sep = "")
#extract the first run of digits from each row (NA if there is none),
#then strip every run of digits so that similar rows collapse to one template
dt[, `:=`( id = stringr::str_extract( V1, "\\d+" ),
           V1 = gsub( "\\d+", "", V1 ) ) ]
#summarise
dt[, list( Count = .N, values = toString( unique( id ) ) ), by = V1][]
# V1 Count values
# 1: Message : @p_id is not valid 4 NA
# 2: Message : ID record does not exist: @p_Id= 5 11933, 21944, 31933, 41931, 51993
# 3: The duplicate key value is (). 3 129, 135
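The approach above only works when the varying part is numeric. If the differences can be arbitrary text, one possible alternative is to cluster the rows by edit distance. The following is a minimal sketch in base R, not a tested general solution: it assumes that rows of the same kind differ in only a small fraction of their characters, and the cut-off of 0.2 and the names `norm_dist`, `cluster` and `summary_df` are illustrative choices of mine rather than anything from the question.

```r
# Sample rows from the question
x <- c(
  rep("Message : @p_id is not valid", 4),
  paste0("Message : ID record does not exist: @p_Id=",
         c("11933", "21944", "31933", "41931", "51993")),
  "The duplicate key value is (129).",
  "The duplicate key value is (129).",
  "The duplicate key value is (135)."
)

# Pairwise Levenshtein distances (utils::adist), scaled by the longer
# string of each pair so that 0 means identical and 1 means unrelated
n <- nchar(x)
norm_dist <- adist(x) / outer(n, n, pmax)

# Single-linkage hierarchical clustering, cut at an assumed threshold:
# rows closer than 20% of their length end up in the same group
hc <- hclust(as.dist(norm_dist), method = "single")
cluster <- cutree(hc, h = 0.2)

# One row per cluster: how many rows it holds and its first member
summary_df <- data.frame(
  Count   = as.vector(table(cluster)),
  Example = x[!duplicated(cluster)],
  stringsAsFactors = FALSE
)
print(summary_df)
```

The threshold is the main thing to tune: a lower value splits groups more aggressively, a higher one merges messages that merely look alike. Since `adist()` compares every pair of rows, this is only practical for a few thousand distinct messages; deduplicating with `unique()` first helps.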