Matching instance records in a data set by specific value
I have a solution that works, but it performs poorly and takes some time to run. Let's start with what the initial two queries (both double joins) return.

The first set of data looks like this - let's call these line_items. As you can see, line_items has no dh_first_name key/value.
[
[
{
pb_id: "133599.0",
pbbname: "CUSTOMER",
opl_amount: "101.0",
ops_type: "P",
ops_stop_id: 269802,
ops_order_id: 133599,
ops_driver1: 11,
ops_delivered_time: null
},
{
pb_id: "133599.0",
pbbname: "CUSTOMER",
opl_amount: "11.62",
ops_type: "P",
ops_stop_id: 269802,
ops_order_id: 133599,
ops_driver1: 11,
ops_delivered_time: null
},
{
pb_id: "133590.0",
pbbname: "CUSTOMER",
opl_amount: "79.0",
ops_type: "P",
ops_stop_id: 269780,
ops_order_id: 133590,
ops_driver1: 104,
ops_delivered_time: null
},
{
pb_id: "133220.0",
pbbname: "CUSTOMER",
opl_amount: "625.0",
ops_type: "D",
ops_stop_id: 269011,
ops_order_id: 133220,
ops_driver1: 62,
ops_delivered_time: "2021-04-01T12:35:00.000-05:00"
},
{
pb_id: "133357.0",
pbbname: "CUSTOMER",
opl_amount: "550.0",
ops_type: "D",
ops_stop_id: 269290,
ops_order_id: 133357,
ops_driver1: 92,
ops_delivered_time: "2021-04-01T09:38:00.000-05:00"
},
{
pb_id: "133219.0",
pbbname: "CUSTOMER",
opl_amount: "1267.06",
ops_type: "P",
ops_stop_id: 269008,
ops_order_id: 133219,
ops_driver1: 43,
ops_delivered_time: null
},
{
pb_id: "133577.0",
pbbname: "CUSTOMER",
opl_amount: "150.0",
ops_type: "P",
ops_stop_id: 269754,
ops_order_id: 133577,
ops_driver1: 94,
ops_delivered_time: null
},
{
pb_id: "133503.0",
pbbname: "CUSTOMER",
opl_amount: "79.0",
ops_type: "P",
ops_stop_id: 269592,
ops_order_id: 133503,
ops_driver1: 104,
ops_delivered_time: null
},
{
pb_id: "133643.0",
pbbname: "HALLMARK CARDS BERMAN BLAKE",
opl_amount: "79.0",
ops_type: "P",
ops_stop_id: 269895,
ops_order_id: 133643,
ops_driver1: 104,
ops_delivered_time: null
}
]
]
Now let's look at the next set of data, from the second double join - the line_stops. It looks like this:
[
{
pb_id: "133633.0",
pbbname: "CUSTOMER",
pb_net_rev: "250.0",
ops_driver1: 59,
ops_stop_id: 269869,
dh_first_name: "FIRST",
dh_last_name: "LAST",
ops_delivered_time: "2021-04-02T13:07:00.000-05:00"
},
{
pb_id: "133127.0",
pbbname: "CUSTOMER",
pb_net_rev: "1147.0",
ops_driver1: 102,
ops_stop_id: 268801,
dh_first_name: "FIRST",
dh_last_name: "LAST",
ops_delivered_time: null
},
{
pb_id: "133144.0",
pbbname: "CUSTOMER",
pb_net_rev: "650.0",
ops_driver1: 71,
ops_stop_id: 268836,
dh_first_name: "FIRST",
dh_last_name: "LAST",
ops_delivered_time: "2021-04-01T14:38:00.000-05:00"
},
{
pb_id: "133144.0",
pbbname: "CUSTOMER",
pb_net_rev: "650.0",
ops_driver1: 71,
ops_stop_id: 268837,
dh_first_name: "FIRST",
dh_last_name: "LAST",
ops_delivered_time: null
},
{
pb_id: "133188.0",
pbbname: "CUSTOMER",
pb_net_rev: "700.0",
ops_driver1: 71,
ops_stop_id: 268924,
dh_first_name: "FIRST",
dh_last_name: "LAST",
ops_delivered_time: "2021-04-01T08:04:00.000-05:00"
},
]
What I'm currently doing is looping through both sets and matching them on these values: ops_stop_id, ops_driver1, and pb_id. If all three match, I need to build the records under a specific driver name, which can only come from the instances that have dh_first_name. When finished, the data structure looks like this:
{
FIRST LAST: [
{
pb_id: "133599.0",
pbbname: "CUSTOMER",
opl_amount: "101.0",
ops_type: "P",
ops_stop_id: 269802,
ops_order_id: 133599,
ops_driver1: 11,
ops_delivered_time: null
},
{
pb_id: "133599.0",
pbbname: "CUSTOMER",
opl_amount: "11.62",
ops_type: "P",
ops_stop_id: 269802,
ops_order_id: 133599,
ops_driver1: 11,
ops_delivered_time: null
},
{
pb_id: "133536.0",
pbbname: "CUSTOMER",
opl_amount: "45.0",
ops_type: "P",
ops_stop_id: 269665,
ops_order_id: 133536,
ops_driver1: 11,
ops_delivered_time: null
},
{
pb_id: "133536.0",
pbbname: "CUSTOMER",
opl_amount: "5.18",
ops_type: "P",
ops_stop_id: 269665,
ops_order_id: 133536,
ops_driver1: 11,
ops_delivered_time: null
},
{
pb_id: "133522.0",
pbbname: "CUSTOMER",
opl_amount: "150.0",
ops_type: "P",
ops_stop_id: 269637,
ops_order_id: 133522,
ops_driver1: 11,
ops_delivered_time: null
},
{
pb_id: "133619.0",
pbbname: "CUSTOMER",
pb_net_rev: "550.0",
ops_driver1: 11,
ops_stop_id: 269841,
dh_first_name: "FIRST",
dh_last_name: "LAST",
ops_delivered_time: "2021-04-02T11:41:00.000-05:00"
}
],
You can see a mix of the two kinds of records, organized correctly by the matching parameters.

This is how I'm currently solving the problem:
merger = {}
stops_arr = []

# Pair every line item with every stop and group the matches under the driver's name
line_items.each do |lines|
  line_stops.each do |stops|
    if lines.ops_stop_id == stops.ops_stop_id && lines.ops_driver1 == stops.ops_driver1 && lines.pb_id == stops.pb_id
      stops_arr.push(stops)
      (merger[stops.dh_first_name + ' ' + stops.dh_last_name] ||= []) << lines
    end
  end
end

# Any stop that never matched a line item still gets filed under its driver's name
line_stops.each do |stops|
  unless stops_arr.include?(stops)
    stops_arr.push(stops)
    (merger[stops.dh_first_name + ' ' + stops.dh_last_name] ||= []) << stops
  end
end
This is much too slow, and I think this line is the culprit:
(merger[stops.dh_first_name + ' ' + stops.dh_last_name] ||= []) << stops
That line is not the real cost - the nested loops are. The time complexity of your code is O(line_items.size * line_stops.size): with, say, 1,000 line items and 500 stops, the matching condition is evaluated 500,000 times.

Here is my proposal, which runs in roughly O(line_items.size + line_stops.size):
def merge_key(stops)
  stops.dh_first_name + ' ' + stops.dh_last_name
end

# Note that the hash_key built below may not be robust enough for all of your data
def hash_key(record)
  "#{record.ops_stop_id} #{record.ops_driver1} #{record.pb_id}"
end

merger = Hash.new { |hash, key| hash[key] = [] }
stops_hash = {}

# O(line_stops.size): index each stop by its match key and file it under the driver's name
line_stops.each do |stops|
  key = merge_key(stops)
  next if merger.key?(key) # since in your code you don't add duplicate stops, right?
  merger[key] << stops
  stops_hash[hash_key(stops)] = key
end

# O(line_items.size): look each line item up in the index instead of scanning all stops
line_items.each do |lines|
  if (key = stops_hash[hash_key(lines)])
    merger[key].unshift(lines) # since in your code, lines are added before stops, right?
  end
end
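As a sanity check, here is a minimal, self-contained sketch of how the proposal could be exercised. It fabricates two tiny records with OpenStruct so the fields respond as methods (your real query results may be hashes or model objects, in which case the accessors would differ), and the values are made up so the match keys line up; none of this sample data comes from the result sets above.

require 'ostruct'

# Hypothetical records built with OpenStruct so fields respond as methods,
# mirroring calls like stops.ops_stop_id above.
line_stops = [
  OpenStruct.new(pb_id: "133599.0", ops_driver1: 11, ops_stop_id: 269802,
                 dh_first_name: "FIRST", dh_last_name: "LAST", ops_delivered_time: nil)
]
line_items = [
  OpenStruct.new(pb_id: "133599.0", opl_amount: "101.0", ops_stop_id: 269802, ops_driver1: 11),
  OpenStruct.new(pb_id: "133599.0", opl_amount: "11.62", ops_stop_id: 269802, ops_driver1: 11)
]

# Same helper definitions as in the proposal above
def merge_key(stops)
  stops.dh_first_name + ' ' + stops.dh_last_name
end

def hash_key(record)
  "#{record.ops_stop_id} #{record.ops_driver1} #{record.pb_id}"
end

merger = Hash.new { |hash, key| hash[key] = [] }
stops_hash = {}

line_stops.each do |stops|
  key = merge_key(stops)
  next if merger.key?(key)
  merger[key] << stops
  stops_hash[hash_key(stops)] = key
end

line_items.each do |lines|
  if (key = stops_hash[hash_key(lines)])
    merger[key].unshift(lines)
  end
end

p merger.keys                  # => ["FIRST LAST"]
p merger["FIRST LAST"].length  # => 3 (the two line items, then the stop)

One thing to be aware of with this grouping: the bucket key is the driver's full name, so two different drivers who happen to share a name would end up merged into one bucket; keying on a driver id instead would avoid that if it matters for your data.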