How do I merge two CSVs with nearly identical sets using rules for which data is kept? (Using Ruby & FasterCSV)
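I have two CSV files, each with 13 columns.
The first column of each row contains a unique string. Some strings appear in both files; others exist in only one.
If a row exists in only one file, I want to keep it in the new file.
If it exists in both, I want to keep the one that has a certain value (or lacks a certain value) in a particular column of that row.
For example:
File 1: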
D600-DS-1991, name1, address1, date1
D601-DS-1991, name2, address2, date2
D601-DS-1992, name3, address3, date3
File 2:
D600-DS-1991, name1, address1, time1
D601-DS-1992, dave1, address2, date2
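I would keep the first row of the first file because its fourth column contains date rather than time.
I would keep the second row of the first file because the value in its first column is unique.
I would keep the second row of the second file as the third row of the new file because its second column contains text other than "name#".
Should I first map all the unique values to each other so that each file contains the same number of entries, even if some are blank or only hold filler data?
I only know a little Ruby and Python, but I'd prefer to solve this with a single Ruby file if possible, since I'll be able to understand the code better. If you can't do it in Ruby, please feel free to answer differently!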
I won't give you the complete code, but here is a general approach to this kind of problem:
require 'csv'

# list of csv files to read
files = ['a.csv', 'b.csv']

# used to resolve conflicts when we have an existing entry with the same id
# here, we prefer the new entry if its fourth column starts with 'date'
# this also means that the last file in the list above wins if both entries are valid.
def resolve_conflict(existing_entry, new_entry)
  # strip first: the sample data has a space after each comma
  if new_entry[3].to_s.strip.start_with?('date')
    new_entry
  else
    existing_entry
  end
end

# keep a hash of entries, with the unique id as key.
# we use this id to detect duplicate entries later on.
entries = {}

files.each do |file|
  CSV.foreach(file) do |new_entry|
    # get id (first column) from row
    id = new_entry[0]
    # see if we have a conflicting entry
    existing_entry = entries[id]
    if existing_entry.nil?
      # no conflict, just save the row
      entries[id] = new_entry
    else
      # resolve conflict and save that
      entries[id] = resolve_conflict(existing_entry, new_entry)
    end
  end
end

# now all conflicts are resolved
# note that stale rows from the first file could now be in the result
# you might want to filter them out as well

# we can now build a new csv file with the result
CSV.open("result.csv", "w") do |csv|
  entries.values.each do |row|
    csv << row
  end
end
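As a rough sketch of how the approach above might be extended to cover both rules from the question, the version below checks the fourth column first and falls back to the second column. The precedence between the two rules is my assumption (the question does not say which wins), and `parse_rows`, `pick_row` and `merge_rows` are hypothetical names. The sample rows are inlined as strings so it runs without any files.

```ruby
require 'csv'

# Parse CSV text into rows, stripping the space that follows each comma.
def parse_rows(text)
  CSV.parse(text).map { |row| row.map { |field| field.to_s.strip } }
end

# Assumed rule order, inferred from the question's example:
# 1. a row whose fourth column starts with "date" beats one that does not;
# 2. otherwise a row whose second column is not "name#" beats one that is;
# 3. otherwise the row that was seen first is kept.
def pick_row(existing, candidate)
  existing_dated  = existing[3].to_s.start_with?('date')
  candidate_dated = candidate[3].to_s.start_with?('date')
  return existing  if existing_dated && !candidate_dated
  return candidate if candidate_dated && !existing_dated

  existing_placeholder  = existing[1].to_s.match?(/\Aname\d*\z/)
  candidate_placeholder = candidate[1].to_s.match?(/\Aname\d*\z/)
  return candidate if existing_placeholder && !candidate_placeholder

  existing
end

# Merge any number of CSV texts, keyed on the first column.
def merge_rows(*sources)
  merged = {}
  sources.each do |text|
    parse_rows(text).each do |row|
      id = row[0]
      merged[id] = merged.key?(id) ? pick_row(merged[id], row) : row
    end
  end
  merged.values
end

file1 = <<~DATA
  D600-DS-1991, name1, address1, date1
  D601-DS-1991, name2, address2, date2
  D601-DS-1992, name3, address3, date3
DATA

file2 = <<~DATA
  D600-DS-1991, name1, address1, time1
  D601-DS-1992, dave1, address2, date2
DATA

merge_rows(file1, file2).each { |row| puts row.join(', ') }
```

With the sample data this should keep the date1 row for D600-DS-1991, the unique D601-DS-1991 row, and the dave1 row for D601-DS-1992. For real files, `File.read(path)` could be passed in place of the inline strings.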
I'm not entirely happy with my solution, but it works:
require 'csv'

def readcsv(filename)
  csv = {}
  CSV.foreach(filename) do |line|
    csv[line[0]] = { name: line[1], address: line[2], date: line[3] }
  end
  csv
end

csv1 = readcsv('orders1.csv')
csv2 = readcsv('orders2.csv')

results = {}
csv1.each do |id, val|
  unless csv2[id]
    results[id] = val # checks to see if it only exists in 1 file
    next
  end

  # see if name exists
  if (val[:name] =~ /name/) && (csv2[id]) && (csv2[id][:name] =~ /name/).nil?
    csv1.delete(id)
  end

  # missing some if statement regarding date vs. time
end

results = results.merge(csv2) # merge together whatever is remaining

CSV.open('newfile.csv', 'w') do |csv|
  results.each do |key, val|
    row = []
    csv << (row.push(key, val.values)).flatten
  end
end
Output of newfile.csv:
D601-DS-1991, name2, address2, date2
D600-DS-1991, name1, address1, time1
D601-DS-1992, dave1, address2, date2
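The D600-DS-1991 row above still carries time1, because the date vs. time check is the part the code leaves out. One possible way to fill that gap, keeping the same hash-of-hashes shape, is to pass a block to `Hash#merge` so that each conflict is decided by a small score. This is only a sketch: the weighting (a "date" value counts more than a real name) is my assumption, and `read_rows`, `score`, `merge_by_score` and `to_csv` are hypothetical helpers. It reads from strings rather than files so it can be run as-is.

```ruby
require 'csv'

# Same shape as readcsv, but reads from a string and strips the space
# that follows each comma in the sample data.
def read_rows(text)
  CSV.parse(text).each_with_object({}) do |line, rows|
    id, name, address, date = line.map { |field| field.to_s.strip }
    rows[id] = { name: name, address: address, date: date }
  end
end

# Assumed weighting: 2 points for a "date" value, 1 point for a name
# that is not the "name#" placeholder.
def score(row)
  points = 0
  points += 2 if row[:date].to_s.start_with?('date')
  points += 1 unless row[:name].to_s.match?(/\Aname\d*\z/)
  points
end

# Hash#merge calls the block only for ids present in both hashes;
# on a tie the row from the first file is kept.
def merge_by_score(first, second)
  first.merge(second) do |_id, from_first, from_second|
    score(from_second) > score(from_first) ? from_second : from_first
  end
end

# Turn the merged hash back into CSV text.
def to_csv(rows)
  CSV.generate do |csv|
    rows.each { |id, row| csv << [id, *row.values] }
  end
end

orders1 = read_rows(<<~DATA)
  D600-DS-1991, name1, address1, date1
  D601-DS-1991, name2, address2, date2
  D601-DS-1992, name3, address3, date3
DATA

orders2 = read_rows(<<~DATA)
  D600-DS-1991, name1, address1, time1
  D601-DS-1992, dave1, address2, date2
DATA

puts to_csv(merge_by_score(orders1, orders2))
```

With the sample rows this should keep date1 for D600-DS-1991 and dave1 for D601-DS-1992, which is what the question asks for. Rows that exist in only one file pass through untouched, so no padding of either file is needed beforehand.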