Combining values from Similar Strings in CSV File
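So I have a CSV file full of transactions, in which one column is the vendor name and another is the transaction amount. The goal is to find the top vendors based on total transactions. That part is easy, and I have code like this: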
import csv

with open('Transactions.csv') as Vendor_Data:
    file_reader = csv.reader(Vendor_Data, delimiter=',')
    vendor_dict = {}  # vendor name -> [transaction count, total amount]
    next(file_reader)  # skip the first row (assumed to be a header)
    for row in file_reader:
        if row[3] not in vendor_dict:
            vendor_dict[row[3]] = [0, 0]
        # count every transaction, including a vendor's first one
        vendor_dict[row[3]][0] += 1
        vendor_dict[row[3]][1] += round(float(row[1]), 2)
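The problem is that many entries spell the same vendor slightly differently ("Delta Airlines" vs. "Delta Air"). What is the best way to detect these similar names while iterating over the CSV file (for example, with Fuzzywuzzy) and combine their transaction counts and amounts?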
import csv
from fuzzywuzzy import fuzz

with open('Transactions.csv') as Vendor_Data:
    file_reader = csv.reader(Vendor_Data, delimiter=',')
    vendor_dict = {}
    next(file_reader)  # skipping a header?
    for row in file_reader:
        # we can't use the dictionary directly (e.g. "key in vendor_dict")
        # because we want to do a similarity search.
        csv_name = row[3]
        for vendor_name, vendor_values in vendor_dict.items():
            # this is *a* way to do it. You may want to use different scores
            # or even a different comparison
            if fuzz.token_set_ratio(csv_name, vendor_name) > 80:
                vendor_values[0] += 1
                vendor_values[1] += round(float(row[1]), 2)
                break
        else:
            # we didn't find anything similar enough, so create an entry
            # (for/else: this runs only when the loop ends without a break)
            vendor_dict[csv_name] = [1, round(float(row[1]), 2)]
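Read the CSV file into pandas, then add a new column holding the fuzzywuzzy percentage match.
Pick a threshold for the percentage above which two strings count as the same vendor, then filter with the isin() method and sum the transaction-amount column for the matching rows.
Repeat this across the whole DataFrame and you will get the result you want.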
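A rough sketch of how those steps might look follows. It is only an illustration under several assumptions: the column names `vendor` and `amount`, the helper names `similarity` and `merge_similar_vendors`, the sample rows, and the threshold of 75 are all made up for this example. To keep it runnable without extra installs, the scorer is the standard library's `difflib.SequenceMatcher` standing in for a fuzzywuzzy scorer; something like `fuzz.token_set_ratio` could be swapped in.

```python
from difflib import SequenceMatcher

import pandas as pd


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def merge_similar_vendors(df, threshold=75):
    """Greedily group rows whose vendor names score at or above threshold."""
    remaining = df
    results = []
    while not remaining.empty:
        # Use the first unprocessed name as the representative of its group.
        anchor = remaining["vendor"].iloc[0]
        # New column: percentage match of every remaining name to the anchor.
        scored = remaining.assign(
            score=remaining["vendor"].map(lambda name: similarity(anchor, name))
        )
        matched = scored.loc[scored["score"] >= threshold, "vendor"].unique()
        # Filter with isin() and add up the matching transactions.
        group = scored[scored["vendor"].isin(matched)]
        results.append({
            "vendor": anchor,
            "count": len(group),
            "total": round(group["amount"].sum(), 2),
        })
        # Drop the rows just merged and repeat on what is left.
        remaining = remaining[~remaining["vendor"].isin(matched)]
    return pd.DataFrame(results)


# Hypothetical sample data; a real run would start from pd.read_csv(...).
transactions = pd.DataFrame({
    "vendor": ["Delta Airlines", "Delta Air", "Hilton", "Delta Airlines"],
    "amount": [100.0, 50.5, 200.0, 25.25],
})
summary = merge_similar_vendors(transactions)
print(summary)
```

Note that this grouping is greedy: each group is built around whichever name appears first, so the result can depend on row order and on the threshold chosen. It is worth checking the merged names by hand before trusting the totals.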