Identify most common combinations of values in 2 or more columns in a CSV file
How can I find the most common combinations of values across 2 or more columns of the rows in a CSV file? Example:
event,rack,role,dc
network,north,mobile,africa
network,east,mobile,asia
oom,south,desktop,europe
cpu,east,web,northamerica
oom,north,mobile,europe
cpu,south,web,northamerica
cpu,west,web,northamerica
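I tried building lists for some of the possible combinations I am interested in and then using the most_common() method of Counter to find the common patterns. But I need an algorithm that finds the common records for any possible combination of 2 or more columns.
My code so far: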
import csv
from collections import Counter
class Alert:
    def __init__(self, event, rack, role, dc):
        self.event = event
        self.rack = rack
        self.role = role
        self.dc = dc
    def __str__(self):
        return ",".join([self.event, self.rack, self.role, self.dc])
# Read every CSV row into an Alert object
alerts = []
with open('data.csv', mode='r') as csv_file:
    csv_reader = csv.DictReader(csv_file)
    for row in csv_reader:
        alert = Alert(row['event'], row['rack'], row['role'], row['dc'])
        alerts.append(alert)
# One list per hand-picked column combination
dcevent = []
dceventrole = []
dcrole = []
dcrolerack = []
for alert in alerts:
    dcevent.append(alert.dc + '-' + alert.event)
    dceventrole.append(alert.dc + '-' + alert.event + '-' + alert.role)
    dcrole.append(alert.dc + '-' + alert.role)
    dcrolerack.append(alert.dc + '-' + alert.role + '-' + alert.rack)
# Merge the per-combination counts and print them, most frequent first
masterlist = Counter(dcevent).most_common() + Counter(dceventrole).most_common() + Counter(dcrole).most_common() + Counter(dcrolerack).most_common()
for item in sorted(masterlist, key=lambda x: x[1], reverse=True):
    print(item)
This is the output for the records above:
('northamerica-web-cpu', 3) # there are 3 rows matching the values northamerica,web and cpu
('northamerica-web', 3) # there are 3 rows matching just the values northamerica and web
('northamerica-cpu', 3) # there are 3 rows matching northamerica and cpu
('europe-oom', 2) # there are 2 rows matching europe and oom
('africa-mobile-network', 1)
('asia-mobile-network', 1)
('europe-desktop-oom', 1)
('europe-mobile-oom', 1)
('africa-mobile-north', 1)
('asia-mobile-east', 1)
('europe-desktop-south', 1)
('northamerica-web-east', 1)
('europe-mobile-north', 1)
('northamerica-web-south', 1)
('northamerica-web-west', 1)
('africa-mobile', 1)
('asia-mobile', 1)
('europe-desktop', 1)
('europe-mobile', 1)
('africa-network', 1)
('asia-network', 1)
Let me start by defining the data structure in place, since reading the CSV is orthogonal to the actual problem:
lines = [line.split(',') for line in """\
event,rack,role,dc
network,north,mobile,africa
network,east,mobile,asia
oom,south,desktop,europe
cpu,east,web,northamerica
oom,north,mobile,europe
cpu,south,web,northamerica
cpu,west,web,northamerica
""".splitlines()]
for line in lines:
    print(line)
This prints:
['event', 'rack', 'role', 'dc']
['network', 'north', 'mobile', 'africa']
['network', 'east', 'mobile', 'asia']
['oom', 'south', 'desktop', 'europe']
['cpu', 'east', 'web', 'northamerica']
['oom', 'north', 'mobile', 'europe']
['cpu', 'south', 'web', 'northamerica']
['cpu', 'west', 'web', 'northamerica']
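Now let's create all possible combinations of 2 or more words from each line. There are 11 ways of choosing 2, 3 or 4 items out of 4 (4C2 + 4C3 + 4C4 == 6 + 4 + 1 == 11).
The algorithm I use to find the combinations looks at the binary numbers with 4 digits (i.e. 0000, 0001, 0010, 0011, 0100, etc.) and, for each such number, builds a combination of words depending on whether the corresponding binary digit is 1. For example, for 0101 the second and fourth words are chosen: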
def find_combinations(line):
    combinations = []
    for i in range(2**len(line)):
        bits = bin(i)[2:].zfill(len(line))
        if bits.count('1') < 2:  # skip numbers with less than two 1-bits
            continue
        combination = set()
        for bit, word in zip(bits, line):
            if bit == '1':
                combination.add(word)
        combinations.append('-'.join(sorted(combination)))
    return combinations
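As an alternative sketch (not the approach used in this answer), the same subsets can be enumerated with itertools.combinations from the standard library instead of bit strings. The helper name find_combinations_itertools and its min_size parameter are hypothetical, chosen only for illustration. The words are sorted before joining so the labels match those of the set-based version above when all values in a row are distinct.

```python
from itertools import combinations

def find_combinations_itertools(line, min_size=2):
    """Return every combination of at least min_size words from line."""
    result = []
    for size in range(min_size, len(line) + 1):
        for combo in combinations(line, size):
            # Sort the words so the label does not depend on column order.
            result.append('-'.join(sorted(combo)))
    return result

row = ['cpu', 'east', 'web', 'northamerica']
combos = find_combinations_itertools(row)
print(len(combos))  # 4C2 + 4C3 + 4C4 == 11
```

This version can be dropped into the counting loop in place of find_combinations.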
Now we can iterate over all combinations and count their frequencies:
from collections import defaultdict
counter = defaultdict(int)
for line in lines:
    for c in find_combinations(line):
        counter[c] += 1
Finally we can sort by frequency (in descending order):
for combination_freq in sorted(counter.items(), key=lambda item: item[1], reverse=True):
    print(combination_freq)
This gives:
('cpu-northamerica', 3)
('northamerica-web', 3)
('cpu-northamerica-web', 3)
('cpu-web', 3)
('mobile-north', 2)
('mobile-network', 2)
('europe-oom', 2)
('east-network', 1)
('asia-east-mobile', 1)
('asia-east-network', 1)
('cpu-south-web', 1)
('east-northamerica-web', 1)
('europe-north', 1)
('cpu-east', 1)
...etc.
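One caveat with joining bare values is that the same word appearing in two different columns would be counted together, and the header row is counted like any other line. A minimal sketch of a column-aware variant follows, assuming the file has a header row; it keys each combination on (column, value) pairs and uses Counter.most_common() as in the question. The function name count_column_combinations and the inline SAMPLE string are hypothetical stand-ins for reading data.csv.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Inline copy of the sample data; a real file object from open() works too.
SAMPLE = """\
event,rack,role,dc
network,north,mobile,africa
network,east,mobile,asia
oom,south,desktop,europe
cpu,east,web,northamerica
oom,north,mobile,europe
cpu,south,web,northamerica
cpu,west,web,northamerica
"""

def count_column_combinations(csv_file, min_size=2):
    """Count value combinations for every set of min_size or more columns."""
    counter = Counter()
    for row in csv.DictReader(csv_file):
        # (column, value) pairs, sorted by column name for a stable key.
        items = sorted(row.items())
        for size in range(min_size, len(items) + 1):
            for combo in combinations(items, size):
                counter[combo] += 1
    return counter

counts = count_column_combinations(io.StringIO(SAMPLE))
for combo, freq in counts.most_common(5):
    print(freq, ', '.join('{}={}'.format(col, val) for col, val in combo))
```

Because each key carries its column names, a rack called east and a hypothetical dc called east would be tallied separately.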