如何反向索引 Counter 并将其转换为 defaultdict? Python
How to reverse index a Counter and convert it into a defaultdict? Python
给定一个计数器,例如:
>>> from collections import Counter
>>> Counter('123112415121361273')
Counter({'1': 7, '2': 4, '3': 3, '5': 1, '4': 1, '7': 1, '6': 1})
如何反转索引并将计数作为键并将值作为原始字符串键的列表?
目的是将上面的例子转换成这样:
defaultdict(<type 'list'>, {1: ['5', '4', '7', '6'], 3: ['3'], 4: ['2'], 7: ['1']})
我试过手动重复 Counter
:
>>> from collections import Counter
>>> Counter('123112415121361273')
Counter({'1': 7, '2': 4, '3': 3, '5': 1, '4': 1, '7': 1, '6': 1})
>>> x = Counter('123112415121361273')
>>> from collections import Counter, defaultdict
>>> y = defaultdict(list)
>>> for s, count in x.items():
... y[count].append(s)
...
>>> y
defaultdict(<type 'list'>, {1: ['5', '4', '7', '6'], 3: ['3'], 4: ['2'], 7: ['1']})
但是还有其他方法吗?
由于输入是字符串 '123112415121361273'
并且输出应该是由计数索引的字典,有什么方法可以避免第一次迭代时的计数步骤并得到生成的 defaultdict?
不,没有更有效的方法。
计数最好用映射完成,这正是 Counter
所做的。由于在完全遍历字符串之前您不知道任何字符的最终计数,因此在完成计数之前您无法预先知道将字符归档到哪个桶中。
因此 效率低 的替代方法是从计数到字符的映射开始,然后将字符向上移动到下一个存储桶,因为您发现它们已经有了计数。发现他们已经有一个计数需要你对每个桶进行测试,这样就变成了一个 O(NK) 解决方案,而不是 Counter
给你的直接线性 O(N) 解决方案。
## Warning: this is not an efficient approach; use for illustration purposes only
from collections import defaultdict
s = '123112415121361273'
count_to_char = defaultdict(set) # use a set to avoid O(N**2) performance
max_count = 0
for char in s: # loop over N items
for i in range(1, max_count + 1): # loop over up to K buckets
if char in count_to_char[i]:
count_to_char[i].remove(char)
count_to_char[i + 1].add(char)
break
else:
i = 0
count_to_char[1].add(char)
max_count = max(i + 1, max_count)
# remove empty buckets again
for count in [c for c, b in count_to_char.items() if not b]:
del count_to_char[count]
# alternative method to clear empty buckets, producing a regular dict
# count_to_char = {c: b for c, b in count_to_char.items() if b}
避免扫描 K 个桶的方法是使用您已经使用的计数器。
from timeit import timeit
from random import choice
from collections import Counter, defaultdict
from string import printable
def str_count(input_num, defaultdict=defaultdict):
d = defaultdict(list)
for count, s in map(lambda x: (input_num.count(x), x), set(input_num)):
d[count].append(s)
return d
def counter(input_num, defaultdict=defaultdict, Counter=Counter):
x = Counter(input_num)
y = defaultdict(list)
for s, count in x.items():
y[count].append(s)
return y
def pieters_default_dict(input_num, defaultdict=defaultdict):
x = defaultdict(int)
for c in input_num:
x[c] += 1
y = defaultdict(list)
for s, count in x.items():
y[count].append(s)
return y
def pieters_buckets(input_num, defaultdict=defaultdict):
## Warning: this is not an efficient approach; use for illustration purposes only
count_to_char = defaultdict(set) # use a set to avoid O(N**2) performance
max_count = 0
for char in input_num: # loop over N items
for i in range(1, max_count + 1): # loop over up to K buckets
if char in count_to_char[i]:
count_to_char[i].remove(char)
count_to_char[i + 1].add(char)
break
else:
i = 0
count_to_char[1].add(char)
max_count = max(i + 1, max_count)
# remove empty buckets again
for count in [c for c, b in count_to_char.items() if not b]:
del count_to_char[count]
return count_to_char
test = ''.join([choice(printable) for _ in range(1000)])
number = 100
print('str_count: ', timeit('f(t)', 'from __main__ import str_count as f, test as t', number=number))
print('pieters_default_dict: ', timeit('f(t)', 'from __main__ import pieters_default_dict as f, test as t', number=number))
print('Counter: ', timeit('f(t)', 'from __main__ import counter as f, test as t', number=number))
print('pieters_buckets: ', timeit('f(t)', 'from __main__ import pieters_buckets as f, test as t', number=number))
Timeit Python 2.7.12 和 iteritems() 返回:
pieters_default_dict: 0.013843059539794922
str_count: 0.016570091247558594
Counter: 0.030740022659301758
pieters_buckets: 0.1262810230255127
在 Python 3.5.2 和 items() 上:
Counter: 0.00771436400100356
pieters_default_dict: 0.013124741999490652
str_count: 0.017287666001720936
pieters_buckets: 0.11816959099996893
更新
- 从 str_count() 函数中消除了不必要的循环,感谢 Martijn Pieters。
- 此脚本已在 Python 2.7.12
上测试
更新 2
- 删除函数中的导入,跟进评论。
- 修复了从问题中复制的代码,因为正在创建两个计数器。
更新 3
- 将 timeit 结果添加到由 Pieters 编写的脚本的其他两个版本中。
- 在 Python 3.5.2
上添加了 timeit 结果
- 创建 pieters_buckets() 方法只是为了说明目的,所以在这里计时只是出于好奇。
更新 4
- 修复了代码以使用随机字符串和更多的重复次数。
给定一个计数器,例如:
>>> from collections import Counter
>>> Counter('123112415121361273')
Counter({'1': 7, '2': 4, '3': 3, '5': 1, '4': 1, '7': 1, '6': 1})
如何反转索引并将计数作为键并将值作为原始字符串键的列表?
目的是将上面的例子转换成这样:
defaultdict(<type 'list'>, {1: ['5', '4', '7', '6'], 3: ['3'], 4: ['2'], 7: ['1']})
我试过手动重复 Counter
:
>>> from collections import Counter
>>> Counter('123112415121361273')
Counter({'1': 7, '2': 4, '3': 3, '5': 1, '4': 1, '7': 1, '6': 1})
>>> x = Counter('123112415121361273')
>>> from collections import Counter, defaultdict
>>> y = defaultdict(list)
>>> for s, count in x.items():
... y[count].append(s)
...
>>> y
defaultdict(<type 'list'>, {1: ['5', '4', '7', '6'], 3: ['3'], 4: ['2'], 7: ['1']})
但是还有其他方法吗?
由于输入是字符串 '123112415121361273'
并且输出应该是由计数索引的字典,有什么方法可以避免第一次迭代时的计数步骤并得到生成的 defaultdict?
不,没有更有效的方法。
计数最好用映射完成,这正是 Counter
所做的。由于在完全遍历字符串之前您不知道任何字符的最终计数,因此在完成计数之前您无法预先知道将字符归档到哪个桶中。
因此 效率低 的替代方法是从计数到字符的映射开始,然后将字符向上移动到下一个存储桶,因为您发现它们已经有了计数。发现他们已经有一个计数需要你对每个桶进行测试,这样就变成了一个 O(NK) 解决方案,而不是 Counter
给你的直接线性 O(N) 解决方案。
## Warning: this is not an efficient approach; use for illustration purposes only
from collections import defaultdict
s = '123112415121361273'
count_to_char = defaultdict(set) # use a set to avoid O(N**2) performance
max_count = 0
for char in s: # loop over N items
for i in range(1, max_count + 1): # loop over up to K buckets
if char in count_to_char[i]:
count_to_char[i].remove(char)
count_to_char[i + 1].add(char)
break
else:
i = 0
count_to_char[1].add(char)
max_count = max(i + 1, max_count)
# remove empty buckets again
for count in [c for c, b in count_to_char.items() if not b]:
del count_to_char[count]
# alternative method to clear empty buckets, producing a regular dict
# count_to_char = {c: b for c, b in count_to_char.items() if b}
避免扫描 K 个桶的方法是使用您已经使用的计数器。
from timeit import timeit
from random import choice
from collections import Counter, defaultdict
from string import printable
def str_count(input_num, defaultdict=defaultdict):
d = defaultdict(list)
for count, s in map(lambda x: (input_num.count(x), x), set(input_num)):
d[count].append(s)
return d
def counter(input_num, defaultdict=defaultdict, Counter=Counter):
x = Counter(input_num)
y = defaultdict(list)
for s, count in x.items():
y[count].append(s)
return y
def pieters_default_dict(input_num, defaultdict=defaultdict):
x = defaultdict(int)
for c in input_num:
x[c] += 1
y = defaultdict(list)
for s, count in x.items():
y[count].append(s)
return y
def pieters_buckets(input_num, defaultdict=defaultdict):
## Warning: this is not an efficient approach; use for illustration purposes only
count_to_char = defaultdict(set) # use a set to avoid O(N**2) performance
max_count = 0
for char in input_num: # loop over N items
for i in range(1, max_count + 1): # loop over up to K buckets
if char in count_to_char[i]:
count_to_char[i].remove(char)
count_to_char[i + 1].add(char)
break
else:
i = 0
count_to_char[1].add(char)
max_count = max(i + 1, max_count)
# remove empty buckets again
for count in [c for c, b in count_to_char.items() if not b]:
del count_to_char[count]
return count_to_char
test = ''.join([choice(printable) for _ in range(1000)])
number = 100
print('str_count: ', timeit('f(t)', 'from __main__ import str_count as f, test as t', number=number))
print('pieters_default_dict: ', timeit('f(t)', 'from __main__ import pieters_default_dict as f, test as t', number=number))
print('Counter: ', timeit('f(t)', 'from __main__ import counter as f, test as t', number=number))
print('pieters_buckets: ', timeit('f(t)', 'from __main__ import pieters_buckets as f, test as t', number=number))
Timeit Python 2.7.12 和 iteritems() 返回:
pieters_default_dict: 0.013843059539794922
str_count: 0.016570091247558594
Counter: 0.030740022659301758
pieters_buckets: 0.1262810230255127
在 Python 3.5.2 和 items() 上:
Counter: 0.00771436400100356
pieters_default_dict: 0.013124741999490652
str_count: 0.017287666001720936
pieters_buckets: 0.11816959099996893
更新
- 从 str_count() 函数中消除了不必要的循环,感谢 Martijn Pieters。
- 此脚本已在 Python 2.7.12 上测试
更新 2
- 删除函数中的导入,跟进评论。
- 修复了从问题中复制的代码,因为正在创建两个计数器。
更新 3
- 将 timeit 结果添加到由 Pieters 编写的脚本的其他两个版本中。
- 在 Python 3.5.2 上添加了 timeit 结果
- 创建 pieters_buckets() 方法只是为了说明目的,所以在这里计时只是出于好奇。
更新 4
- 修复了代码以使用随机字符串和更多的重复次数。