给定一百万个数字的字符串，return 所有重复的 3 位数字

Question

几个月前我在纽约面试了一家对冲基金公司，很遗憾，我没有得到 data/software 工程师的实习机会。（他们还要求解决方案在 Python 中。）

我在第一次面试问题上几乎搞砸了...

Question: Given a string of a million numbers (Pi for example), write a function/program that returns all repeating 3 digit numbers and number of repetition greater than 1

例如：如果字符串是：123412345123456 那么 function/program 将是 return:

123 - 3 times
234 - 3 times
345 - 2 times

在我面试失败后他们没有给我解决方案，但他们确实告诉我解决方案的时间复杂度恒定为 1000，因为所有可能的结果都介于：

000 --> 999

现在想想，我觉得不可能想出一个恒定时间的算法。是吗？

Answer 1

常数时间是不可能的。所有 100 万位数字都需要至少查看一次，因此时间复杂度为 O(n)，在这种情况下 n = 100 万。

对于简单的 O(n) 解决方案，创建一个大小为 1000 的数组，表示每个可能的 3 位数字出现的次数。一次前进 1 位，第一个索引 == 0，最后一个索引 == 999997，并递增数组 [3 位数字] 以创建直方图（每个可能的 3 位数字的出现次数）。然后输出计数> 1的数组内容。

Answer 2

简单的 O(n) 解决方案是计算每个 3 位数：

for nr in range(1000):
    cnt = text.count('%03d' % nr)
    if cnt > 1:
        print '%03d is found %d times' % (nr, cnt)

这将搜索所有 100 万个数字 1000 次。

只遍历数字一次：

counts = [0] * 1000
for idx in range(len(text)-2):
    counts[int(text[idx:idx+3])] += 1

for nr, cnt in enumerate(counts):
    if cnt > 1:
        print '%03d is found %d times' % (nr, cnt)

计时显示仅在索引上迭代一次是使用 count 的两倍快。

Answer 3

根据我的理解，你不可能在恒定时间内找到解决方案。它将至少通过百万位数字（假设它是一个字符串）。您可以对百万长度数字的数字进行 3 位滚动迭代，如果哈希键已经存在，则将哈希键的值增加 1，或者如果哈希键不存在，则创建一个新的哈希键（由值 1 初始化）词典。

代码看起来像这样：

def calc_repeating_digits(number):

    hash = {}

    for i in range(len(str(number))-2):

        current_three_digits = number[i:i+3]
        if current_three_digits in hash.keys():
            hash[current_three_digits] += 1

        else:
            hash[current_three_digits] = 1

    return hash

您可以向下过滤到项目值大于 1 的键。

Answer 4

我会按如下方式解决问题：

def find_numbers(str_num):
    final_dict = {}
    buffer = {}
    for idx in range(len(str_num) - 3):
        num = int(str_num[idx:idx + 3])
        if num not in buffer:
            buffer[num] = 0
        buffer[num] += 1
        if buffer[num] > 1:
            final_dict[num] = buffer[num]
    return final_dict

应用于您的示例字符串，得到：

>>> find_numbers("123412345123456")
{345: 2, 234: 3, 123: 3}

此解决方案的运行时间复杂度为 O(n)，因为 n 是所提供字符串的长度，我想这是您可以获得的最佳解决方案。

Answer 5

这是 "consensus" O(n) 算法的 NumPy 实现：边走边遍历所有三元组和 bin。合并是通过在遇到“385”时将 1 添加到 bin[3, 8, 5] 来完成的，这是一个 O(1) 操作。垃圾箱排列在 10x10x10 立方体中。由于合并是完全矢量化的，因此代码中没有循环。

def setup_data(n):
    import random
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))

def f_np(text):
    # Get the data into NumPy
    import numpy as np
    a = np.frombuffer(bytes(text, 'utf8'), dtype=np.uint8) - ord('0')
    # Rolling triplets
    a3 = np.lib.stride_tricks.as_strided(a, (3, a.size-2), 2*a.strides)

    bins = np.zeros((10, 10, 10), dtype=int)
    # Next line performs O(n) binning
    np.add.at(bins, tuple(a3), 1)
    # Filtering is left as an exercise
    return bins.ravel()

def f_py(text):
    counts = [0] * 1000
    for idx in range(len(text)-2):
        counts[int(text[idx:idx+3])] += 1
    return counts

import numpy as np
import types
from timeit import timeit
for n in (10, 1000, 1000000):
    data = setup_data(n)
    ref = f_np(**data)
    print(f'n = {n}')
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        try:
            assert np.all(ref == func(**data))
            print("{:16s}{:16.8f} ms".format(name[2:], timeit(
                'f(**data)', globals={'f':func, 'data':data}, number=10)*100))
        except:
            print("{:16s} apparently crashed".format(name[2:]))

毫不奇怪，NumPy 在大型数据集上比@Daniel 的纯 Python 解决方案快一点。示例输出：

# n = 10
# np                    0.03481400 ms
# py                    0.00669330 ms
# n = 1000
# np                    0.11215360 ms
# py                    0.34836530 ms
# n = 1000000
# np                   82.46765980 ms
# py                  360.51235450 ms

Answer 6

这是我的答案：

from timeit import timeit
from collections import Counter
import types
import random

def setup_data(n):
    digits = "0123456789"
    return dict(text = ''.join(random.choice(digits) for i in range(n)))


def f_counter(text):
    c = Counter()
    for i in range(len(text)-2):
        ss = text[i:i+3]
        c.update([ss])
    return (i for i in c.items() if i[1] > 1)

def f_dict(text):
    d = {}
    for i in range(len(text)-2):
        ss = text[i:i+3]
        if ss not in d:
            d[ss] = 0
        d[ss] += 1
    return ((i, d[i]) for i in d if d[i] > 1)

def f_array(text):
    a = [[[0 for _ in range(10)] for _ in range(10)] for _ in range(10)]
    for n in range(len(text)-2):
        i, j, k = (int(ss) for ss in text[n:n+3])
        a[i][j][k] += 1
    for i, b in enumerate(a):
        for j, c in enumerate(b):
            for k, d in enumerate(c):
                if d > 1: yield (f'{i}{j}{k}', d)


for n in (1E1, 1E3, 1E6):
    n = int(n)
    data = setup_data(n)
    print(f'n = {n}')
    results = {}
    for name, func in list(globals().items()):
        if not name.startswith('f_') or not isinstance(func, types.FunctionType):
            continue
        print("{:16s}{:16.8f} ms".format(name[2:], timeit(
            'results[name] = f(**data)', globals={'f':func, 'data':data, 'results':results, 'name':name}, number=10)*100))
    for r in results:
        print('{:10}: {}'.format(r, sorted(list(results[r]))[:5]))

数组查找方法非常快（甚至比@paul-panzer 的 numpy 方法还要快！）。当然，它会作弊，因为它在完成后并没有技术上完成，因为它返回了一个生成器。如果该值已经存在，它也不必检查每次迭代，这可能会有很大帮助。

n = 10
counter               0.10595780 ms
dict                  0.01070654 ms
array                 0.00135370 ms
f_counter : []
f_dict    : []
f_array   : []
n = 1000
counter               2.89462101 ms
dict                  0.40434612 ms
array                 0.00073838 ms
f_counter : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_dict    : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
f_array   : [('008', 2), ('009', 3), ('010', 2), ('016', 2), ('017', 2)]
n = 1000000
counter            2849.00500992 ms
dict                438.44007806 ms
array                 0.00135370 ms
f_counter : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_dict    : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]
f_array   : [('000', 1058), ('001', 943), ('002', 1030), ('003', 982), ('004', 1042)]

Answer 7

你从容地离开了，你可能不想为宽客不了解基本算法的对冲基金工作:-)

没有方法可以在O(1) 中处理任意大小的数据结构，如果在这种情况下，您需要至少访问每个元素一次。在这种情况下，您可以希望的最佳是 O(n)，其中 n 是字符串的长度。

Although, as an aside, a nominal O(n) algorithm will be O(1) for a fixed input size so, technically, they may have been correct here. However, that's not usually how people use complexity analysis.

在我看来，您可以通过多种方式给他们留下深刻印象。

首先，通过告知他们 不可能 在 O(1) 中做到这一点，除非您使用上面给出的 "suspect" 推理。

其次，通过提供 Pythonic 代码来展示您的精英技能，例如：

inpStr = '123412345123456'

# O(1) array creation.
freq = [0] * 1000

# O(n) string processing.
for val in [int(inpStr[pos:pos+3]) for pos in range(len(inpStr) - 2)]:
    freq[val] += 1

# O(1) output of relevant array values.
print ([(num, freq[num]) for num in range(1000) if freq[num] > 1])

这输出：

[(123, 3), (234, 3), (345, 2)]

尽管您当然可以将输出格式修改为您想要的任何格式。

最后，通过告诉他们 几乎可以肯定 O(n) 解决方案没有 问题，因为上面的代码提供了一个百万位数字字符串的结果不到半秒。它似乎也是线性扩展的，因为一个 10,000,000 个字符的字符串需要 3.5 秒，而一个 100,000,000 个字符的字符串需要 36 秒。

而且，如果他们需要比这更好的东西，有一些方法可以并行化这类可以大大加快速度的东西。

不在单个 Python 解释器中，当然，由于 GIL，但是你可以将字符串拆分成类似的东西（重叠由 vv 是允许正确处理边界区域所必需的）：

您可以将这些分包给不同的工人，然后再合并结果。

输入的拆分和输出的组合可能会淹没小字符串（甚至可能是百万位字符串）的任何节省，但是对于更大的数据集，它可能会有所不同。当然，我常用的 "measure, don't guess" 口头禅也适用于此。

这个咒语也适用于其他可能性，例如完全绕过 Python 并使用可能更快的不同语言。

例如，下面的 C 代码运行在与早期 Python 代码相同的硬件上，在 0.6 秒内处理一百百万位数字，与 Python 代码处理一百万的时间大致相同。换句话说，快很多：

#include <stdio.h>
#include <string.h>

int main(void) {
    static char inpStr[100000000+1];
    static int freq[1000];

    // Set up test data.

    memset(inpStr, '1', sizeof(inpStr));
    inpStr[sizeof(inpStr)-1] = '[=13=]';

    // Need at least three digits to do anything useful.

    if (strlen(inpStr) <= 2) return 0;

    // Get initial feed from first two digits, process others.

    int val = (inpStr[0] - '0') * 10 + inpStr[1] - '0';
    char *inpPtr = &(inpStr[2]);
    while (*inpPtr != '[=13=]') {
        // Remove hundreds, add next digit as units, adjust table.

        val = (val % 100) * 10 + *inpPtr++ - '0';
        freq[val]++;
    }

    // Output (relevant part of) table.

    for (int i = 0; i < 1000; ++i)
        if (freq[i] > 1)
            printf("%3d -> %d\n", i, freq[i]);

    return 0;
}

Answer 8

正如另一个答案中提到的，您不能在常数时间内执行此算法，因为您必须至少查看 n 位数字。线性时间是最快的。

然而，该算法可以在 O(1) space 中完成。您只需要存储每个 3 位数字的计数，因此您需要一个包含 1000 个条目的数组。然后您可以将号码流式传输。

我的猜测是面试官在给你解决方案时说错了，或者你听错了 "constant time" 当他们说 "constant space."

Answer 9

一百万对于我下面给出的答案来说太小了。只期望您必须能够运行面试中的解决方案，没有停顿，那么下面的工作不到两秒钟，并给出了所需的结果：

from collections import Counter

def triple_counter(s):
    c = Counter(s[n-3: n] for n in range(3, len(s)))
    for tri, n in c.most_common():
        if n > 1:
            print('%s - %i times.' % (tri, n))
        else:
            break

if __name__ == '__main__':
    import random

    s = ''.join(random.choice('0123456789') for _ in range(1_000_000))
    triple_counter(s)

希望面试官会寻找标准库的使用 collections.Counter class.

并行执行版本

我写了一个 blog post 对此有更多的解释。

Answer 10

inputStr = '123456123138276237284287434628736482376487234682734682736487263482736487236482634'

count = {}
for i in range(len(inputStr) - 2):
    subNum = int(inputStr[i:i+3])
    if subNum not in count:
        count[subNum] = 1
    else:
        count[subNum] += 1

print count

Answer 11

图片作为答案：

看起来像滑动window。

Answer 12

这是我的解决方案：

from collections import defaultdict
string = "103264685134845354863"
d = defaultdict(int)
for elt in range(len(string)-2):
    d[string[elt:elt+3]] += 1
d = {key: d[key] for key in d.keys() if d[key] > 1}

在 for 循环中发挥一点创意（例如使用 True/False/None 的附加查找列表）你应该能够去掉最后一行，因为你只想在我们访问过的字典中创建键一旦达到这一点。希望对您有所帮助:)

Answer 13

-从C的角度讲。 - 你可以有一个 int 3-d array results[10][10][10]; - 从第 0 个位置到第 n-4 个位置，其中 n 是字符串数组的大小。 - 在每个位置，检查当前、下一个和下一个的下一个。 - 将 cntr 增加为 resutls[current][next][next's next]++; - 打印

的值

results[1][2][3]
results[2][3][4]
results[3][4][5]
results[4][5][6]
results[5][6][7]
results[6][7][8]
results[7][8][9]

-O(n)时间，不涉及比较。 - 您可以通过对数组进行分区并计算分区周围的匹配来运行一些并行的东西。

给定一百万个数字的字符串，return 所有重复的 3 位数字

Given a string of a million numbers, return all repeating 3 digit numbers

python

algorithm

number-theory

data-structures

并行执行版本