Finding a needle in a haystack: what is a better solution?

Question

So, given "needle" and "there is a needle in this but not thisneedle haystack"

I wrote

def find_needle(n,h):
    count = 0
    words = h.split(" ")
    for word in words:
        if word == n:
            count += 1
    return count

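This is O(n), but I am wondering whether there is a better approach. Perhaps one that does not split at all?

How would you write tests for this case to check that it handles all edge cases?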
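As a starting point for the testing question, a minimal sketch of an edge-case table might look like the following. The names EDGE_CASES and run_edge_cases are made up for illustration, and the expected values record what the split-based version above does today, which is not necessarily what it should do.

```python
def find_needle(n, h):
    # Copied from the question: counts space-delimited words equal to n.
    count = 0
    for word in h.split(" "):
        if word == n:
            count += 1
    return count

# (needle, haystack, expected count for the implementation above)
EDGE_CASES = [
    ("needle", "there is a needle in this but not thisneedle haystack", 1),
    ("needle", "", 0),                       # empty haystack
    ("needle", "needle", 1),                 # haystack is exactly the needle
    ("needle", "needle needle needle", 3),   # repeated matches
    ("needle", "needle  needle", 2),         # consecutive spaces
    ("needle", "Needle needle", 1),          # comparison is case-sensitive
    ("needle", "a needle, here", 0),         # trailing punctuation is not stripped
    ("needle", "needle\nneedle", 0),         # newline is not a separator for split(" ")
]

def run_edge_cases():
    # Hypothetical helper: check every case and return how many were run.
    for needle, haystack, expected in EDGE_CASES:
        actual = find_needle(needle, haystack)
        assert actual == expected, (needle, haystack, actual, expected)
    return len(EDGE_CASES)
```

The punctuation and newline rows are the ones most likely to be surprising, so they are worth deciding on explicitly before settling on an implementation.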

Answer 1

This does not address the complexity question, but it does simplify the code:

def find_needle(n,h):
    return h.split().count(n)

Answer 2

You can use Counter

from collections import Counter

def find_needle(n,h):
    return Counter(h.split())[n]

For example:

n = "portugal"
h = 'lobito programmer from portugal hello fromportugal portugal'

print(find_needle(n, h))

Output:

2

Answer 3

This is still O(n), but it uses the power of the re module and Python's generator expressions.

import re

def find_needle(n,h):
    # use regex word boundaries; re.escape keeps metacharacters in n literal
    g = re.finditer(r'\b%s\b' % re.escape(n), h)
    return sum(1 for _ in g)  # return the length of the iterator

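For a relatively large 'haystack', this should use much less memory than .split.

Note that this is not exactly the same as the code in the OP: it finds not only 'needle' but also 'needle,' and 'needle.'. It does not find 'needles', though.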
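To make that difference concrete, here is a small side-by-side sketch. The helper names count_by_split and count_by_regex and the sample sentence are invented for this comparison; they are not from the question.

```python
import re

def count_by_split(needle, haystack):
    # Same behavior as the question's version: compare space-delimited chunks.
    return haystack.split(" ").count(needle)

def count_by_regex(needle, haystack):
    # Word-boundary search; re.escape guards against regex metacharacters.
    pattern = r'\b%s\b' % re.escape(needle)
    return sum(1 for _ in re.finditer(pattern, haystack))

text = "a needle, another needle. and some needles"
split_hits = count_by_split("needle", text)   # 'needle,' and 'needle.' are not equal to 'needle'
regex_hits = count_by_regex("needle", text)   # matches before ',' and '.', but not inside 'needles'
```

Here split_hits is 0 and regex_hits is 2, so which version is "right" depends on whether punctuation should count as a word separator.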

Answer 4

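I don't think it is possible to do better than O(n) here (because you need to go through the string at least once). You can make some optimizations, though.

I assume you want to match "whole words"; for example, searching for foo should match like this: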

foo and foo, or foobar and not foo.
^^^     ^^^                    ^^^

So a split based only on spaces will not do the job, because:

>>> 'foo and foo, or foobar and not foo.'.split(' ')
['foo', 'and', 'foo,', 'or', 'foobar', 'and', 'not', 'foo.']
#                  ^                                     ^

This is where the re module comes in handy: it lets you build fancy conditions. For example, \b in a regular expression means:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

So r'\bfoo\b' will match only the whole word foo. Also, don't forget to use re.escape():

>>> print(re.escape('foo.bar+'))
foo\.bar\+
>>> print(r'\b{}\b'.format(re.escape('foo.bar+')))
\bfoo\.bar\+\b

All you have to do now is use re.finditer() to scan the string. From the documentation:

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.

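I assume the matches are generated on the fly, so they never have to be held in memory all at once (which may come in handy for large strings with many matches). Finally, just count them: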

>>> r = re.compile(r'\bfoo\b')
>>> it = r.finditer('foo and foo, or foobar and not foo.')
>>> sum(1 for _ in it)
3
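Putting the pieces together, the steps above could be wrapped up roughly like this. The function name count_whole_words is my own, and the sketch assumes the needle starts and ends with a word character, since otherwise \b will not anchor where you might expect.

```python
import re

def count_whole_words(needle, haystack):
    # Escape the needle so characters such as '.' or '+' are matched literally,
    # then wrap it in word boundaries and count matches lazily.
    pattern = re.compile(r'\b{}\b'.format(re.escape(needle)))
    return sum(1 for _ in pattern.finditer(haystack))

foo_hits = count_whole_words('foo', 'foo and foo, or foobar and not foo.')
dotted_hits = count_whole_words('a.b', 'a.b axb a.b')  # 'axb' is skipped thanks to re.escape
```

Without re.escape, the '.' in 'a.b' would match any character and 'axb' would be counted as well.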

Answer 5

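Actually, when you say O(n), you are forgetting that after matching the first letter you also have to match the rest of the letters (to match needle against the sentence you match n, then e, then the next e, and so on). You are essentially trying to replicate what grep does, so you can look at the algorithms grep uses. You can do well by building a finite state machine. There are plenty of links that can help; you could start with "How does grep run so fast?".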
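As one illustration of the state-machine idea, here is a sketch of the Knuth-Morris-Pratt matcher. This is a technique I am swapping in as an example, not a claim about what grep itself does. It counts plain substring occurrences (including overlapping ones), not whole words, and the function names are hypothetical.

```python
def build_failure_table(needle):
    # table[i] = length of the longest proper prefix of needle[:i+1]
    # that is also a suffix of it.
    table = [0] * len(needle)
    k = 0
    for i in range(1, len(needle)):
        while k > 0 and needle[i] != needle[k]:
            k = table[k - 1]
        if needle[i] == needle[k]:
            k += 1
        table[i] = k
    return table

def count_occurrences(needle, haystack):
    # Scans the haystack once; runs in O(len(needle) + len(haystack)).
    if not needle:
        return 0
    table = build_failure_table(needle)
    count = 0
    state = 0  # number of needle characters matched so far
    for ch in haystack:
        while state > 0 and ch != needle[state]:
            state = table[state - 1]  # fall back without re-reading the haystack
        if ch == needle[state]:
            state += 1
        if state == len(needle):
            count += 1
            state = table[state - 1]  # keep going, allowing overlaps
    return count
```

The point of the failure table is that a mismatch never sends the scan back to an earlier haystack character, which addresses the "match n, then e, then the next e" concern.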

Answer 6

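If you care about the time it takes (as opposed to the time complexity), multiprocess it. Basically, make n smaller. Here is an example that runs in 2 processes.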

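from multiprocessing import Process, Queue

def find(word, string, results):
    # A Process target's return value is discarded, so report via the queue.
    results.put(string.count(word))

def search_for_words(word, string):
    half = len(string) // 2  # integer division so the slice index is an int
    results = Queue()
    # Note: a match that straddles the split point is missed by both halves.
    proc1 = Process(target=find, args=(word, string[:half], results))
    proc2 = Process(target=find, args=(word, string[half:], results))
    proc1.start()
    proc2.start()
    total = results.get() + results.get()
    proc1.join()
    proc2.join()
    return total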

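If you are worried about it being O(n), well, I am not sure how much you can do, unless it is possible to get the string into another data structure, such as a set or something. (Putting it into that set is also O(n), but you can save time if you are already iterating over the string somewhere else and build the structure then. Write once, read many.)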
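A rough sketch of that "write once, read many" idea, using a Counter rather than a set so that counts are kept. The WordIndex class is a made-up name, and it assumes whitespace-delimited words are what you want to look up.

```python
from collections import Counter

class WordIndex:
    """Hypothetical helper: pay the O(n) cost once, then answer lookups quickly."""

    def __init__(self, haystack):
        # One pass over the haystack builds a word -> count mapping.
        self._counts = Counter(haystack.split())

    def count(self, needle):
        # Average-case constant-time dictionary lookup; missing words give 0.
        return self._counts[needle]

index = WordIndex("there is a needle in this but not thisneedle haystack")
needle_hits = index.count("needle")
missing_hits = index.count("pin")
```

This only pays off when the same haystack is queried many times; for a single lookup it does the same O(n) work as the other answers.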

Answer 7

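To be sure of finding the needle in the haystack, you need to examine each piece of hay until you find the needle. That is O(n) no matter what, a tight lower bound.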

Answer 8

def find_needle(haystack):
    # haystack is a list of items; report the index of the first 'needle'
    for position, item in enumerate(haystack):
        if item == 'needle':
            return 'found the needle at position ' + str(position)

Answer 9

Here is mine.

def find_needle(haystack, needle):
    return haystack.count(needle)

Here, we simply use the built-in count method to count the number of needles in the haystack.
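One thing worth checking with this approach: if haystack is a string, str.count counts substrings, so it also counts a needle embedded in a longer word. A quick sketch using the sentence from the question (the variable names are mine):

```python
haystack = "there is a needle in this but not thisneedle haystack"

# str.count sees 'needle' inside 'thisneedle' as well
substring_hits = haystack.count("needle")

# list.count on the split words only sees exact whole-word matches
whole_word_hits = haystack.split().count("needle")
```

Here substring_hits is 2 while whole_word_hits is 1, so the two are not interchangeable for the original question.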


python

dynamic-programming