常数时间内的字符串比较

Question

我在考虑比较两个字符串的更快方法。检查值是否存在于 Python 集合（散列 table）中具有常数时间。这是否意味着在 set 中查找字符串也有常数时间？

print('tya' == 'tya') #O(n)
mySet = set()
mySet.add('tya')
if 'tya' in mySet: #O(1) <-- ???
    print('True')

在更一般的情况下，这是否意味着我可以在线性时间内找到字符串中的子字符串？？？

def NaiveFind(largeString, subString):
    mySet = set()
    mySet.add(subString)
    index = -1
    start = 0
    end = len(subString)

    while(end < len(largeString)): #O(n-m)
        windowFromLarge = largeString[start:end]
        if(windowFromLarge in mySet): #O(1) <------- LINEAR ???
        #if(windowFromLarge == subString): #O(m)
            return start
        start += 1
        end += 1

    return index

Answer 1

你说

Checking if value exists in Python set (hash table) has constant time.

但这是一种常见的过度简化，因为人们没有意识到他们正在这样做，或者因为每次都说出实际行为会花费更长的时间。

检查一个值是否存在于 Python 集合中需要一个 平均情况常数 数量的 散列运算和相等比较，假设散列冲突不会失控。它不会自动使散列运算和相等比较成为常量时间。

您的 NaiveFind 算法不是线性时间，因为您忽略了哈希计算的成本（也因为字符串切片需要 CPython 中的副本）。 Rabin-Karp algorithm uses a refined version of your idea, where the hash is a rolling hash to avoid this problem. The Rabin-Karp algorithm is average-case linear time, as long as hash collisions don't get out of hand. There are also algorithms like Knuth-Morris-Pratt that are guaranteed linear time, and algorithms like Boyer-Moore在通常情况下可以比线性时间更好。

常数时间内的字符串比较

String comparison in constant time

python

string-comparison

python-3.x