How to find the first index of any of a set of characters in a string
I want to find the index of the first occurrence of any "special" character in a string, like so:
>>> "Hello world!".index([' ', '!'])
5
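…except that this doesn't work: str.index only accepts a single substring, so passing a list raises a TypeError. Of course, I can write a function that emulates this behavior: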
def first_index(s, characters):
    i = []
    for c in characters:
        try:
            i.append(s.index(c))
        except ValueError:
            pass
    if not i:
        raise ValueError
    return min(i)
I could also use regular expressions, but both solutions seem like overkill. Is there any "sane" way to do this in Python?
Use a generator expression and the find method.
>>> a = [' ', '!']
>>> s = "Hello World!"
>>> min(s.find(i) for i in a)
5
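find returns -1 for a character that is not in the string, which would then become the minimum. To drop those -1 results, add a filter to the generator expression: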
>>> a = [' ', '!','$']
>>> s = "Hello World!"
>>> min(s.find(i) for i in a if i in s)
5
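Or you can substitute a sentinel for the missing characters. None does not work for this (Python 3 refuses to compare None with an int, and Python 2 sorts None before every number), so use a value larger than any valid index, such as len(s):
>>> min(s.find(i) if i in s else len(s) for i in a)
5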
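On Python 3.4 and later, min also accepts a default keyword, which covers the case where none of the characters occur at all. A minimal sketch, assuming a hypothetical helper name first_index_find and assuming that returning -1 for "no match" (as str.find does) is acceptable:

```python
def first_index_find(s, chars):
    """Return the lowest index in s of any character in chars, or -1."""
    # str.find returns -1 for a miss, so drop those before taking the minimum.
    hits = (pos for pos in (s.find(c) for c in chars) if pos != -1)
    # default= (Python 3.4+) is returned when nothing was found at all.
    return min(hits, default=-1)


print(first_index_find("Hello World!", [" ", "!", "$"]))  # 5
print(first_index_find("Hello World!", ["$"]))            # -1
```

Each character is searched for only once here, instead of once for the `in` test and once more for find.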
Adding timeit results:
$ python -m timeit "a = [' ', '\!'];s = 'Hello World\!';min(s.find(i) for i in a if i in s)"
1000000 loops, best of 3: 0.902 usec per loop
$ python -m timeit "a = [' ', '\!'];s = 'Hello World\!';next((i for i, ch in enumerate(s) if ch in a),None)"
1000000 loops, best of 3: 1.25 usec per loop
$ python -m timeit "a = [' ', '\!'];s = 'Hello World\!';min(map(lambda x: (s.index(x) if (x in s) else len(s)), a))"
1000000 loops, best of 3: 1.12 usec per loop
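For your example, Padraic's beautiful solution is a little slower. On large test cases, however, it is definitely the winner. (Somewhat surprisingly, alfasin's "not as optimized" solution is also faster here.)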
Adding implementation details:
>>> def take1(s,a):
...     min(s.find(i) for i in a if i in s)
...
>>> import dis
>>> dis.dis(take1)
2 0 LOAD_GLOBAL 0 (min)
3 LOAD_CLOSURE 0 (s)
6 BUILD_TUPLE 1
9 LOAD_CONST 1 (<code object <genexpr> at 0x7fa622e961b0, file "<stdin>", line 2>)
12 MAKE_CLOSURE 0
15 LOAD_FAST 1 (a)
18 GET_ITER
19 CALL_FUNCTION 1
22 CALL_FUNCTION 1
25 POP_TOP
26 LOAD_CONST 0 (None)
29 RETURN_VALUE
>>> def take2(s,a):
...     next((i for i, ch in enumerate(s) if ch in a),None)
...
>>> dis.dis(take2)
2 0 LOAD_GLOBAL 0 (next)
3 LOAD_CLOSURE 0 (a)
6 BUILD_TUPLE 1
9 LOAD_CONST 1 (<code object <genexpr> at 0x7fa622e96e30, file "<stdin>", line 2>)
12 MAKE_CLOSURE 0
15 LOAD_GLOBAL 1 (enumerate)
18 LOAD_FAST 0 (s)
21 CALL_FUNCTION 1
24 GET_ITER
25 CALL_FUNCTION 1
28 LOAD_CONST 0 (None)
31 CALL_FUNCTION 2
34 POP_TOP
35 LOAD_CONST 0 (None)
38 RETURN_VALUE
>>> def take3(s,a):
...     min(map(lambda x: (s.index(x) if (x in s) else len(s)), a))
...
>>> dis.dis(take3)
2 0 LOAD_GLOBAL 0 (min)
3 LOAD_GLOBAL 1 (map)
6 LOAD_CLOSURE 0 (s)
9 BUILD_TUPLE 1
12 LOAD_CONST 1 (<code object <lambda> at 0x7fa622e44eb0, file "<stdin>", line 2>)
15 MAKE_CLOSURE 0
18 LOAD_FAST 1 (a)
21 CALL_FUNCTION 2
24 CALL_FUNCTION 1
27 POP_TOP
28 LOAD_CONST 0 (None)
31 RETURN_VALUE
As you can see, Padraic's version has to load the global functions next and enumerate, plus the None at the end, and that costs time. In alfasin's solution, the main slowdown is the lambda function.
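The listings above can be reproduced on your own interpreter with the standard dis module; opcode names and offsets differ between Python versions, so expect the output to look different. Keep in mind that a bytecode listing shows which instructions run, not how long each one takes, so it is best read alongside actual timings. A small sketch (take2 is the function from above, with a return added so it can be called):

```python
import dis


def take2(s, a):
    # Same expression as above, with an explicit return so the result is usable.
    return next((i for i, ch in enumerate(s) if ch in a), None)


# Print the bytecode as compiled by the current interpreter.
dis.dis(take2)

# Collect the opcode names of the outer function for a quick look.
opnames = [ins.opname for ins in dis.get_instructions(take2)]
print(opnames)
```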
You can use enumerate and next with a generator expression to get the first match, or None if no character from the set appears in s:
s = "Hello world!"
st = {"!"," "}
ind = next((i for i, ch in enumerate(s) if ch in st),None)
print(ind)
You can pass any value you like as the default for next to return when there is no match.
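As an illustration of that default, here is a short sketch in which none of the characters occur; the choice of -1 as the default is only an assumption made for this example:

```python
s = "Hello world!"
st = {"$", "#"}  # none of these occur in s

# The second argument to next() is returned when the generator is exhausted.
ind = next((i for i, ch in enumerate(s) if ch in st), -1)
print(ind)  # -1

# Without a default, next() raises StopIteration instead.
try:
    next(i for i, ch in enumerate(s) if ch in st)
except StopIteration:
    print("no match")
```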
If you want a function that raises a ValueError:
def first_index(s, characters):
    st = set(characters)
    ind = next((i for i, ch in enumerate(s) if ch in st), None)
    if ind is not None:
        return ind
    raise ValueError
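For small inputs, using a set makes little if any difference, but for large strings it will be more efficient.
Some timings:
In the string, last character of the character set: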
In [40]: s = "Hello world!" * 100
In [41]: string = s
In [42]: %%timeit
st = {"x","y","!"}
next((i for i, ch in enumerate(s) if ch in st), None)
....:
1000000 loops, best of 3: 1.71 µs per loop
In [43]: %%timeit
specials = ['x', 'y', '!']
min(map(lambda x: (string.index(x) if (x in string) else len(string)), specials))
....:
100000 loops, best of 3: 2.64 µs per loop
Larger character set, mostly not in the string (note that "w" does occur in s, at index 6):
In [44]: %%timeit
st = {"u","v","w","x","y","z"}
next((i for i, ch in enumerate(s) if ch in st), None)
....:
1000000 loops, best of 3: 1.49 µs per loop
In [45]: %%timeit
specials = ["u","v","w","x","y","z"]
min(map(lambda x: (string.index(x) if (x in string) else len(string)), specials))
....:
100000 loops, best of 3: 5.48 µs per loop
In the string, first character of the character set:
In [47]: %%timeit
specials = ['H', 'y', '!']
min(map(lambda x: (string.index(x) if (x in string) else len(string)), specials))
....:
100000 loops, best of 3: 2.02 µs per loop
In [48]: %%timeit
st = {"H","y","!"}
next((i for i, ch in enumerate(s) if ch in st), None)
....:
1000000 loops, best of 3: 903 ns per loop
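To rerun these comparisons outside IPython, the standard timeit module can be used. The sketch below wraps the two expressions timed above in functions with hypothetical names (scan_with_set, min_of_index) and uses the same three character sets. The numbers depend on the machine and the Python version, so no results are shown.

```python
import timeit


def scan_with_set(s, chars):
    # Walk the string once and stop at the first character found in the set.
    st = set(chars)
    return next((i for i, ch in enumerate(s) if ch in st), None)


def min_of_index(s, chars):
    # Search the string once or twice per character, then take the minimum.
    return min(map(lambda x: s.index(x) if x in s else len(s), chars))


s = "Hello world!" * 100
cases = [
    ["x", "y", "!"],
    ["u", "v", "w", "x", "y", "z"],
    ["H", "y", "!"],
]
runs = 1000  # kept small so the script finishes quickly; raise for steadier numbers

for chars in cases:
    for func in (scan_with_set, min_of_index):
        seconds = timeit.timeit(lambda: func(s, chars), number=runs)
        print(chars, func.__name__, "%.2f us per call" % (seconds / runs * 1e6))
```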
Not as optimized as Padraic Cunningham's solution, but still a one-liner:
string = "Hello world!"
specials = [' ', '!', 'x']
min(map(lambda x: (string.index(x) if (x in string) else len(string)), specials))
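Note that this one-liner returns len(string) when none of the characters occur, so the caller has to treat that value as "not found". A sketch of a wrapper that turns it into a ValueError, like str.index; the name first_index_min is made up for this example:

```python
def first_index_min(string, specials):
    """Lowest index in string of any character in specials; ValueError if none."""
    # len(string) is the sentinel for "not found": larger than any valid index.
    idx = min(map(lambda x: string.index(x) if x in string else len(string), specials))
    if idx == len(string):
        raise ValueError("none of the characters were found")
    return idx


print(first_index_min("Hello world!", [" ", "!", "x"]))  # 5
```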
I prefer the re module, because it is built in and already tested. It is also optimized for exactly this kind of thing.
>>> import re
>>> re.search(r'[ !]', 'Hello World!').start()
5
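re.search returns None when nothing matches, so you will probably want to check that a match was found, or catch the resulting exception when it wasn't.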
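One way to do that check is sketched below. The function name first_index_re is hypothetical, and building the character class with re.escape is an addition of this sketch so that characters such as "]" or "^" are matched literally:

```python
import re


def first_index_re(s, characters):
    """Index of the first character of s that is in characters.

    Raises ValueError when there is no match, like str.index.
    """
    if not characters:
        # "[]" is not a valid character class, so handle the empty set here.
        raise ValueError("no characters given")
    # re.escape keeps characters such as ']' or '^' literal inside the class.
    pattern = "[" + re.escape("".join(characters)) + "]"
    match = re.search(pattern, s)
    if match is None:
        raise ValueError("none of the characters were found")
    return match.start()


print(first_index_re("Hello World!", [" ", "!"]))  # 5
```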
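There can be reasons not to use re, but I would want to see a good comment justifying that choice. Thinking you can "do it better" is usually unnecessary, and it makes the code harder for others to read and harder to maintain.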