Split list by tuple separator
I have a list:
print (L)
[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ'),
('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]
I want to split the list into sublists at the separator ('.', 'ZZ'), keeping the separator at the end of each sublist:
print (new_L)
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
I am interested in other possible solutions; performance is important.
My solution is:
from itertools import groupby
sep = ('.','ZZ')
new_L = [list(g) + [sep] for k, g in groupby(L, lambda x: x==sep) if not k]
print (new_L)
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')],
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
But I believe a better/faster solution exists.
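Before comparing speed, it may help to see what the groupby one-liner assumes about the input. The sketch below wraps the expression from the question in a function (the name `split_groupby` and the sample list `tricky` are made up for illustration) and feeds it two consecutive separators plus a trailing item:

```python
from itertools import groupby

sep = ('.', 'ZZ')

def split_groupby(items, sep):
    # Same expression as in the question, wrapped in a function.
    return [list(g) + [sep] for k, g in groupby(items, lambda x: x == sep) if not k]

# Hypothetical input: two separators in a row, then an item with no separator after it.
tricky = [('a', 'WW'), sep, sep, ('b', 'XX')]

result = split_groupby(tricky, sep)
print(result)
# [[('a', 'WW'), ('.', 'ZZ')], [('b', 'XX'), ('.', 'ZZ')]]
```

The run of two separators collapses into one, and a separator is appended after ('b', 'XX') even though the input had none there. For data shaped like the example in the question (every sentence ends in exactly one separator) this does not matter.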
A plain for loop should be faster than groupby.
L2 = []
for i in L[::-1]:
    if i == ('.','ZZ'):
        L2.append([])   # a separator starts a new sublist (we are walking backwards)
    L2[-1].append(i)
L2 = [x[::-1] for x in L2[::-1]]
A small tweak (it may or may not improve performance, but it is more memory-efficient) is to use reversed:
L2 = []
sep = ('.','ZZ')
for i in reversed(L):
    if i == sep:
        L2.append([])
    L2[-1].append(i)
L2 = [x[::-1] for x in reversed(L2)]
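Another improvement is to keep a separate reference to the current sublist, which avoids the repeated L2[-1] lookups:
L2 = []
cache = None   # current sublist; set on the first iteration because L ends with sep
sep = ('.','ZZ')
for i in reversed(L):
    if i == sep:
        cache = []
        L2.append(cache)
    cache.append(i)
L2 = [x[::-1] for x in reversed(L2)]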
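The loops above rely on the last element of L being the separator. If that is not guaranteed, one possible variation is to also open a sublist when none exists yet. This is only a sketch; the function name `split_reversed` is made up here:

```python
def split_reversed(items, sep):
    """Split items into sublists that each end with sep, walking the list backwards.

    Items after the last separator (if any) end up in a final sublist of their own.
    """
    groups = []
    current = None
    for item in reversed(items):
        # Open a new sublist at every separator, and also for trailing items
        # that come after the last separator.
        if item == sep or current is None:
            current = []
            groups.append(current)
        current.append(item)
    # Sublists and their contents were collected back to front, so flip both.
    return [g[::-1] for g in reversed(groups)]

print(split_reversed([1, 0, 2, 0, 3], 0))
# [[1, 0], [2, 0], [3]]
```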
Performance
Small: len(L) = 8
100000 loops, best of 3: 5.11 µs per loop  # groupby
100000 loops, best of 3: 3.54 µs per loop  # loop
Large: len(L) = 800000
1 loop, best of 3: 435 ms per loop  # groupby
1 loop, best of 3: 310 ms per loop  # PM 2Ring's groupby
1 loop, best of 3: 250 ms per loop  # loop
1 loop, best of 3: 235 ms per loop  # loop w/ reverse
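Your code looks fine to me, but you can speed it up by getting rid of that lambda, e.g.
groupby(L, sep.__eq__)
Not only is the code shorter, it also avoids creating the lambda function and the relatively slow Python-level function call made for every element.
You can also build [sep] outside the loop, which may save a few microseconds. ;)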
from itertools import groupby
L = [('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ'),
('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]
sep = ('.','ZZ')
seplist = [sep]
new_L = [list(g) + seplist for k, g in groupby(L, sep.__eq__) if not k]
for row in new_L:
    print(row)
Output
[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')]
[('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]
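One detail about sep.__eq__: called directly, tuple.__eq__ returns NotImplemented rather than False when the other object is not a tuple. That cannot happen with the list in the question, where every element is a tuple. For mixed lists, a possible alternative that also avoids a lambda is functools.partial with operator.eq. This is a sketch; whether it matches the speed of sep.__eq__ would need to be measured:

```python
from functools import partial
from itertools import groupby
from operator import eq

L = [('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ'),
     ('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]
sep = ('.', 'ZZ')

# Calling the dunder method directly skips the fallback logic of the == operator.
print(sep.__eq__('not a tuple') is NotImplemented)  # True

# partial(eq, sep)(x) evaluates sep == x, which gives a plain bool here.
is_sep = partial(eq, sep)
seplist = [sep]
new_L = [list(g) + seplist for k, g in groupby(L, is_sep) if not k]
for row in new_L:
    print(row)
```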
a = []
start = 0
while start < len(l):
    try:
        # list.index raises ValueError (it never returns -1) when sep is absent
        end = l.index(sep, start) + 1
    except ValueError:
        break
    a.append(l[start:end])
    start = end
That's my solution. It is simple and readable.
A for-loop approach will be faster, and it needs only a single pass:
>>> def juan(L, sep):
...     L2 = []
...     sub = []
...     for x in L:
...         sub.append(x)
...         if x == sep:
...             L2.append(sub)
...             sub = []
...     if sub:
...         L2.append(sub)
...     return L2
...
>>> juan(L, sep)
[[('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ')], [('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]]
Some comparisons:
>>> def jezrael(L, sep):
...     return [list(g) + [sep] for k, g in groupby(L, lambda x: x==sep) if not k]
...
>>> def coldspeed(L, sep):
...     L2 = []
...     for i in reversed(L):
...         if i == sep:
...             L2.append([])
...         L2[-1].append(i)
...     return [x[::-1] for x in reversed(L2)]
...
>>> def pm2ring(L, sep):
...     seplist = [sep]
...     return [list(g) + seplist for k, g in groupby(L, sep.__eq__) if not k]
...
>>> setup = "from __main__ import L, sep, juan, coldspeed, pm2ring, jezrael"
Edit: more timings
>>> def buzzycoder(L, sep):
...     a = []
...     length = len(L)
...     start = 0
...     end = L.index(sep)
...     if start < length: a.append(L[start:end+1])
...     start = end + 1
...     while start < length:
...         end = L.index(sep, start) + 1
...         a.append(L[start:end])
...         start = end
...     return a
...
>>> def splitList(l, s):
...     ''' l is list, s is separator, similar to split, but keep separator'''
...     i = 0
...     for _ in range(l.count(s)):  # break using slices
...         e = l.index(s,i)
...         yield l[i:e+1]  # sublist generator value
...         i = e+1
...     if i < len(l): yield l[i:]  # pick up any list left over
...
>>> def bharath(x,sep):
...     n = [0] + [i+1 for i,j in enumerate(x) if j == sep]
...     m= list()
...     for first, last in zip(n, n[1:]):
...         m.append(x[first:last])
...     return m
...
Results:
>>> timeit.timeit("jezrael(L, sep)", setup)
4.1499102029483765
>>> timeit.timeit("pm2ring(L, sep)", setup)
3.3499899921007454
>>> timeit.timeit("coldspeed(L, sep)", setup)
2.868469718960114
>>> timeit.timeit("juan(L, sep)", setup)
1.5428746730322018
>>> timeit.timeit("buzzycoder(L, sep)", setup)
1.5942967369919643
>>> timeit.timeit("list(splitList(L, sep))", setup)
2.7872562300181016
>>> timeit.timeit("bharath(L, sep)", setup)
2.9842335029970855
With a larger list:
>>> L = L*100000
>>> timeit.timeit("jezrael(L, sep)", setup, number=10)
3.3555950550362468
>>> timeit.timeit("pm2ring(L, sep)", setup, number=10)
2.337177241919562
>>> timeit.timeit("coldspeed(L, sep)", setup, number=10)
2.2037084710318595
>>> timeit.timeit("juan(L, sep)", setup, number=10)
1.3625159269431606
>>> timeit.timeit("buzzycoder(L, sep)", setup, number=10)
1.4375156159512699
>>> timeit.timeit("list(splitList(L, sep))", setup, number=10)
1.6824725979240611
>>> timeit.timeit("bharath(L, sep)", setup, number=10)
1.5603888860205188
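Caveat
These timings do not explore how performance varies with the proportion of sep in L, which will have a large effect on the timings of some of these solutions.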
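To get a feel for that effect, a small harness along the following lines could be used. It is only a sketch: `make_list` is a hypothetical helper that places a separator at every `every`-th position, the list size and repeat count are arbitrary, and only two of the functions are included to keep it short. No timings are quoted here because they depend on the machine and Python version.

```python
import timeit

def juan(L, sep):
    L2 = []
    sub = []
    for x in L:
        sub.append(x)
        if x == sep:
            L2.append(sub)
            sub = []
    if sub:
        L2.append(sub)
    return L2

def bharath(x, sep):
    n = [0] + [i + 1 for i, j in enumerate(x) if j == sep]
    return [x[first:last] for first, last in zip(n, n[1:])]

def make_list(n, every):
    """Hypothetical helper: n tuples with the separator at every `every`-th position."""
    sep = ('.', 'ZZ')
    word = ('w', 'XX')
    return [sep if (i + 1) % every == 0 else word for i in range(n)]

sep = ('.', 'ZZ')
for every in (2, 4, 100):
    data = make_list(10_000, every)
    # Both functions should agree, since each generated list ends with sep.
    assert juan(data, sep) == bharath(data, sep)
    for func in (juan, bharath):
        t = timeit.timeit(lambda: func(data, sep), number=20)
        print(f"sep every {every:>3} items  {func.__name__:<8} {t:.4f} s")
```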
Using a generator with slicing is very fast:
def splitList(l, s):
    ''' l is list, s is separator, similar to split, but keep separator'''
    i = 0
    for _ in range(l.count(s)):  # break using slices
        e = l.index(s,i)
        yield l[i:e+1]  # sublist generator value
        i = e+1
    if i < len(l): yield l[i:]  # pick up any list left over
l = [('I', 'WW'), ('am', 'XX'), ('newbie', 'YY'), ('.', 'ZZ'),
('You', 'WW'), ('are', 'XX'), ('cool', 'YY'), ('.', 'ZZ')]
print(list(splitList(l, ('.', 'ZZ'))))
You can also use it with other lists and separators.
l = ['tom','dick','x',"harry",'x','sally','too']
print(list(splitList(l, 'x')))
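A possible variation on the same idea drops the up-front l.count(s) pass and lets list.index signal the end by raising ValueError. This is a sketch with a made-up name (`split_keep_sep`), not a measured improvement:

```python
def split_keep_sep(items, sep):
    """Yield slices of items, each ending with sep; a last slice holds any leftover items."""
    start = 0
    while True:
        try:
            end = items.index(sep, start) + 1
        except ValueError:  # no further separator in the list
            break
        yield items[start:end]
        start = end
    if start < len(items):
        yield items[start:]

print(list(split_keep_sep(['tom', 'dick', 'x', 'harry', 'x', 'sally', 'too'], 'x')))
# [['tom', 'dick', 'x'], ['harry', 'x'], ['sally', 'too']]
print(list(split_keep_sep(['no', 'separator'], 'x')))
# [['no', 'separator']]
```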
Another solution, using enumerate to find the split points and zip to pair them up, i.e.
def bharath(x,sep):
    n = [0] + [i+1 for i,j in enumerate(x) if j == sep]
    m= list()
    for first, last in zip(n, n[1:]):
        m.append(x[first:last])
    return m
%%timeit
bharath(L,('.','ZZ'))
100000 loops, best of 3: 3.74 µs per loop
L = L*100000
bharath(L,('.','ZZ'))
1 loop, best of 3: 240 ms per loop
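Note that with this index-pairing approach, items after the last separator are not returned, because the list of boundaries stops at the last separator. If trailing items should be kept, one way is to add the end of the list as a final boundary. This is a sketch; `split_by_indices` is a made-up name:

```python
def split_by_indices(x, sep):
    # Boundaries: 0, then the position just past each separator.
    bounds = [0] + [i + 1 for i, item in enumerate(x) if item == sep]
    # Add the end of the list so items after the last separator are kept.
    if bounds[-1] < len(x):
        bounds.append(len(x))
    return [x[first:last] for first, last in zip(bounds, bounds[1:])]

print(split_by_indices([1, 0, 2, 0, 3], 0))
# [[1, 0], [2, 0], [3]]
```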