How to split a string with multiple delimiters in Python?
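I want to split a string by removing everything except alphabetic characters.
By default, split only splits on the whitespace between words, but I want to split on everything other than alphabetic characters. How can I give split multiple delimiters?
For example: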
word1 = input().lower().split()
# If you input " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
# the result will be ['has', '15', 'science@and^engineering--departments,', 'affiliated', 'centers,', 'bandar', 'abbas&&and', 'mahshahr.']
But I am looking for a result like this:
['has', '15', 'science', 'and', 'engineering', 'departments', 'affiliated', 'centers', 'bandar', 'abbas', 'and', 'mahshahr']
For performance, you should use a regular expression, as in the marked duplicate. See the benchmarks below.
groupby + str.isalnum
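You can use itertools.groupby with str.isalnum to group by alphanumeric characters.
With this solution, you don't have to worry about splitting on explicitly specified characters.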
from itertools import groupby
x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
res = [''.join(j) for i, j in groupby(x, key=str.isalnum) if i]
print(res)
['has', '15', 'science', 'and', 'engineering', 'departments',
'affiliated', 'centers', 'Bandar', 'Abbas', 'and', 'Mahshahr']
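The output above keeps the original capitalization, while the question lowercases the input first. A minimal sketch that wraps the same groupby idea in a function and lowercases the text (the name split_alnum is made up for illustration, it is not from the original answer):

```python
from itertools import groupby


def split_alnum(text):
    """Return the lowercase alphanumeric runs found in text (hypothetical helper)."""
    # groupby yields (key, group) pairs; the key is True for runs of
    # alphanumeric characters and False for runs of everything else.
    return [
        ''.join(group)
        for is_alnum, group in groupby(text.lower(), key=str.isalnum)
        if is_alnum
    ]


sample = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
words = split_alnum(sample)
print(words)
```

Lowercasing before grouping mirrors the input().lower() call in the question, so the result should match the list the asker is looking for.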
Benchmarking vs regex
Some performance benchmarks against the regex solutions (tested on Python 3.6.5):
from itertools import groupby
import re
x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
z = x*10000
%timeit [''.join(j) for i, j in groupby(z, key=str.isalnum) if i] # 184 ms
%timeit list(filter(None, re.sub(r'\W+', ',', z).split(','))) # 82.1 ms
%timeit list(filter(None, re.split(r'\W+', z))) # 63.6 ms
%timeit [_ for _ in re.split(r'\W', z) if _] # 62.9 ms
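%timeit only works in IPython or Jupyter. If you want to rerun the comparison as a plain script, something along these lines should do, using the standard timeit module. The function names and the repeat counts are my own choices, and the input is smaller than in the benchmark above, so the numbers will not be comparable to the ones quoted:

```python
import re
import timeit
from itertools import groupby

x = " has 15 science@and^engineering--departments, affiliated centers, Bandar Abbas&&and Mahshahr."
z = x * 1000  # smaller than the x*10000 used above, to keep the run short


def with_groupby():
    # Group consecutive characters by whether they are alphanumeric.
    return [''.join(j) for i, j in groupby(z, key=str.isalnum) if i]


def with_re_split():
    # Split on runs of non-word characters and drop empty strings.
    return [w for w in re.split(r'\W+', z) if w]


# Both approaches give the same tokens for this ASCII sample.
same = with_groupby() == with_re_split()

# Best of 3 repeats, 5 calls each; absolute values depend on the machine.
timings = {
    func.__name__: min(timeit.repeat(func, number=5, repeat=3))
    for func in (with_groupby, with_re_split)
}

print(same)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 5 calls")
```

Note that the two are only guaranteed to agree on input like this one: \W treats the underscore as a word character, while str.isalnum does not.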
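You can replace every run of non-alphanumeric characters with a single character (I'm using a comma):
import re
s = 'has15science@and^engineering--departments,affiliatedcenters,bandarabbas&&andmahshahr.'
alphanumeric = re.sub(r'\W+', ',', s)
Then split on the comma: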
splitted = alphanumeric.split(',')
Edit:
As @DeepSpace suggested, this can be done in a single statement:
splitted = re.split(r'\W+', s)
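One thing to watch for: when the string starts or ends with a delimiter, re.split leaves an empty string at that end. A small sketch of two ways around it, filtering the result or matching the words directly with re.findall. The letters-only pattern at the end is my own addition, for the case where digits should be dropped too, as the question's wording suggests:

```python
import re

s = 'has15science@and^engineering--departments,affiliatedcenters,bandarabbas&&andmahshahr.'

# s ends with '.', so the split result has a trailing empty string.
split_tokens = re.split(r'\W+', s)

# Option 1: filter the empty strings out after splitting.
filtered = [token for token in split_tokens if token]

# Option 2: match runs of word characters instead of splitting on the rest.
found = re.findall(r'\w+', s)

# Assumption: if only ASCII letters should count, match those explicitly.
letters_only = re.findall(r'[A-Za-z]+', s)

print(split_tokens)
print(filtered)
print(found)
print(letters_only)
```

With the letters-only pattern, 'has15science' comes back as 'has' and 'science', because the digits now act as delimiters.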