How to do CamelCase split in python
What I am trying to achieve is something like this:
>>> camel_case_split("CamelCaseXYZ")
['Camel', 'Case', 'XYZ']
>>> camel_case_split("XYZCamelCase")
['XYZ', 'Camel', 'Case']
So I searched and found this perfect regular expression:
(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])
As the next logical step I tried:
>>> re.split("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['CamelCaseXYZ']
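Why does this not work, and how do I achieve the result from the linked question in Python?
Edit: Solution summary
I tested all provided solutions with a few test cases: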
string: ''
AplusKminus: ['']
casimir_et_hippolyte: []
two_hundred_success: []
kalefranz: string index out of range # with modification: either [] or ['']
string: ' '
AplusKminus: [' ']
casimir_et_hippolyte: []
two_hundred_success: [' ']
kalefranz: [' ']
string: 'lower'
all algorithms: ['lower']
string: 'UPPER'
all algorithms: ['UPPER']
string: 'Initial'
all algorithms: ['Initial']
string: 'dromedaryCase'
AplusKminus: ['dromedary', 'Case']
casimir_et_hippolyte: ['dromedary', 'Case']
two_hundred_success: ['dromedary', 'Case']
kalefranz: ['Dromedary', 'Case'] # with modification: ['dromedary', 'Case']
string: 'CamelCase'
all algorithms: ['Camel', 'Case']
string: 'ABCWordDEF'
AplusKminus: ['ABC', 'Word', 'DEF']
casimir_et_hippolyte: ['ABC', 'Word', 'DEF']
two_hundred_success: ['ABC', 'Word', 'DEF']
kalefranz: ['ABCWord', 'DEF']
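In summary, you could say that the solution by @kalefranz does not match the question (see the last case), and the solution by @casimir et hippolyte eats a single space, thereby violating the idea that a split should not change the individual parts. The only difference between the remaining two alternatives is that my solution returns a list containing the empty string on empty-string input, while the solution by @200_success returns an empty list.
I don't know how the Python community stands on that issue, so I say: I am fine with either one. Since 200_success's solution is simpler, I accepted it as the correct answer.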
The documentation of Python's re.split says:
Note that split will never split a string on an empty pattern match.
When seeing this:
>>> re.findall("(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", "CamelCaseXYZ")
['', '']
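it becomes obvious why the split does not work as expected. The re module finds empty matches, just as the regular expression intends.
Since the documentation states that this is not a bug but intended behavior, you have to work around it when trying to create a camel case split: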
from re import finditer

def camel_case_split(identifier):
    matches = finditer('(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])', identifier)
    split_string = []
    # index of beginning of slice
    previous = 0
    for match in matches:
        # get slice
        split_string.append(identifier[previous:match.start()])
        # advance index
        previous = match.start()
    # get remaining string
    split_string.append(identifier[previous:])
    return split_string
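For reference, the documentation note quoted above applies to Python versions before 3.7; from 3.7 onward, re.split also splits on empty matches. A minimal sketch, assuming the same boundary pattern, that uses re.split where available and falls back to manual slicing otherwise (the names BOUNDARY and split_on_boundaries are made up for illustration):

```python
import re
import sys

# Same zero-width boundary pattern as in the question.
BOUNDARY = re.compile(r'(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])')

def split_on_boundaries(identifier):
    """Split identifier at every position matched by BOUNDARY."""
    if sys.version_info >= (3, 7):
        # Python 3.7+ splits on empty matches directly.
        return BOUNDARY.split(identifier)
    # Older versions: collect the match positions and slice manually.
    cuts = [m.start() for m in BOUNDARY.finditer(identifier)]
    starts = [0] + cuts
    ends = cuts + [len(identifier)]
    return [identifier[a:b] for a, b in zip(starts, ends)]

print(split_on_boundaries("CamelCaseXYZ"))  # ['Camel', 'Case', 'XYZ']
print(split_on_boundaries("XYZCamelCase"))  # ['XYZ', 'Camel', 'Case']
```

Like the finditer loop above, this returns [''] for an empty input string.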
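Here's another solution that requires less code and no complicated regular expressions:
def camel_case_split(string):
    # list of character lists, one per word; the first character is upper-cased
    bldrs = [[string[0].upper()]]
    for c in string[1:]:
        if bldrs[-1][-1].islower() and c.isupper():
            bldrs.append([c])
        else:
            bldrs[-1].append(c)
    return [''.join(bldr) for bldr in bldrs]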
Edit
The above code contains an optimization that avoids rebuilding the entire string with every appended character. Leaving out that optimization, a simpler version (with comments) might look like
def camel_case_split2(string):
    # set the logic for creating a "break"
    def is_transition(c1, c2):
        return c1.islower() and c2.isupper()

    # start the builder list with the first character
    # enforce upper case
    bldr = [string[0].upper()]
    for c in string[1:]:
        # get the last character in the last element in the builder
        # note that strings can be addressed just like lists
        previous_character = bldr[-1][-1]
        if is_transition(previous_character, c):
            # start a new element in the list
            bldr.append(c)
        else:
            # append the character to the last string
            bldr[-1] += c
    return bldr
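The summary in the question mentions a modified variant of this answer that returns [] or [''] for empty input and keeps the case of the first character, but the modification itself is not shown. One possible reading, as a sketch (the name camel_case_split_modified and the choice of [] for empty input are assumptions):

```python
def camel_case_split_modified(string):
    # Assumption: empty input yields an empty list instead of an IndexError.
    if not string:
        return []
    # Keep the first character as-is instead of upper-casing it.
    parts = [string[0]]
    for c in string[1:]:
        if parts[-1][-1].islower() and c.isupper():
            # lower-to-upper transition: start a new word
            parts.append(c)
        else:
            parts[-1] += c
    return parts

print(camel_case_split_modified('dromedaryCase'))  # ['dromedary', 'Case']
print(camel_case_split_modified('ABCWordDEF'))     # ['ABCWord', 'DEF']
```

The second call shows that this variant still does not separate an acronym from a following capitalized word, which is the mismatch the question's summary points out.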
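As @AplusKminus has explained, re.split() never splits on an empty pattern match. Therefore, instead of splitting, you should try finding the components you are interested in.
Here is a solution using re.finditer() that emulates splitting:
from re import finditer

def camel_case_split(identifier):
    matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]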
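The test cases from the question's summary can be replayed against this function with a small harness. This is only a sketch: the EXPECTED table copies the results listed for two_hundred_success in the question, and the helper name check is made up for illustration.

```python
from re import finditer

def camel_case_split(identifier):
    matches = finditer('.+?(?:(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])|$)', identifier)
    return [m.group(0) for m in matches]

# Expected results as listed for two_hundred_success in the question.
EXPECTED = {
    '': [],
    ' ': [' '],
    'lower': ['lower'],
    'UPPER': ['UPPER'],
    'Initial': ['Initial'],
    'dromedaryCase': ['dromedary', 'Case'],
    'CamelCase': ['Camel', 'Case'],
    'ABCWordDEF': ['ABC', 'Word', 'DEF'],
}

def check(split_func, expected=EXPECTED):
    """Return the inputs for which split_func disagrees with expected."""
    return [s for s, want in expected.items() if split_func(s) != want]

# An empty list means every case matched the table.
print(check(camel_case_split))
```

Any other candidate with the same signature can be passed to check in the same way.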
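Most of the time, when you don't need to check the format of a string, a global search is simpler than a split (for the same result):
re.findall(r'[A-Z](?:[a-z]+|[A-Z]*(?=[A-Z]|$))', 'CamelCaseXYZ')
returns
['Camel', 'Case', 'XYZ']
To deal with dromedary case too, you can use:
re.findall(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)', 'camelCaseXYZ')
Note: (?=[A-Z]|$) can be shortened using a double negation (a negative lookahead with a negated character class): (?![^A-Z])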
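To see the double-negation shortcut in action, the two forms can be compared side by side. A small sketch (the constant names and the sample list are chosen here for illustration):

```python
import re

# Original form: next character is a capital, or we are at the end.
WITH_LOOKAHEAD = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?=[A-Z]|$)')
# Shortened form: next character is not a non-capital.
WITH_DOUBLE_NEGATION = re.compile(r'[A-Z]?[a-z]+|[A-Z]+(?![^A-Z])')

samples = ['camelCaseXYZ', 'CamelCaseXYZ', 'XYZCamelCase', 'ABCWordDEF']
for s in samples:
    a = WITH_LOOKAHEAD.findall(s)
    b = WITH_DOUBLE_NEGATION.findall(s)
    # Third column is True when both patterns agree on this sample.
    print(s, a, a == b)
```

One caveat: the two forms can differ when the string ends with a newline, because $ also matches just before a trailing newline while (?![^A-Z]) treats the newline as an ordinary non-capital character.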
Using re.sub() and split()
import re
name = 'CamelCaseTest123'
# each pass inserts a space before the captured group (backreference \1)
splitted = re.sub('([A-Z][a-z]+)', r' \1', re.sub('([A-Z]+)', r' \1', name)).split()
Result
'CamelCaseTest123' -> ['Camel', 'Case', 'Test123']
'CamelCaseXYZ' -> ['Camel', 'Case', 'XYZ']
'XYZCamelCase' -> ['XYZ', 'Camel', 'Case']
'XYZ' -> ['XYZ']
'IPAddress' -> ['IP', 'Address']
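The two substitution passes are easier to follow when written out one after the other. A sketch wrapped in a function (the name split_with_sub is made up here), where each pass inserts a space before the captured group via the backreference \1:

```python
import re

def split_with_sub(name):
    # Pass 1: put a space before every run of capitals.
    spaced = re.sub(r'([A-Z]+)', r' \1', name)
    # Pass 2: put a space before every capitalized word, which detaches
    # the last capital of an acronym from the acronym itself.
    spaced = re.sub(r'([A-Z][a-z]+)', r' \1', spaced)
    # split() without arguments also discards the extra spaces.
    return spaced.split()

for name in ['CamelCaseTest123', 'CamelCaseXYZ', 'XYZCamelCase', 'XYZ', 'IPAddress']:
    print(repr(name), '->', split_with_sub(name))
```

Because the final step is a whitespace split, any spaces already present in the input are treated as separators as well.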
I just stumbled upon this case and wrote a regular expression to solve it. It should actually work for any group of words.
RE_WORDS = re.compile(r'''
# Find words in a string. Order matters!
[A-Z]+(?=[A-Z][a-z]) | # All upper case before a capitalized word
[A-Z]?[a-z]+ | # Capitalized words / all lower case
[A-Z]+ | # All upper case
\d+ # Numbers
''', re.VERBOSE)
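The key here is the lookahead in the first alternative. It will match (and preserve) all-uppercase words that come before a capitalized word: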
assert RE_WORDS.findall('FOOBar') == ['FOO', 'Bar']
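As a further illustration of why the order of the alternatives matters, here is the same pattern applied to an identifier that mixes an acronym, capitalized words and a digit. The sample identifier is made up for this sketch:

```python
import re

RE_WORDS = re.compile(r'''
    # Find words in a string. Order matters!
    [A-Z]+(?=[A-Z][a-z]) |  # All upper case before a capitalized word
    [A-Z]?[a-z]+ |          # Capitalized words / all lower case
    [A-Z]+ |                # All upper case
    \d+                     # Numbers
''', re.VERBOSE)

# 'HTTP' is taken by the first alternative, which stops before 'Re';
# 'Response' and 'Code' by the second; '2' by the last.
print(RE_WORDS.findall('HTTPResponse2Code'))  # ['HTTP', 'Response', '2', 'Code']
```

If the third alternative came first, the run of capitals would swallow the 'R' of 'Response'.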
I think the following is the most optimal:
import re

def count_word():
    return re.findall('[A-Z]?[a-z]+', input('Please enter your string: '))

print(count_word())
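One thing to keep in mind with this pattern is that every match must contain at least one lowercase letter, so runs of capitals are dropped rather than returned. A quick check against the inputs from the question (the function name words_with_lowercase is made up for this sketch):

```python
import re

def words_with_lowercase(text):
    # Optional capital followed by one or more lowercase letters.
    return re.findall('[A-Z]?[a-z]+', text)

print(words_with_lowercase('CamelCase'))     # ['Camel', 'Case']
print(words_with_lowercase('CamelCaseXYZ'))  # ['Camel', 'Case']
print(words_with_lowercase('XYZCamelCase'))  # ['Camel', 'Case']
```

So it covers plain CamelCase, but not the 'XYZ' cases asked about in the question.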
I know that the question is tagged regex. Still, I always try to stay as far away from regex as possible. So, here is my solution without regex:
from functools import reduce

def split_camel(text, char):
    if len(text) <= 1:  # To avoid adding a wrong space in the beginning
        return text + char
    if char.isupper() and text[-1].islower():  # Regular Camel case
        return text + " " + char
    elif text[-1].isupper() and char.islower() and text[-2] != " ":  # Detect Camel case in case of abbreviations
        return text[:-1] + " " + text[-1] + char
    else:  # Do nothing part
        return text + char

text = "PathURLFinder"
text = reduce(split_camel, text, "")
print(text)
# prints "Path URL Finder"
print(text.split(" "))
# prints "['Path', 'URL', 'Finder']"
EDIT:
As suggested, here is the code with the functionality in a single function.
from functools import reduce

def split_camel(text):
    def splitter(text, char):
        if len(text) <= 1:  # To avoid adding a wrong space in the beginning
            return text + char
        if char.isupper() and text[-1].islower():  # Regular Camel case
            return text + " " + char
        elif text[-1].isupper() and char.islower() and text[-2] != " ":  # Detect Camel case in case of abbreviations
            return text[:-1] + " " + text[-1] + char
        else:  # Do nothing part
            return text + char
    converted_text = reduce(splitter, text, "")
    return converted_text.split(" ")

split_camel("PathURLFinder")
# returns ['Path', 'URL', 'Finder']
Putting forth a more comprehensive approach. It takes care of several issues such as numbers, strings starting with a lowercase letter, single-letter words, etc.
import re

def camel_case_split(identifier, remove_single_letter_words=False):
    """Parses CamelCase and Snake naming"""
    concat_words = re.split('[^a-zA-Z]+', identifier)

    def camel_case_split(string):
        bldrs = [[string[0].upper()]]
        string = string[1:]
        for idx, c in enumerate(string):
            if bldrs[-1][-1].islower() and c.isupper():
                bldrs.append([c])
            elif c.isupper() and (idx+1) < len(string) and string[idx+1].islower():
                bldrs.append([c])
            else:
                bldrs[-1].append(c)
        words = [''.join(bldr) for bldr in bldrs]
        words = [word.lower() for word in words]
        return words

    words = []
    for word in concat_words:
        if len(word) > 0:
            words.extend(camel_case_split(word))
    if remove_single_letter_words:
        subset_words = []
        for word in words:
            if len(word) > 1:
                subset_words.append(word)
        if len(subset_words) > 0:
            words = subset_words
    return words
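My requirements were a bit more specific than the OP's. In particular, in addition to handling all of the OP's cases, I needed the following, which the other solutions do not provide:
- treat all non-alphanumeric input (e.g. !@#$%^&*() etc.) as a word separator
- handle digits as follows:
  - cannot be in the middle of a word
  - cannot be at the beginning of a word unless the phrase starts with a digit
import re

def splitWords(s):
    new_s = re.sub(r'[^a-zA-Z0-9]', ' ',                   # not alphanumeric
        re.sub(r'([0-9]+)([^0-9])', r'\1 \2',              # digit followed by non-digit
            re.sub(r'([a-z])([A-Z])', r'\1 \2',            # lower case followed by upper case
                re.sub(r'([A-Z])([A-Z][a-z])', r'\1 \2',   # upper case followed by upper case followed by lower case
                    s
                )
            )
        )
    )
    return [x for x in new_s.split(' ') if x]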
Output:
for test in ['', ' ', 'lower', 'UPPER', 'Initial', 'dromedaryCase', 'CamelCase', 'ABCWordDEF', 'CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf']:
    print(test + ':' + str(splitWords(test)))
:[]
 :[]
lower:['lower']
UPPER:['UPPER']
Initial:['Initial']
dromedaryCase:['dromedary', 'Case']
CamelCase:['Camel', 'Case']
ABCWordDEF:['ABC', 'Word', 'DEF']
CamelCaseXYZand123.how23^ar23e you doing AndABC123XYZdf:['Camel', 'Case', 'XY', 'Zand123', 'how23', 'ar23', 'e', 'you', 'doing', 'And', 'ABC123', 'XY', 'Zdf']
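Working solution, without regexp
I am not that good at regular expressions. I like to use them for search/replace in my IDE, but I try to avoid them in programs.
Here is a quite straightforward solution in pure Python: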
def camel_case_split(s):
    idx = list(map(str.isupper, s))
    # mark change of case
    l = [0]
    for (i, (x, y)) in enumerate(zip(idx, idx[1:])):
        if x and not y:  # "Ul"
            l.append(i)
        elif not x and y:  # "lU"
            l.append(i+1)
    l.append(len(s))
    # for "lUl", index of "U" will pop twice, have to filter that
    return [s[x:y] for x, y in zip(l, l[1:]) if x < y]
And some tests
def test():
    TESTS = [
        ("aCamelCaseWordT", ['a', 'Camel', 'Case', 'Word', 'T']),
        ("CamelCaseWordT", ['Camel', 'Case', 'Word', 'T']),
        ("CamelCaseWordTa", ['Camel', 'Case', 'Word', 'Ta']),
        ("aCamelCaseWordTa", ['a', 'Camel', 'Case', 'Word', 'Ta']),
        ("Ta", ['Ta']),
        ("aT", ['a', 'T']),
        ("a", ['a']),
        ("T", ['T']),
        ("", []),
        ("XYZCamelCase", ['XYZ', 'Camel', 'Case']),
        ("CamelCaseXYZ", ['Camel', 'Case', 'XYZ']),
        ("CamelCaseXYZa", ['Camel', 'Case', 'XY', 'Za']),
    ]
    for (q, a) in TESTS:
        assert camel_case_split(q) == a

if __name__ == "__main__":
    test()
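The comment about an index appearing twice is easier to see when the cut positions are printed. The following sketch repeats the index-collection loop on its own (the name case_change_indices is made up here) and shows the list of cut points for one input:

```python
def case_change_indices(s):
    """Return the cut positions computed by the index-based approach."""
    upper = [c.isupper() for c in s]
    cuts = [0]
    for i, (x, y) in enumerate(zip(upper, upper[1:])):
        if x and not y:      # upper followed by lower: cut before the upper
            cuts.append(i)
        elif not x and y:    # lower followed by upper: cut before the upper
            cuts.append(i + 1)
    cuts.append(len(s))
    return cuts

word = "CamelCaseXYZ"
cuts = case_change_indices(word)
print(cuts)  # [0, 0, 5, 5, 9, 12]
# Equal neighbours such as (0, 0) and (5, 5) give empty slices and are skipped.
print([word[a:b] for a, b in zip(cuts, cuts[1:]) if a < b])  # ['Camel', 'Case', 'XYZ']
```

The duplicates arise because a capital between two lowercase letters is reported once by each branch; the a < b filter removes the resulting empty slices.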
This solution also supports numbers and spaces, and automatically removes underscores:
import re

def camel_terms(value):
    return re.findall('[A-Z][a-z]+|[0-9A-Z]+(?=[A-Z][a-z])|[0-9A-Z]{2,}|[a-z0-9]{2,}|[a-zA-Z0-9]', value)
Some tests:
tests = [
    "XYZCamelCase",
    "CamelCaseXYZ",
    "Camel_CaseXYZ",
    "3DCamelCase",
    "Camel5Case",
    "Camel5Case5D",
    "Camel Case XYZ"
]
for test in tests:
    print(test, "=>", camel_terms(test))
Results:
XYZCamelCase => ['XYZ', 'Camel', 'Case']
CamelCaseXYZ => ['Camel', 'Case', 'XYZ']
Camel_CaseXYZ => ['Camel', 'Case', 'XYZ']
3DCamelCase => ['3D', 'Camel', 'Case']
Camel5Case => ['Camel', '5', 'Case']
Camel5Case5D => ['Camel', '5', 'Case', '5D']
Camel Case XYZ => ['Camel', 'Case', 'XYZ']
Simple solution:
re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))
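Note that this substitution returns a single spaced-out string rather than a list, and it only looks at lowercase-or-digit followed by a capital. A short sketch, assuming the replacement is meant to keep both captured groups with a space between them (the function name space_out is made up here):

```python
import re

def space_out(text):
    # Insert a space between a lowercase letter or digit and a following capital.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", str(text))

print(space_out("camelCaseXYZ"))          # camel Case XYZ
print(space_out("camelCaseXYZ").split())  # ['camel', 'Case', 'XYZ']
print(space_out("XYZCamelCase").split())  # ['XYZCamel', 'Case']
```

The last line shows that a leading acronym stays attached to the following word with this pattern.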
import re
# needs Python 3.7+, where re.split() also splits on empty matches
re.split('(?<=[a-z])(?=[A-Z])', 'camelCamelCAMEL')
# ['camel', 'Camel', 'CAMEL'] <-- result
# '(?<=[a-z])' --> means preceding lowercase char (group A)
# '(?=[A-Z])' --> means following UPPERCASE char (group B)
# '(group A)(group B)' --> 'aA' or 'aB' or 'bA' and so on