How can I make this function faster?

Question

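I'm new to Python and I'd like to make this function faster.

The function takes a string as its argument and returns a list of SEs (sound elements).

A 'sound element' (SE) is a maximal sequence of 1 or more consonants followed by 1 or more vowels:

first come all the consonants
then all the vowels (aeioujy)
all non-alphabetic characters such as spaces, digits, colons, commas, etc. must be ignored
all accents must be removed from accented letters (e.g. è -> e)
the difference between uppercase and lowercase letters is ignored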

NOTICE: the only exceptions are the first and the last SE of a verse, which may contain only vowels and only consonants, respectively.

Example:

If the verse is "Donàld Duck! wènt, to tHe seas'ìde to swim"

the SEs are ['do', 'na', 'lddu', 'ckwe', 'ntto', 'the', 'sea', 'si', 'de', 'to', 'swi', 'm']

def es_converter(string):
    vowels = ['a', 'e', 'i', 'o', 'u', 'y', 'j']
    li_es = []       # finished sound elements
    container = ''   # sound element currently being built

    # check every character in the string
    for i in range(len(string)):
        # is string[i] a consonant (not a vowel)?
        if string[i] not in vowels:
            # yes: add it to container
            container += string[i]
            # is it the last character of the string?
            if i == (len(string) - 1):
                # yes: append container to li_es and reset it
                li_es.append(container)
                container = ''
        # string[i] is a vowel: is the next character a vowel too?
        elif i < (len(string) - 1) and string[i+1] in vowels:
            # yes: add it to container and keep going
            container += string[i]
        else:
            # no: add it to container, append container to li_es, reset container
            container += string[i]
            li_es.append(container)
            container = ''
    return li_es

Thanks for any advice! (Unfortunately, I can't use any imports.)

Answer 1

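A significant source of inefficiency in your current code is that it indexes into the string all the time while iterating over it. Instead of: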

for i in range(len(data)):
    x = data[i]
    ...
    if data[i] == ...

you should always do:

for char in data:
    x = char
    ...
    if char == ...

If you do need the index at some point, use enumerate:

for i, char in enumerate(data):
    ...

and use the index only when you really need it.
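Applied to the function in the question, this advice might look like the sketch below. It iterates over the characters directly and uses no imports, since the question rules them out. The names `se_loop`, `ACCENTS`, `VOWELS` and `LETTERS` are illustrative and do not come from the original post, and the accent table is an assumption that covers only a handful of accented vowels; extend it for the text you actually have.

```python
# Assumption: only these accented letters occur in the input.
ACCENTS = str.maketrans("àáâäèéêëìíîïòóôöùúûü", "aaaaeeeeiiiioooouuuu")
VOWELS = frozenset("aeiouyj")
LETTERS = frozenset("abcdefghijklmnopqrstuvwxyz")


def se_loop(verse):
    """Split a verse into sound elements without indexing or imports."""
    elements = []
    current = []        # characters of the SE being built
    in_vowels = False   # True once the current SE has reached its vowel run
    for char in verse.lower().translate(ACCENTS):
        if char not in LETTERS:
            continue    # skip spaces, digits, punctuation, unknown letters
        is_vowel = char in VOWELS
        if in_vowels and not is_vowel:
            # a consonant after a vowel run closes the current SE
            elements.append("".join(current))
            current = []
        current.append(char)
        in_vowels = is_vowel
    if current:
        # whatever is left is the last SE (possibly consonants only)
        elements.append("".join(current))
    return elements


print(se_loop("Donàld Duck! wènt, to tHe seas'ìde to swim"))
# ['do', 'na', 'lddu', 'ckwe', 'ntto', 'the', 'sea', 'si', 'de', 'to', 'swi', 'm']
```

Collecting characters in a list and joining once per SE avoids repeated string concatenation, and the set lookups replace the list scan over the vowels.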

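That said, I would rather use a regex here. Without sample data I can't time it, but I'm confident it will be much faster than a Python loop.

The process is:

remove all non-alphabetic characters
make the string lowercase
strip the accents, which your current code doesn't do
split the string using a regex that describes your conditions.

So, you could do: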

import re
import unicodedata

# from 
def strip_accents(text):
    return  unicodedata.normalize('NFD', text)\
           .encode('ascii', 'ignore')\
           .decode("utf-8")

    

def se(data):
    # keep only letters: \W alone would let digits and underscores through
    data = re.sub(r'[\W\d_]+', '', data)
    # make lowercase
    data = data.casefold()
    # strip accents from the remaining data
    data = strip_accents(data)

    # creating the regex:
    #  - start of the string followed by vowels, or
    #  - consonants followed by vowels, or
    #  - consonants followed by end of string
    # the question counts 'j' and 'y' as vowels
    vowels = 'aeiouyj'
    se_regex = re.compile(rf'^[{vowels}]+|[^{vowels}]+[{vowels}]+|[^{vowels}]+$')

    # return the SEs
    return se_regex.findall(data)

Sample run (I added a vowel at the start of the string to test that case):

data = "A Donàld Duck! wènt, to tHe seas'ìde to swim"
print(se(data))
# ['a', 'do', 'na', 'lddu', 'ckwe', 'ntto', 'the', 'sea', 'si', 'de', 'to', 'swi', 'm']
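Since the speed comparison was not measured here, the following is one possible way to check it on your own data. It is only a sketch: `normalise`, `split_regex` and `split_loop` are hypothetical helper names, the sample text is synthetic, and the numbers will depend on your machine and input, so no result is claimed.

```python
import re
import timeit
import unicodedata

VOWELS = "aeiouyj"
SE_REGEX = re.compile(rf"^[{VOWELS}]+|[^{VOWELS}]+[{VOWELS}]+|[^{VOWELS}]+$")


def normalise(verse):
    """Lowercase, decompose accents, keep only the letters a-z."""
    text = unicodedata.normalize("NFD", verse.casefold())
    return re.sub(r"[^a-z]", "", text)


def split_regex(text):
    """Let the compiled pattern find every sound element."""
    return SE_REGEX.findall(text)


def split_loop(text):
    """Plain Python loop: cut whenever a consonant follows a vowel."""
    elements, start, in_vowels = [], 0, False
    for i, char in enumerate(text):
        is_vowel = char in VOWELS
        if in_vowels and not is_vowel:
            elements.append(text[start:i])
            start = i
        in_vowels = is_vowel
    if start < len(text):
        elements.append(text[start:])
    return elements


# Synthetic sample: the verse from the question repeated many times.
sample = "Donàld Duck! wènt, to tHe seas'ìde to swim. " * 1000
text = normalise(sample)

# Both splitters should agree before their timings are compared.
assert split_regex(text) == split_loop(text)

runs = 20
for func in (split_regex, split_loop):
    seconds = timeit.timeit(lambda: func(text), number=runs)
    print(f"{func.__name__}: {seconds / runs * 1000:.2f} ms per call")
```

Normalisation is kept outside the timed calls so that only the splitting step is compared; time `normalise` separately if the whole pipeline matters.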


Tags: python, performance, list, sublist