Is there a way to detect words without searching for whitespace or underscores?

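I am trying to write a CLI for generating Python classes. Part of this requires validating the identifiers provided in the user input. For Python, that means making sure the identifiers conform to the PEP 8 best practices/standards for identifiers: classes in CapsCase, fields in all_lowercase_with_underscores, packages and modules, and so on and so forth.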

# It is easy to correct an identifier that contains
# underscores or whitespace, for example for a package name.

def package_correct_convention(item):
    return item.strip().lower().replace(" ","").replace("_","")

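But when there is no whitespace or underscore between the tokens, I am not sure how to correctly capitalize the first letter of each word in the identifier. Is it possible to achieve something like this without using AI or anything similar?

For example: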

import re

# Providing "classa" returns "Classa" rather than "ClassA", because there is no delimiter between "class" and "a".
def class_correct_convention(item):
    if " " in item or "_" in item:
        # Split on runs of spaces and/or underscores, whichever was used as the word delimiter.
        # (Picking only the more frequent delimiter broke when both counts were equal.)
        words = re.split(r"[ _]+", item)
        return "".join(word.title() for word in words)
    # If there is no delimiter, the best we can do is capitalize the first letter.
    # Slicing with [:1] avoids an IndexError on an empty string.
    return item[:1].upper() + item[1:]
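
For inputs that already carry capital letters, such as "ClassA" or "myClassName", one option that needs no dictionary is to treat each change of case as a word boundary. The following is only a minimal sketch of that idea: `split_on_case`, `to_class_name` and `to_field_name` are hypothetical helper names, and the regular expression assumes ASCII letters and digits. It does nothing for all-lowercase input such as "classa".

```python
import re

# Hypothetical helper pattern (an assumption, not part of the original code):
# a run of capitals not followed by a lowercase letter (an acronym),
# or an optionally capitalized lowercase word, or a run of digits.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

def split_on_case(identifier):
    """Split an identifier into words at case changes; delimiters are skipped."""
    return _WORD.findall(identifier)

def to_class_name(identifier):
    """Join the detected words in CapsCase."""
    return "".join(w[0].upper() + w[1:] for w in split_on_case(identifier))

def to_field_name(identifier):
    """Join the detected words in all_lowercase_with_underscores."""
    return "_".join(w.lower() for w in split_on_case(identifier))

print(split_on_case("myClassName"))
print(to_class_name("myClassName"))
print(to_field_name("ClassA"))
```

Because the pattern only matches letters and digits, spaces and underscores are dropped as a side effect, so the same helper also handles input like "my_class name".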

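Well, an AI-based approach would be difficult, imperfect, and a lot of work. If that is not worth it, there may be a simpler approach that is still reasonably effective.

I understand the worst case is something like "todelineatewordsinastringlikethat".

I suggest you download an English word list as a text file, one word per line, and then proceed as follows: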

import re

string = "todelineatewordsinastringlikethat" 

#with open("mydic.dat", "r") as msg:
#    lst = msg.read().splitlines()

lst = ['to','string','in'] #Let's say the dict contains 3 words

lst = sorted(lst, key=len, reverse = True)

replaced = []

for elem in lst:

    if elem in string: #Very fast
        replaced_str = " ".join(replaced) #Faster to check elem in a string than elem in a list
        capitalized = elem[0].upper()+elem[1:] #Prepare your capitalized word

        if elem not in replaced_str: #Check if elem could be a substring of something you replaced already
            string = re.sub(elem,capitalized,string) 

        else: #If elem is a substring of something you already replaced, protect that word
            protect_replaced = [item for item in replaced if elem in item] #Get the list of replaced items containing the substring elem

            for protect in protect_replaced: #Uppercase the whole word to protect, as we do a case sensitive re.sub()
                string = re.sub(protect,protect.upper(),string)

            string = re.sub(elem,capitalized,string)

            for protect in protect_replaced: #Deprotect by doing the reverse, full uppercase to capitalized
                string = re.sub(protect.upper(),protect,string)

        replaced.append(capitalized) #Append replaced element in the list
        
print(string)

Output:

TodelIneatewordsInaStringlikethat
#You can see that String has been protected but delIneate has not, because "delineate" was not in our dict.

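This is certainly not optimal, but it is a reasonable fit for a problem where you will not get AI-level results anyway (input preparation is very important in AI).

Note that sorting the word list in reverse order of length is important, because you want to detect the full words first, not their substrings. For example, in "beforehand" you want the full word, not "before" or "and".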
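
The longest-first idea can also be sketched more compactly with a single regular expression. This is not the code above but a swapped-in variant, assuming the word list contains plain lowercase words; `capitalize_known_words` is a hypothetical name. Regex alternation tries its alternatives from left to right, so putting the longest words first makes "beforehand" win over "before", and because matches never overlap, no separate protect/deprotect step is needed.

```python
import re

def capitalize_known_words(text, words):
    """Capitalize every dictionary word found in text, longest words first.

    Hypothetical helper (an assumption, not the original answer's code).
    """
    # Sort longest first so that a full word is tried before its substrings.
    ordered = sorted({w for w in words if w}, key=len, reverse=True)
    if not ordered:
        return text
    # re.escape keeps words containing regex metacharacters from being misread.
    pattern = re.compile("|".join(re.escape(w) for w in ordered))
    # Uppercase only the first letter of each match, as in the loop above.
    return pattern.sub(lambda m: m.group(0)[0].upper() + m.group(0)[1:], text)

print(capitalize_known_words("todelineatewordsinastringlikethat",
                             ["to", "string", "in"]))
# → TodelIneatewordsInaStringlikethat

print(capitalize_known_words("beforehandandbefore",
                             ["before", "and", "beforehand"]))
# → BeforehandAndBefore
```

As with the loop version, words missing from the dictionary are left untouched, so the quality of the result depends entirely on the word list.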