Is there a way to detect words without searching for whitespace or underscores

Question

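I am trying to write a CLI for generating Python classes. Part of this requires validating the identifiers provided in user input. For Python, that means making sure each identifier follows the PEP 8 naming conventions: classes in CapWords, fields in all_lowercase_with_underscores, packages and modules in their own style, and so on and so forth.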

# it is easy to correct an identifier that contains
# underscores or whitespace, including when correcting for a class

def package_correct_convention(item):
    return item.strip().lower().replace(" ", "").replace("_", "")

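But when there is no whitespace or underscore between the tokens, I am not sure how to correctly capitalize the first letter of each word in the identifier. Is it possible to achieve something like this without using AI or anything similar?

For example: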

import re

# providing "classa" returns "Classa", not "ClassA", because there is no delimiter between "class" and "a"
def class_correct_convention(item):
    if " " in item or "_" in item:
        # treat both spaces and underscores as word delimiters, so an input
        # with equal counts of each is still split into words
        words = [word for word in re.split(r"[ _]+", item) if word]
        return "".join(word.title() for word in words)
    # if there is no delimiter, the best we can do is capitalize the first letter
    return item[:1].upper() + item[1:]

Answer 1

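Well, an AI-based approach would be difficult, imperfect, and a lot of work. If that is not worth it, there may be a simpler approach that is still reasonably effective.

I understand the worst case to be something like "todelineatewordsinastringlikethat".

I suggest you download an English word list as a text file, one word per line, and proceed as follows: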

import re

string = "todelineatewordsinastringlikethat"

# with open("mydic.dat", "r") as msg:
#     lst = msg.read().splitlines()

lst = ['to', 'string', 'in']  # let's say the dictionary contains 3 words

# longest words first, so whole words are matched before their substrings
lst = sorted(lst, key=len, reverse=True)

replaced = []  # capitalized words already substituted into the string

for elem in lst:

    if elem in string:
        # checking elem against one joined string also catches substrings of replaced words
        replaced_str = " ".join(replaced)
        capitalized = elem[0].upper() + elem[1:]

        if elem in replaced_str:
            # elem is a substring of words replaced earlier: protect those words
            protect_replaced = [item for item in replaced if elem in item]

            # upper-case each protected word in full; re.sub() is case-sensitive,
            # so the lower-case elem can no longer match inside it
            for protect in protect_replaced:
                string = re.sub(protect, protect.upper(), string)

            string = re.sub(elem, capitalized, string)

            # undo the protection: full upper case back to capitalized
            for protect in protect_replaced:
                string = re.sub(protect.upper(), protect, string)
        else:
            string = re.sub(elem, capitalized, string)

        replaced.append(capitalized)

print(string)

Output:

TodelIneatewordsInaStringlikethat
# "String" was protected from the "in" replacement, but "delineate" was not, because it is not in our dictionary.
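
The one-word-per-line file mentioned earlier could be loaded along these lines. This is only a sketch: `load_words` and its `min_length` parameter are hypothetical names, not part of the answer's code, and dropping single-letter entries is an assumption made here to reduce false matches such as "a" inside longer words.

```python
from pathlib import Path
import tempfile


def load_words(path, min_length=2):
    """Read one word per line, lower-case it, and drop blanks and very short entries."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        # min_length is an assumed filter, not something the word list requires
        if len(word) >= min_length and word.isalpha():
            words.add(word)
    return words


# Demo with a throwaway file standing in for the downloaded word list
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "mydic.dat"
    demo.write_text("to\nString\n\nin\na\n", encoding="utf-8")
    print(sorted(load_words(demo)))  # ['in', 'string', 'to']
```

The resulting set can be passed to `sorted(..., key=len, reverse=True)` exactly as the hard-coded three-word list is.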

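This is certainly not optimal, but it is a reasonable option for a problem that would not get AI-quality results anyway (input preparation matters a great deal in AI).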

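Note that sorting the word list by length in descending order is important, because you want to detect complete words first, not their substrings. In beforehand, for example, you want the whole word, not before or and.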
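
The longest-first idea can also be sketched as a single left-to-right scan. This is a different technique from the replace-and-protect loop in this answer, not a rewrite of it: `capitalize_known_words` is a hypothetical helper, and because it never revisits characters it has already consumed, it needs no protection step.

```python
def capitalize_known_words(text, words):
    """Capitalize dictionary words found in text, trying the longest match first."""
    # Longest first, so "beforehand" is tried before "before" or "and".
    # Empty strings are skipped because they would match without advancing.
    by_length = sorted((w for w in words if w), key=len, reverse=True)
    pieces = []
    i = 0
    while i < len(text):
        for word in by_length:
            if text.startswith(word, i):
                pieces.append(word[0].upper() + word[1:])
                i += len(word)
                break
        else:
            # No dictionary word starts here: keep the character unchanged.
            pieces.append(text[i])
            i += 1
    return "".join(pieces)


sample_words = {"before", "beforehand", "and", "string", "to"}
print(capitalize_known_words("beforehandtostring", sample_words))  # BeforehandToString
```

Like the loop above, this still capitalizes dictionary words that happen to sit inside unknown words (the "in" in "delineate"), so the quality of the result depends on how complete the word list is.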

python

conventions

pep8