Is there a way to detect words without searching for whitespace or underscores?
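I'm trying to write a CLI that generates Python classes. Part of this requires validating the identifiers supplied in user input; for Python, that means making sure they follow the PEP 8 conventions for identifiers: classes with CapsCase, fields with all_lowercase_with_underscores, packages and modules, and so on and so forth.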
# It is easy to correct an identifier that contains underscores
# or whitespace, e.g. when normalizing a package name.
def package_correct_convention(item):
    return item.strip().lower().replace(" ", "").replace("_", "")
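But when there is no whitespace or underscore between the tokens, I'm not sure how to correctly capitalize the first letter of each word in the identifier. Is it possible to achieve something like this without using AI or anything similar?
For example: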
# providing "ClassA" returns "classa" because there is no delimiter between "class" and "a"
def class_correct_convention(item):
    if item.count(" ") or item.count("_"):
        # Check whether space or underscore was used as the word delimiter.
        if item.count(" ") > item.count("_"):
            item = item.split(" ")
        elif item.count(" ") < item.count("_"):
            item = item.split("_")
        item = list(map(lambda x: x.title(), item))
        return ("".join(item)).replace("_", "").replace(" ", "")
    # If there is no delimiter, the best we can do is capitalize the first letter.
    return item[0].upper() + item[1:]
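For the case where delimiters are present, a minimal sketch of one possible simplification follows. It assumes that whitespace and underscores should be treated the same way and that capitals already present in the input should be kept; `to_caps_words` is a hypothetical name, not something from the question.

```python
import re

def to_caps_words(identifier):
    """Hypothetical helper: join delimiter-separated pieces in CapsCase."""
    # Split on any run of whitespace or underscores and drop empty pieces.
    pieces = [p for p in re.split(r"[\s_]+", identifier.strip()) if p]
    # Upper-case only the first letter of each piece so that capitals
    # already present in the input (e.g. "ClassA") are preserved.
    return "".join(p[0].upper() + p[1:] for p in pieces)

print(to_caps_words("my class_name"))  # MyClassName
print(to_caps_words("ClassA"))         # ClassA
```

This does nothing for input with no delimiters at all, which is the harder part of the question.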
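Well, an AI-based approach would be difficult, imperfect, and a lot of work. If it isn't worth that, there may be a simpler approach that is still reasonably effective.
I understand the worst case is a string like "todelineatewordsinastringlikethat".
I suggest downloading an English word list as a text file, one word per line, and then proceeding as follows: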
import re

string = "todelineatewordsinastringlikethat"
# with open("mydic.dat", "r") as msg:
#     lst = msg.read().splitlines()
lst = ['to', 'string', 'in']  # Let's say the dictionary contains 3 words
lst = sorted(lst, key=len, reverse=True)  # Longest words first
replaced = []
for elem in lst:
    if elem in string:  # Very fast
        replaced_str = " ".join(replaced)  # Faster to check elem in a string than elem in a list
        capitalized = elem[0].upper() + elem[1:]  # Prepare the capitalized word
        if elem not in replaced_str:  # elem is not a substring of anything replaced already
            string = re.sub(elem, capitalized, string)
        elif elem in replaced_str:  # elem is a substring of something replaced, so protect it
            # Get the list of replaced items containing the substring elem
            protect_replaced = [item for item in replaced if elem in item]
            # Uppercase the whole word to protect it, since re.sub() is case sensitive
            for protect in protect_replaced:
                string = re.sub(protect, protect.upper(), string)
            string = re.sub(elem, capitalized, string)
            # Unprotect by doing the reverse: full uppercase back to capitalized
            for protect in protect_replaced:
                string = re.sub(protect.upper(), protect, string)
        replaced.append(capitalized)  # Remember the replaced element
print(string)
Output:
TodelIneatewordsInaStringlikethat
# You can see that "String" was protected but "delIneate" was not, because "delineate" was not in our dictionary.
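This is certainly not optimal, but it is a reasonable fit for a problem that would never perform the way an AI approach could anyway (input preparation matters a great deal in AI).
Note that sorting the word list in reverse order of length is important, because you want to detect the full word first, not its substrings. For example, in "beforehand" you want the complete word, not "before" or "and".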
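To illustrate why longest-first ordering matters, here is a minimal sketch of a related but different technique: a single left-to-right greedy longest-match scan, rather than the repeated `re.sub()` passes above. The function name `capitalize_known_words` and the tiny word lists are assumptions made for illustration only.

```python
def capitalize_known_words(text, words):
    """Hypothetical helper: greedy longest-match capitalization."""
    # Try longer words first so that "beforehand" wins over "before".
    by_length = sorted(set(words), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        # Longest dictionary word starting at position i, if any.
        match = next((w for w in by_length if w and text.startswith(w, i)), None)
        if match:
            out.append(match[0].upper() + match[1:])
            i += len(match)
        else:
            # No known word starts here: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)

print(capitalize_known_words("beforehand", ["before", "and", "beforehand"]))
# Beforehand
print(capitalize_known_words("todelineatewordsinastringlikethat", ["to", "string", "in"]))
# TodelIneatewordsInaStringlikethat
```

Because each position is consumed exactly once, a word that was already matched cannot be modified again by a shorter word, so no protect and unprotect step is needed. Like the code above, it still capitalizes "in" inside "delineate" when the longer word is missing from the dictionary.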