Tokenize on whitespace and punctuation, keeping the punctuation
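I'm looking for a way to tokenize or split a string on whitespace or punctuation. Only the punctuation must be kept in the result; the whitespace should be dropped. It will be used to identify the language (Python, Java, HTML, C, ...).
The input string could be: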
class Foldermanagement():
    def __init__(self):
        self.today = invoicemng.gettoday()
...
The output I expect is a list of tokens, like this:
['class', 'Foldermanagement', '(', ')', ':', 'def', '_', '_', 'init', ... ,'self', '.', 'today', '=', ...]
Any solution is welcome. Thanks in advance.
I think this is what you're looking for:
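import itertools
import re
import string

text = """
class Foldermanagement():
    def __init__(self):
        self.today = invoicemng.gettoday()
"""

# Every punctuation or whitespace character acts as a separator.
separators = string.punctuation + string.whitespace
separators_re = "|".join(re.escape(x) for x in separators)

# re.split gives the text between separators, re.findall the separators
# themselves; interleave the two. zip_longest (rather than zip) keeps the
# last piece when the text does not end with a separator.
tokens = itertools.zip_longest(re.split(separators_re, text),
                               re.findall(separators_re, text), fillvalue="")
flattened = itertools.chain.from_iterable(tokens)
# Drop empty strings and whitespace, keep words and punctuation.
cleaned = [x for x in flattened if x and not x.isspace()]
# ['class', 'Foldermanagement', '(', ')', ':', 'def', '_', '_',
# 'init', '_', '_', '(', 'self', ')', ':', 'self', '.', 'today', '=',
# 'invoicemng', '.', 'gettoday', '(', ')']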
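The same result can probably be had with a single re.findall call. This is only a sketch of an alternative, not the code from the answer above, and the names TOKEN_RE and tokenize are made up for illustration. It assumes "punctuation" means exactly the characters in string.punctuation, so the underscore is split off as its own token, as in the expected output:

```python
import re
import string

# Escape the punctuation so it can be placed safely inside a character class.
PUNCT = re.escape(string.punctuation)

# A token is either one punctuation character, or a run of characters
# that are neither whitespace nor punctuation.
TOKEN_RE = re.compile(rf"[{PUNCT}]|[^\s{PUNCT}]+")


def tokenize(text):
    """Return words and single punctuation characters; whitespace is dropped."""
    return TOKEN_RE.findall(text)


print(tokenize("self.today = invoicemng.gettoday()"))
# ['self', '.', 'today', '=', 'invoicemng', '.', 'gettoday', '(', ')']
```

Because findall only returns what matches, there is nothing to interleave or filter afterwards, and a final word with no separator after it is kept.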