Extract all `INDENT` tokens using Python's tokenize

I am trying to tokenize Python code using Python's tokenize library. For the sample input:

def cal_cone_curved_surf_area(slant_height,radius):\n\tpi=3.14\n\treturn pi*radius*slant_height\n\n

I am using the following code to get all the tokens (here p is the sample input string):

import io
import tokenize
text = tokenize.generate_tokens(io.StringIO(p).readline)
[tok for tok in text]

Running the snippet gives the following output:

[TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
TokenInfo(type=1 (NAME), string='cal_cone_curved_surf_area', start=(1, 4), end=(1, 29), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
TokenInfo(type=53 (OP), string='(', start=(1, 29), end=(1, 30), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=1 (NAME), string='slant_height', start=(1, 30), end=(1, 42), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=53 (OP), string=',', start=(1, 42), end=(1, 43), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=1 (NAME), string='radius', start=(1, 43), end=(1, 49), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=53 (OP), string=')', start=(1, 49), end=(1, 50), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=53 (OP), string=':', start=(1, 50), end=(1, 51), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 51), end=(1, 52), line='def cal_cone_curved_surf_area(slant_height,radius):\n'),
 TokenInfo(type=5 (INDENT), string='\t', start=(2, 0), end=(2, 1), line='\tpi=3.14\n'),
 TokenInfo(type=1 (NAME), string='pi', start=(2, 1), end=(2, 3), line='\tpi=3.14\n'),
 TokenInfo(type=53 (OP), string='=', start=(2, 3), end=(2, 4), line='\tpi=3.14\n'),
 TokenInfo(type=2 (NUMBER), string='3.14', start=(2, 4), end=(2, 8), line='\tpi=3.14\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 8), end=(2, 9), line='\tpi=3.14\n'),
 TokenInfo(type=1 (NAME), string='return', start=(3, 1), end=(3, 7), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=1 (NAME), string='pi', start=(3, 8), end=(3, 10), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=53 (OP), string='*', start=(3, 10), end=(3, 11), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=1 (NAME), string='radius', start=(3, 11), end=(3, 17), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=53 (OP), string='*', start=(3, 17), end=(3, 18), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=1 (NAME), string='slant_height', start=(3, 18), end=(3, 30), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 30), end=(3, 31), line='\treturn pi*radius*slant_height\n'),
 TokenInfo(type=56 (NL), string='\n', start=(4, 0), end=(4, 1), line='\n'),
  TokenInfo(type=6 (DEDENT), string='', start=(5, 0), end=(5, 0), line=''),
  TokenInfo(type=0 (ENDMARKER), string='', start=(5, 0), end=(5, 0), line='')]

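As you can see, I only get one INDENT token (line 10 of the output), and no second one after the second NEWLINE. How can I make sure I get all the correct INDENT tokens for my source code?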

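An INDENT token is generated when a block is entered, not for every line. When the block is exited, generate_tokens() emits a DEDENT token. All tokens from an INDENT up to the next INDENT or the matching DEDENT are at the same indentation level.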
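If what you actually need is the indentation level of every line, one possible approach is to keep a counter that goes up on each INDENT and down on each DEDENT, and record its value at the first token of each logical line. The following is a minimal sketch under that assumption; the helper name `indent_levels` is hypothetical and not part of the tokenize module.

```python
import io
import tokenize

# Token types that never start a logical line's content.
_SKIP = {tokenize.NL, tokenize.COMMENT, tokenize.ENDMARKER}


def indent_levels(source):
    """Return {row: depth} for the first row of each logical line.

    Hypothetical helper: depth is the number of INDENT tokens seen
    so far minus the number of DEDENT tokens seen so far.
    """
    depth = 0
    levels = {}
    at_line_start = True
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            depth += 1              # entering a block
        elif tok.type == tokenize.DEDENT:
            depth -= 1              # leaving a block
        elif tok.type == tokenize.NEWLINE:
            at_line_start = True    # the next real token starts a new logical line
        elif tok.type in _SKIP:
            continue                # blank lines, comments, end of input
        elif at_line_start:
            levels[tok.start[0]] = depth
            at_line_start = False
    return levels


p = ("def cal_cone_curved_surf_area(slant_height,radius):\n"
     "\tpi=3.14\n"
     "\treturn pi*radius*slant_height\n\n")
print(indent_levels(p))  # {1: 0, 2: 1, 3: 1}
```

Both lines of the function body come out at depth 1, even though only the first of them is preceded by an INDENT token.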