awk with multiline regex; output filename based on awk match

I'm currently trying to extract 300-odd functions and subroutines from a 22kLoC file, and I've decided to try doing it programmatically (I did the biggest blocks by hand).

Consider a file of the following form:

declare sub DoStatsTab12( byval shortlga as string)
declare sub DoStatsTab13( byval shortlga as string)
declare sub ZOMFGAnotherSub

Other lines that start with something other than "/^sub \w+/" or "/^end sub/"

sub main

    This is the first sub: it should be in the output file mainFunc.txt

end sub

sub test

    This is a second sub

    it has more lines than the first.

    It is supposed to go to testFunc.txt

end sub

Function ConvertFileName(ByVal sTheName As String) As String

    This is a function so I should not see it if I am awking subs

    But when I alter the awk to chunk out functions, it will go to ConvertFileNameFunc.txt    

End Function

sub InitialiseVars(a, b, c)

    This sub has some arguments - next step is to parse out its arguments
    Code code code;
    more code;
    ' maybe a comment, even? 


  and some code which is badly indented (original code was written by a guy who didn't believe in structure or documentation)

    and


  with an arbitrary number of newlines between bits of code because why not? 


    So anyhow - the output of awk should be everything from sub InitialiseVars to end sub, and should go into InitialiseVarsFunc.txt

end sub

The gist: find blocks that start with ^sub [subName](subArgs) and end with ^end sub.

Then (and this is the part I can't figure out): save each extracted sub into a file named [subName]Func.txt.

awk suggests itself as a candidate (I've written text-extraction regex queries in PHP with preg_match() in the past, but I don't want to count on having a WAMP/LAMP stack available).

My starting point is pleasingly minimal (double quotes because of Windows):

awk "/^sub/,/^end sub/" fName

This finds the relevant blocks (and prints them to stdout).

The step of getting that output into a file, and naming the file after the sub name awk captures, is beyond me.

An earlier stage of this process involved awking out the subroutine names and storing them: that was easy, because every sub is declared by a single line of the form

declare sub [subName](subArgs)

So this did the trick, and did it perfectly -

awk "match($0, /declare sub (\w+)/)
{print substr($0, RSTART, index($0, \"(\")>0 ? index($0, \"(\")-1: RLENGTH)
     > substr($0, RSTART, index($0, \"(\")>0 ? index($0, \"(\")-1: RLENGTH)\".txt\"}"
fName

(I've tried to lay it out so that it's easy to see that what awk prints and the output filename - parsed up to the first '(', if there is one - are the same.)

It seemed to me that if the output of

awk '/^sub/,/^end sub/' fName

were concatenated into an array, then $2 (suitably truncated at '(') would do the job. But it didn't.

I've looked through the various SO (and other SE family) threads that deal with multiline awk - for example, this one and this one - but none of them quite matched my problem (they help with getting the match itself, but not with piping it into a file named after that match).

I have RTFD'd for awk (and grep) as well, to no avail.

I would suggest

awk -F '[ (]*' '            # Field separator is space or open paren (for
                            # parameter lists). * because there may be multiple
                            # spaces, and parens only appear after the stuff we
                            # want to extract.
  BEGIN { IGNORECASE = 1 }  # case-insensitive pattern matching is probably
                            # a good idea because Basic is case-insensitive.
  /^sub/ {                  # if the current line begins with "sub"
    outfile = $2 "Func.bas"  # set the output file name (sub name + suffix)
    flag = 1                # and the flag to know that output should happen
  }
  flag == 1 {               # if the flag is set
    print > outfile         # print the line to the outfile
  }
  /^end sub/ {              # when the sub ends, 
    flag = 0                # unset the flag
  }
' foo.bas
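
A practical note of my own rather than part of the answer: IGNORECASE is a gawk extension, so this wants gawk (or another awk that supports it) rather than a minimal POSIX awk. Since quoting a multi-line program on the Windows command line is painful, one option is to save the program to a file and run it from there (the file name extract_subs.awk is just an illustration):

# save the program above as extract_subs.awk (the name is just an example),
# then run it against the source file; one <subName>Func.bas file per sub
# should appear in the current directory
gawk -f extract_subs.awk foo.bas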

Note that parsing source code with simple pattern-matching tools is error-prone, because programming languages are generally not regular languages (with a few exceptions such as Brainfuck). This kind of thing always depends on how the code is formatted.

For example, if a sub declaration is split across two lines somewhere in the code (I believe _ can do that, though Basic is not something I work with every day), attempting to extract the sub name from the first line of its definition is futile. Formatting may also require fine-tuning of the patterns; things like stray whitespace at the beginning of a line have to be handled. Strictly use this sort of thing for one-off code transformations, and verify that it produced the expected results; don't try to make it part of a regular workflow.
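
If continuation lines do turn up, one way to cope - a minimal sketch of my own, not part of the answer above, assuming Basic's trailing _ continuation marker - is to join continued lines into single logical lines before doing the extraction:

awk '
  /_[ \t]*$/ {             # line ends with "_" (plus optional trailing blanks)
    sub(/_[ \t]*$/, "")    # drop the continuation marker
    buf = buf $0           # accumulate the partial line
    next                   # and read the next physical line
  }
  { print buf $0; buf = "" }   # emit the joined logical line
' foo.bas

The joined output could then be piped into the extraction script instead of reading foo.bas directly.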

Another awk way

awk -F'[ (]' 'x+=(/^sub/&&file=$2"Func.txt"){print > file}/^end sub/{x=file=""}' file

Explanation

awk -F'[ (]'                     - Set the field separator to space or open
                                   bracket.

x+=(/^sub/&&file=$2"Func.txt")   - If the line begins with "sub", sets x to 1 and
                                   sets file to the second field + "Func.txt". As
                                   this expression is the pattern, the block that
                                   follows is executed for every line until x is
                                   unset.

{print > file}                   - Whilst x is true, print the line into the set
                                   filename.

/^end sub/{x=file=""}            - If the line begins with "end sub", set both x
                                   and file to nothing.
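
One tweak worth considering (my suggestion, not part of the answer above): with 300-odd subs, every output file stays open until the program ends, and some awk implementations will eventually complain about too many open files. Closing each file when its sub ends avoids that:

awk -F'[ (]' 'x+=(/^sub/&&file=$2"Func.txt"){print > file}/^end sub/{close(file);x=file=""}' file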