Removing a regex terminal from the AST returned by lark-parser
I am interested in using lark to parse the typical output of a website crawler. Here is some sample output based on my own GitHub site:
--------------------------------------------------------------------
All found URLs:
https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
https://awa5114.github.io/2021/01/12/mypy-pycharm.html
https://awa5114.github.io/2021/01/12/#step-6-test-the-template-on-a-script
--------------------------------------------------------------------
All local URLs:
https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
https://awa5114.github.io/2021/01/12/mypy-pycharm.html
--------------------------------------------------------------------
All foreign URLs:
https://github.com/awa5114
https://github.com/jekyll/jekyll
https://github.com/jekyll/minima
--------------------------------------------------------------------
All broken URLs:
I am using the following grammar:
start: section~4
section: (bar "All " descriptor " URLs:" link_list)
link_list: (url)*
descriptor: "found" | "local" | "foreign" | "broken"
url: /.+/
bar: /-{68}/
%import common.NEWLINE
%ignore NEWLINE
Calling pretty() on the resulting tree gives the following:
start
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
      url https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url https://awa5114.github.io/2021/01/12/mypy-pycharm.html
      url https://awa5114.github.io/2021/01/12/#step-6-test-the-template-on-a-script
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
      url https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url https://awa5114.github.io/2021/01/12/mypy-pycharm.html
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
      url https://github.com/awa5114
      url https://github.com/jekyll/jekyll
      url https://github.com/jekyll/minima
  section
    bar --------------------------------------------------------------------
    descriptor
    link_list
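This looks fine, but I would like to keep the terminal bar out of my tree. How can I do that? I looked at the docs and tried prefixing bar with an underscore and/or a question mark, but for some reason that does not help...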
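Actually, I just figured it out. Prefixing bar with an underscore is not enough; it also has to be uppercase, so that it becomes a named terminal rather than a rule (lark filters terminals whose names start with an underscore out of the tree), like so: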
start: section~4
section: (_BAR "All " descriptor " URLs:" link_list)
link_list: (url)*
descriptor: "found" | "local" | "foreign" | "broken"
url: /.+/
_BAR: /-{68}/
%import common.NEWLINE
%ignore NEWLINE
This results in the following tree:
start
  section
    descriptor
    link_list
      url https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url https://awa5114.github.io/2021/01/12/mypy-pycharm.html
      url https://awa5114.github.io/2021/01/12/#step-6-test-the-template-on-a-script
  section
    descriptor
    link_list
      url https://awa5114.github.io/2021/01/12/#step-2-select-location-and-environment-for-new-project
      url https://awa5114.github.io/2021/01/12/mypy-pycharm.html
  section
    descriptor
    link_list
      url https://github.com/awa5114
      url https://github.com/jekyll/jekyll
      url https://github.com/jekyll/minima
  section
    descriptor
    link_list
It would be nice if this were made explicit in the lark-parser docs...