How to tokenize a code snippet using a TextMate grammar in Node

Question

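I'm trying to syntax highlight code snippets on my library's website. I've tried Highlight.js and Prism, but neither of them tokenizes the code (it is Ruby) correctly, so the result isn't highlighted properly. That is because each of them implements its own tokenization regexes, an approach that is bound to be flawed.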

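Then I found out that GitHub, Atom and VSCode all use TextMate grammars for tokenization. That sounds like the right approach to me: language grammars are maintained in one place and other tools can reuse them, instead of every tool defining its own.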

My question is: how do I tokenize a string of code using a TextMate grammar in Node? This is what I'm aiming for:

const codeSnippet = `
class Foo
  def bar
    puts "baz"
  end
end
`

const tokenized = tokenizeCode(codeSnippet, 'ruby')

tokenized // some kind of array of tokens, e.g:
// [
//   ['keyword', 'class'],
//   ['whitespace', ' '],
//   ['class', 'Foo'],
//   ...
// ]

I've tried vscode-textmate, which appears to be what VSCode uses for its own syntax highlighting. However, I couldn't figure out how to use it to achieve the above.

Ultimately I want to end up with HTML that I can style for syntax highlighting:

<pre>
  <code>
    <span class="token keyword">class</span> <span class="token class">Foo</span>
    <!-- ... -->
  </code>
</pre>

Again, I've tried Highlight.js and Prism, and both of them tokenize even the simplest Ruby code incorrectly.

Edit

Here are some examples of Prism and Highlight.js tokenizing Ruby code incorrectly:

Highlight.js – doesn't tokenize Post as a "constant"

const hljs = require("highlight.js/lib/highlight.js");
hljs.registerLanguage('ruby', require('highlight.js/lib/languages/ruby'));

const rubyCode = `Post.create(params[:post])`
const html = hljs.highlight('ruby', rubyCode).value

console.log(html)
// Post.create(params[<span class="hljs-symbol">:post</span>])

Prism – doesn't tokenize foo: as a "symbol"

const Prism = require('prismjs');
const loadLanguages = require('prismjs/components/');
loadLanguages(['ruby']);

const rubyCode = `{ foo: "bar" }`
const html = Prism.highlight(rubyCode, Prism.languages.ruby, 'ruby')

console.log(html)
// <span class="token punctuation">{</span> foo<span class="token punctuation">:</span> <span class="token string">"bar"</span> <span class="token punctuation">}</span>

Answer 1

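After posting my comment I tried again, and this time it worked. The following example shows how to use the official TypeScript.tmLanguage with vscode-textmate, but the basics should apply to other languages as well.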

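First make sure Python 2.7 (not 3.X) is installed on your machine and, on Windows, available in your PATH variable.
Install vscode-textmate using npm or yarn; the installation invokes the Python interpreter.
Get the XML grammar for your language (the file usually ends in .tmLanguage) and place it in your project (the code below reads it from ./grammars/).
Use the vscode-textmate package as follows: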

import * as fs from "fs";
import { INITIAL, parseRawGrammar, Registry } from "vscode-textmate";

const registry = new Registry({
    // eslint-disable-next-line @typescript-eslint/require-await
    loadGrammar: async (scopeName) => {
        if (scopeName === "source.ts") {
            return new Promise<string>((resolve, reject) =>
                fs.readFile("./grammars/TypeScript.tmLanguage", (error, data) =>
                    error !== null ? reject(error) : resolve(data.toString())
                )
            ).then((data) => parseRawGrammar(data));
        }
        console.info(`Unknown scope: ${scopeName}`);
        return null;
    },
});

registry.loadGrammar("source.ts").then(
    (grammar) => {
        fs.readFileSync("./samples/test.ts")
            .toString()
            .split("\n")
            .reduce((previousRuleStack, line) => {
                console.info(`Tokenizing line: ${line}`);
                const { ruleStack, tokens } = grammar.tokenizeLine(line, previousRuleStack);
                tokens.forEach((token) => {
                    console.info(
                        ` - ${token.startIndex}-${token.endIndex} (${line.substring(
                            token.startIndex,
                            token.endIndex
                        )}) with scopes ${token.scopes.join(", ")}`
                    );
                });
                return ruleStack;
            }, INITIAL);
    },
    (error) => {
        console.error(error);
    }
);

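Keep in mind that the string source.ts does not refer to a file; it is the scope name declared inside the grammar file. In your case it will most likely be source.ruby. Also, the snippet is unoptimized and barely readable, but it should give you a first idea of how to use the package.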
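To connect this with the tokenizeCode(codeSnippet, 'ruby') call from the question, you would need some way to go from a language name to a scope name, and from a scope name to a grammar file. Here is a minimal sketch of such a lookup. The table, the function names and the Ruby grammar path are my own assumptions for illustration; only the TypeScript path comes from the code above.

```javascript
// Hypothetical table: language name -> scope name and grammar file.
// Only the TypeScript entry mirrors the snippet above; the Ruby path is made up.
const languages = new Map([
  ["typescript", { scopeName: "source.ts", path: "./grammars/TypeScript.tmLanguage" }],
  ["ruby", { scopeName: "source.ruby", path: "./grammars/Ruby.tmLanguage" }],
]);

// Translate a language name into the scope name a grammar registry expects.
function scopeNameFor(language) {
  const entry = languages.get(language);
  return entry ? entry.scopeName : null;
}

// Find the grammar file for a scope name, or null for an unknown scope.
function grammarPathFor(scopeName) {
  for (const entry of languages.values()) {
    if (entry.scopeName === scopeName) return entry.path;
  }
  return null;
}

console.log(scopeNameFor("ruby")); // source.ruby
console.log(grammarPathFor("source.ruby")); // ./grammars/Ruby.tmLanguage
console.log(grammarPathFor("source.python")); // null
```

Inside the loadGrammar callback, a helper like grammarPathFor(scopeName) could replace the hard-coded if (scopeName === "source.ts") check.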

Once you have extracted the tokens, you can map them to whatever output you need.

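The vscode-textmate snippet prints each line it tokenizes, followed by one entry per token showing its start and end index, its text, and its scopes.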
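The mapping step could look something like this when the goal is HTML. This is only a sketch under a few assumptions: the tokens have the { startIndex, endIndex, scopes } shape used in the code above, only the innermost scope is turned into CSS classes, and the names escapeHtml and renderLine are mine, not part of vscode-textmate. The sample tokens are written by hand, not real tokenizer output.

```javascript
// Escape characters that are special in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Render one line of code as HTML.
// `tokens` is assumed to look like the tokens from grammar.tokenizeLine():
// { startIndex, endIndex, scopes }.
function renderLine(line, tokens) {
  return tokens
    .map((token) => {
      const text = escapeHtml(line.substring(token.startIndex, token.endIndex));
      // Text that only carries the root scope (e.g. "source.ruby") stays unwrapped.
      if (token.scopes.length < 2) return text;
      // Assumed convention: use the innermost (last) scope and turn its
      // dot-separated segments into space-separated CSS classes.
      const scope = token.scopes[token.scopes.length - 1];
      const classes = escapeHtml(scope.replace(/\./g, " "));
      return `<span class="${classes}">${text}</span>`;
    })
    .join("");
}

// Hand-written sample tokens for illustration.
const sampleLine = "answer = 42";
const sampleTokens = [
  { startIndex: 0, endIndex: 7, scopes: ["source.ruby"] },
  { startIndex: 7, endIndex: 8, scopes: ["source.ruby", "keyword.operator.assignment.ruby"] },
  { startIndex: 8, endIndex: 9, scopes: ["source.ruby"] },
  { startIndex: 9, endIndex: 11, scopes: ["source.ruby", "constant.numeric.ruby"] },
];

console.log(renderLine(sampleLine, sampleTokens));
// answer <span class="keyword operator assignment ruby">=</span> <span class="constant numeric ruby">42</span>
```

Calling renderLine once per line inside the reduce callback and joining the results with "\n" inside a pre/code element would give markup close to what the question asks for.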

Answer 2

I found the Highlights package under the Atom organization, which uses TextMate grammars and produces tokenized markup. It also has a synchronous API, which I need for integrating with Remarkable.

const Highlights = require("highlights")

const highlighter = new Highlights()

const html = highlighter.highlightSync({
  fileContents: 'answer = 42',
  scopeName: 'source.ruby',
})

html //=>
// <pre class="editor editor-colors">
//   <div class="line">
//     <span class="source ruby">
//       <span>answer&nbsp;</span>
//       <span class="keyword operator assignment ruby">
//         <span>=</span>
//       </span>
//       <span>&nbsp;</span>
//       <span class="constant numeric ruby">
//         <span>42</span>
//       </span>
//     </span> 
//   </div>
// </pre>

Under the hood it uses First Mate for tokenization, which is an alternative to vscode-textmate but easier to use:

const { GrammarRegistry } = require('first-mate')

const registry = new GrammarRegistry()
const grammar = registry.loadGrammarSync('./ruby.cson')

const tokens = grammar.tokenizeLines('answer = 42') // does all the work

tokens[0] //=>
// [ { value: 'answer ', scopes: [ 'source.ruby' ] },
//   { value: '=',
//     scopes: [ 'source.ruby', 'keyword.operator.assignment.ruby' ] },
//   { value: ' ', scopes: [ 'source.ruby' ] },
//   { value: '42',
//     scopes: [ 'source.ruby', 'constant.numeric.ruby' ] } ]
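To get from these tokens to the [type, text] pairs asked for in the question, a mapping along these lines might work. It is a minimal sketch: the function name toPairs, the "text" fallback type, and the rule of taking the first segment of the innermost scope are my own assumptions, not something First Mate provides. The input is the token line shown above, copied by hand.

```javascript
// Convert tokenized lines (arrays of { value, scopes }) into arrays of
// [type, text] pairs, one array per line.
function toPairs(tokenLines) {
  return tokenLines.map((lineTokens) =>
    lineTokens.map(({ value, scopes }) => {
      // Tokens that only carry the root scope get the assumed type "text".
      if (scopes.length < 2) return ["text", value];
      // Assumed naming rule: first segment of the innermost scope, so
      // "keyword.operator.assignment.ruby" becomes "keyword".
      const innermost = scopes[scopes.length - 1];
      return [innermost.split(".")[0], value];
    })
  );
}

// The token line from the output above, copied by hand.
const tokenLines = [
  [
    { value: "answer ", scopes: ["source.ruby"] },
    { value: "=", scopes: ["source.ruby", "keyword.operator.assignment.ruby"] },
    { value: " ", scopes: ["source.ruby"] },
    { value: "42", scopes: ["source.ruby", "constant.numeric.ruby"] },
  ],
];

console.log(JSON.stringify(toPairs(tokenLines)));
// [[["text","answer "],["keyword","="],["text"," "],["constant","42"]]]
```

Keeping only the first scope segment loses detail (an operator and a control keyword both become "keyword"), so you may prefer to keep more of the scope if your stylesheet needs finer distinctions.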
