How to develop a lexical analyzer with JavaScript?

Question

I wrote a lexer function that takes a string and splits its items into an array, like this:

const lexer = (str) =>
  str
    .split(" ")
    .map((s) => s.trim())
    .filter((s) => s.length);

console.log(lexer("John Doe")); // outputs ["John", "Doe"]

Now I want to develop a lexer in JavaScript that also identifies token types. Given input like this:

if (foo) {
  bar();
}

it should return output like this:

[
  {
    lexeme: 'if',
    type: 'keyword',
    position: {
      row: 0,
      col: 0
    }
  },
  {
    lexeme: '(',
    type: 'open_paren',
    position: {
      row: 0,
      col: 3
    }
  },
  {
    lexeme: 'foo',
    type: 'identifier',
    position: {
      row: 0,
      col: 4
    }
  },
  ...
]

How can I develop a lexer in JavaScript that identifies token types?

Thanks in advance.

Answer 1

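The most common lexing pattern I've seen in JavaScript (for example in KaTeX and CoffeeScript) is to define a regular expression that contains all of the tokens you might see, and then iterate over that regular expression's matches in some way.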

Here is a simple lexer that covers your JavaScript example (but that also silently skips invalid content):

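const input = 'if (foo) {\n  bar();\n}'; // sample source text to tokenize
const tokenRegExp = /[(){}\n]|(\w+)/g;
const tokenMap = {
  '(': 'open_paren',
  ')': 'close_paren',
  '{': 'open_brace',
  '}': 'close_brace',
};
let row = 0;
let lineStart = 0; // index in `input` where the current row begins
const tokens = [];
let match; // must be declared outside the loop condition
while ((match = tokenRegExp.exec(input)) !== null) {
  let type;
  if (match[1]) { // use groups to identify which part of the RegExp is matching
    type = 'identifier';
  } else if (tokenMap[match[0]]) { // use lookup table for simple tokens
    type = tokenMap[match[0]];
  }
  if (type) {
    tokens.push({
      lexeme: match[0],
      type,
      // derive the column from the match index so skipped characters
      // (spaces, semicolons, ...) still count toward the position
      position: {row, col: match.index - lineStart},
    });
  }
  // Update row number and remember where the next row starts
  if (match[0] === '\n') {
    row++;
    lineStart = match.index + 1;
  }
}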
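
Note that the lexer above tags every word, including `if`, as `identifier`, whereas the question expects `keyword`. One possible way to close that gap is a small post-processing pass over the token list. The sketch below is only an illustration: `KEYWORDS` and `markKeywords` are hypothetical names, and the keyword set is a deliberately incomplete subset of JavaScript's reserved words.

```javascript
// Illustrative, incomplete subset of reserved words (an assumption, not a full list).
const KEYWORDS = new Set(['if', 'else', 'for', 'while', 'return', 'function']);

// Hypothetical helper: returns a new token list in which identifiers that are
// reserved words are relabeled as 'keyword'. The input array is not mutated.
function markKeywords(tokens) {
  return tokens.map((token) =>
    token.type === 'identifier' && KEYWORDS.has(token.lexeme)
      ? { ...token, type: 'keyword' }
      : token
  );
}

const sample = [
  { lexeme: 'if', type: 'identifier', position: { row: 0, col: 0 } },
  { lexeme: 'foo', type: 'identifier', position: { row: 0, col: 4 } },
];
console.log(markKeywords(sample).map((t) => t.type)); // [ 'keyword', 'identifier' ]
```

Keeping the keyword check out of the regular expression means the reserved-word list can grow without touching the token pattern.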

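Other parsers use regular expressions to match a prefix of the string, then discard that part of the string and continue matching from where the previous match ended. (This avoids skipping invalid content.)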
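
A minimal sketch of that prefix-matching approach might look like the following. The `rules` table, the `tokenize` function, and the `semicolon` token type are assumptions made for illustration; they do not come from KaTeX or CoffeeScript. Each pattern is anchored with `^` so it can only match at the start of the remaining text, and anything no rule recognizes raises an error instead of being skipped.

```javascript
// Hypothetical rule table: every pattern is anchored to the start of the
// remaining input. Rules marked `skip` are consumed but produce no token.
const rules = [
  { type: 'newline', pattern: /^\n/, skip: true },
  { type: 'whitespace', pattern: /^[ \t\r]+/, skip: true },
  { type: 'open_paren', pattern: /^\(/ },
  { type: 'close_paren', pattern: /^\)/ },
  { type: 'open_brace', pattern: /^\{/ },
  { type: 'close_brace', pattern: /^\}/ },
  { type: 'semicolon', pattern: /^;/ },
  { type: 'identifier', pattern: /^[A-Za-z_$][\w$]*/ },
];

function tokenize(input) {
  const tokens = [];
  let rest = input; // the part of the input that has not been consumed yet
  let row = 0;
  let col = 0;
  while (rest.length > 0) {
    // Pick the first rule whose pattern matches a prefix of the remaining text.
    const rule = rules.find((r) => r.pattern.test(rest));
    if (!rule) {
      throw new SyntaxError(
        `Unexpected character ${JSON.stringify(rest[0])} at ${row}:${col}`
      );
    }
    const lexeme = rest.match(rule.pattern)[0];
    if (!rule.skip) {
      tokens.push({ lexeme, type: rule.type, position: { row, col } });
    }
    // Discard the matched prefix and advance the position counters.
    rest = rest.slice(lexeme.length);
    if (rule.type === 'newline') {
      row++;
      col = 0;
    } else {
      col += lexeme.length;
    }
  }
  return tokens;
}

console.log(tokenize('if (foo) {\n  bar();\n}'));
```

Repeatedly slicing the string is simple but copies the remaining text on every token; for long inputs, a sticky regular expression (the `y` flag) with `lastIndex` is a common alternative that avoids the copies.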

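That said, I wouldn't recommend writing your own JavaScript lexer except for educational purposes; there are plenty of existing ones that probably catch more edge cases than yours would.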

javascript

lexical-analysis