Matching Unicode punctuation using LPeg

Question

I'm trying to create an LPeg pattern that matches any Unicode punctuation character in UTF-8 encoded input. I came up with the following marriage of Selene Unicode and LPeg:

local unicode     = require("unicode")
local lpeg        = require("lpeg")
local any         = lpeg.P(1) -- matches any single byte
local punctuation = lpeg.Cmt(lpeg.Cs(any * any^-3), function(s,i,a)
  -- a holds the 1 to 4 bytes captured by the pattern
  local match = unicode.utf8.match(a, "^%p")
  if match == nil then
    return false
  else
    return i+#match
  end
end)

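This appears to work, but it has several weaknesses. It will miss punctuation characters that are a combination of several Unicode code points (if such characters exist), because I only read 4 bytes ahead. It probably hurts the performance of the parser. And it is undefined what the library's match function will do when I feed it a string that contains a runt (truncated) UTF-8 character, although it appears to work for now.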

I would like to know whether this is a correct approach, or whether there is a better way to achieve what I'm trying to do.

Answer 1

The correct way to match a UTF-8 character is shown in an example on the LPeg homepage. The first byte of a UTF-8 character determines how many more bytes belong to it:

local cont = lpeg.R("\128\191") -- continuation byte

local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont
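To make the lead-byte rule concrete, here is a plain-Lua sketch (no LPeg) of the same classification. The names utf8_len and utf8_chars are hypothetical helpers introduced only for illustration. The sketch also shows what happens with a runt character: the sequence is simply rejected.

```lua
-- Hypothetical helper: number of bytes in the UTF-8 sequence that starts
-- with the given lead byte, or nil if the byte cannot start a sequence.
local function utf8_len(lead)
  if lead <= 127 then return 1                        -- ASCII
  elseif lead >= 194 and lead <= 223 then return 2
  elseif lead >= 224 and lead <= 239 then return 3
  elseif lead >= 240 and lead <= 244 then return 4
  end
  return nil -- continuation byte (128..191) or invalid lead byte
end

-- Hypothetical helper: split s into UTF-8 characters.
-- Returns nil if a sequence is truncated or malformed.
local function utf8_chars(s)
  local chars, i = {}, 1
  while i <= #s do
    local n = utf8_len(s:byte(i))
    if not n or i + n - 1 > #s then return nil end
    -- every byte after the lead byte must be a continuation byte
    for j = i + 1, i + n - 1 do
      local b = s:byte(j)
      if b < 128 or b > 191 then return nil end
    end
    chars[#chars + 1] = s:sub(i, i + n - 1)
    i = i + n
  end
  return chars
end

local chars = utf8_chars("a\227\128\130b") -- "a", U+3002, "b"
local runt  = utf8_chars("\227\128")       -- truncated 3-byte sequence
```

Here chars holds three entries and runt is nil, which mirrors how the LPeg pattern fails to match when a multi-byte character is cut short.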

Building on this utf8 pattern, we can use lpeg.Cmt and the Selene Unicode match function, somewhat like you proposed:

local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
    if unicode.utf8.match(c, "%p") then
        return i
    end
end)

Note that we return i, which is what Cmt expects:

The given function gets as arguments the entire subject, the current position (after the match of patt), plus any capture values produced by patt. The first value returned by function defines how the match happens. If the call returns a number, the match succeeds and the returned number becomes the new current position.

This means we should return the same number the function receives, that is, the position immediately after the UTF-8 character.
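As a rough check of the positions involved, the sketch below runs the same Cmt construction against a few inputs. So that it depends only on LPeg, it swaps the call to unicode.utf8.match for a hypothetical lookup table, is_punct, which covers just a handful of characters and is not a real punctuation classifier.

```lua
local lpeg = require("lpeg")

local cont = lpeg.R("\128\191") -- continuation byte
local utf8 = lpeg.R("\0\127")
           + lpeg.R("\194\223") * cont
           + lpeg.R("\224\239") * cont * cont
           + lpeg.R("\240\244") * cont * cont * cont

-- Assumption: a tiny stand-in for unicode.utf8.match(c, "%p").
local is_punct = {
  ["!"] = true, [","] = true, ["."] = true,
  ["\227\128\130"] = true, -- U+3002 IDEOGRAPHIC FULL STOP
  ["\226\128\166"] = true, -- U+2026 HORIZONTAL ELLIPSIS
}

local punctuation = lpeg.Cmt(lpeg.C(utf8), function (s, i, c)
  if is_punct[c] then
    return i -- i already points just past the matched character
  end
  -- returning no value makes the match fail
end)

local pos_ascii = lpeg.match(punctuation, "!abc")            -- 2
local pos_wide  = lpeg.match(punctuation, "\227\128\130abc") -- 4
local pos_none  = lpeg.match(punctuation, "abc")             -- nil
```

Because the function returns only i and no extra values, the pattern produces no captures, so lpeg.match reports the position right after the punctuation character: 2 for the one-byte case and 4 for the three-byte case.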
