How does Firefox (Hunspell) clean text before spellchecking words, and what exactly does it strip?

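I'm trying to clean text in exactly the way Firefox does before it spellchecks individual words, for a Firefox extension I'm building (my add-on uses nspell, a JavaScript implementation of Hunspell, because Firefox doesn't expose the Hunspell instance it uses via the extension API).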

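I've looked through a clone of the Firefox Gecko codebase by searching for "spellcheck", namely mozSpellChecker.h and other related files, but I can't work out how they clean the text.

Reverse engineering it is a major PITA; so far I have this: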

// cleans text and strips out unwanted symbols/patterns before we use it
// returns an empty string if content undefined
function cleanText (content, filter = true) {
  if (!content) {
    console.warn(`MultiDict: cannot clean falsy or undefined content: "${content}"`)
    return ''
  }

  // ToDo: first split string by spaces in order to properly ignore urls
  const rxUrls = /^(http|https|ftp|www)/
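  // the class ends with an escaped backslash (\\); a lone backslash there would escape the closing bracket
  const rxSeparators = /[\s\r\n.,:;!?_<>{}()[\]"`´^$°§½¼³%&¬+=*~#|/\\]/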
  const rxSingleQuotes = /^'+|'+$/g

  // split all content by any character that should not form part of a word
  return content.split(rxSeparators)
    .reduce((acc, string) => {
      // remove any number of single quotes that do not form part of a word i.e. 'y'all' > y'all
      string = string.replace(rxSingleQuotes, '')
      // we never want empty strings, so skip them
      if (string.length < 1) {
        return acc
      }
      // for when we're just cleaning the text of punctuation (i.e. not filtering out emails, etc)
      if (!filter) {
        return acc.concat([string])
      }
      // filter out emails, URLs, numbers, and strings less than 2 characters in length
      if (!string.includes('@') && !rxUrls.test(string) && isNaN(string) && string.length > 1) {
        return acc.concat([string])
      }
      return acc
    }, [])
}

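But I'm still seeing large differences in the results when testing content such as, well, the textarea used to create this question.

To be clear: I'm looking for the exact methods, matches, and rules Firefox uses to clean text. Since it's open source, they should be in there somewhere, but I can't seem to find them!

I believe you want the functions in mozInlineSpellWordUtil.cpp.

From the header: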

/**
 *    This class extracts text from the DOM and builds it into a single string.
 *    The string includes whitespace breaks whereever non-inline elements begin
 *    and end. This string is broken into "real words", following somewhat
 *    complex rules; for example substrings that look like URLs or
 *    email addresses are treated as single words, but otherwise many kinds of
 *    punctuation are treated as word separators. GetNextWord provides a way
 *    to iterate over these "real words".
 *
 *    The basic operation is:
 *
 *    1. Call Init with the weak pointer to the editor that you're using.
 *    2. Call SetPositionAndEnd to to initialize the current position inside the
 *       previously given range and set where you want to stop spellchecking.
 *       We'll stop at the word boundary after that. If SetEnd is not called,
 *       we'll stop at the end of the root element.
 *    3. Call GetNextWord over and over until it returns false.
 */

You can find the complete source here, but it is fairly complex. For example, here is the method used to classify parts of the text as email addresses or URLs, and it takes over 50 lines to handle just those.
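To make the idea from the header comment concrete, here is a rough sketch of "treat URL-like and email-like substrings as single words, and split everything else on punctuation". It is not a port of the Firefox code: the function name `extractWords` and all of the patterns are my own assumptions, and they are far cruder than what mozInlineSpellWordUtil.cpp actually does.

```javascript
// Hypothetical patterns: rough guesses, NOT the rules Firefox applies.
const rxUrlLike = /^(?:https?:\/\/|ftp:\/\/|www\.)\S+$/i
const rxEmailLike = /^[^\s@]+@[^\s@]+\.[^\s@]+$/
const rxSeparators = /[.,:;!?_<>{}()[\]"^$%&+=*~#|\/\\]+/
const rxEdgeQuotes = /^'+|'+$/g

// Hypothetical helper: returns the words that should be spellchecked.
function extractWords (text) {
  const words = []
  // split on whitespace first so a URL or email address stays in one piece
  for (const chunk of text.split(/\s+/)) {
    if (chunk === '') continue
    // drop trailing sentence punctuation before classifying, e.g. "example.com,"
    const trimmed = chunk.replace(/[.,;:!?)]+$/, '')
    // URL-like and email-like chunks count as one "word" and are skipped whole
    if (rxUrlLike.test(trimmed) || rxEmailLike.test(trimmed)) continue
    // everything else is split on punctuation
    for (const piece of chunk.split(rxSeparators)) {
      // keep inner apostrophes (y'all) but strip quotes at the edges
      const word = piece.replace(rxEdgeQuotes, '')
      if (word.length > 0) words.push(word)
    }
  }
  return words
}
```

Calling `extractWords("Mail me@example.com or see https://example.com/a?b=c, y'all!")` returns `['Mail', 'or', 'see', "y'all"]`. Splitting on whitespace before classifying is the important part: if you split on `.` and `/` first, the URL is already in pieces by the time you test for it.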

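Writing a spellchecker seems trivial in principle, but as you can see from the source, it is a significant endeavor. I'm not saying you shouldn't try, but as you have probably found, the devil is in the edge-case details.

As just one example, when deciding what does or does not constitute a word boundary, you have to decide which characters to ignore, including characters outside the ASCII range. For example, here you can see that MONGOLIAN TODO SOFT HYPHEN (U+1806) is treated the same way as SOFT HYPHEN (U+00AD):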

// IsIgnorableCharacter
//
//    These characters are ones that we should ignore in input.

inline bool IsIgnorableCharacter(char ch) {
  return (ch == static_cast<char>(0xAD));  // SOFT HYPHEN
}

inline bool IsIgnorableCharacter(char16_t ch) {
  return (ch == 0xAD ||   // SOFT HYPHEN
          ch == 0x1806);  // MONGOLIAN TODO SOFT HYPHEN
}
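On the JavaScript side, the equivalent of that check could look something like the sketch below. The helper name `stripIgnorable` is made up, and I'm assuming you would remove these characters from each word before handing it to nspell; the snippet above only shows which characters are ignorable, not where in the pipeline Firefox drops them.

```javascript
// U+00AD SOFT HYPHEN and U+1806 MONGOLIAN TODO SOFT HYPHEN,
// the same two code points as in IsIgnorableCharacter(char16_t)
const rxIgnorable = /[\u00AD\u1806]/g

// Hypothetical helper: remove ignorable characters from a word
// before it is passed to the spellchecker.
function stripIgnorable (word) {
  return word.replace(rxIgnorable, '')
}
```

With this, `stripIgnorable('co\u00ADoperate')` gives `'cooperate'`, so a word containing an invisible soft hyphen is checked as the word the user actually sees.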

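Again, I'm not trying to dissuade you from working on this project, but tokenizing text into discrete words in a way that works in an HTML context and in a multilingual environment is a significant effort.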
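If matching Firefox exactly turns out not to be a hard requirement, one alternative worth knowing about is the standard `Intl.Segmenter` API, which does locale-aware word segmentation for you. To be clear, this is a different technique and not what Firefox's spellchecker uses, so the results will not line up with its word boundaries; the function name `segmentWords` and the default locale are my own choices, and the API is only present in fairly recent runtimes.

```javascript
// Hypothetical helper: locale-aware word segmentation using Intl.Segmenter.
// This is NOT Firefox's spellcheck tokenizer, just a built-in alternative.
function segmentWords (text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  const words = []
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments
    if (isWordLike) words.push(segment)
  }
  return words
}
```

For example, `segmentWords('Hello, world!')` returns `['Hello', 'world']`. You would still need your own handling for URLs, email addresses, and ignorable characters on top of this.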