How does Firefox (Hunspell) clean text before spellchecking words, and what does it strip out?
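I'm trying to clean text in exactly the way Firefox does before it spellchecks individual words, for a Firefox extension I'm building (my add-on uses nspell, a JavaScript implementation of Hunspell, because Firefox doesn't expose the Hunspell instance it uses through the extension API).
I've looked through a clone of the Firefox gecko codebase by searching for "spellcheck", i.e. in the mozSpellChecker.h file and other related files, but I can't work out how they clean the text.
Reverse engineering it is a major PITA. So far I have this: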
// cleans text and strips out unwanted symbols/patterns before we use it
// returns an empty string if content undefined
function cleanText (content, filter = true) {
  if (!content) {
    console.warn(`MultiDict: cannot clean falsy or undefined content: "${content}"`)
    return ''
  }

  // ToDo: first split string by spaces in order to properly ignore urls
  const rxUrls = /^(http|https|ftp|www)/
  // the trailing backslash must be escaped, otherwise the character class never closes
  const rxSeparators = /[\s\r\n.,:;!?_<>{}()[\]"`´^$°§½¼³%&¬+=*~#|\/\\]/
  const rxSingleQuotes = /^'+|'+$/g

  // split all content by any character that should not form part of a word
  return content.split(rxSeparators)
    .reduce((acc, string) => {
      // remove any number of single quotes that do not form part of a word i.e. 'y'all' > y'all
      string = string.replace(rxSingleQuotes, '')
      // we never want empty strings, so skip them
      if (string.length < 1) {
        return acc
      }
      // for when we're just cleaning the text of punctuation (i.e. not filtering out emails, etc)
      if (!filter) {
        return acc.concat([string])
      }
      // filter out emails, URLs, numbers, and strings less than 2 characters in length
      if (!string.includes('@') && !rxUrls.test(string) && isNaN(string) && string.length > 1) {
        return acc.concat([string])
      }
      return acc
    }, [])
}
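But I'm still seeing big differences between my results and Firefox's when testing with content such as, well, the textarea used to create this question.
To be clear: I'm looking for the exact methods, matches, and rules Firefox uses to clean text. Since it's open source, they should be in there somewhere, but I can't seem to find them!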
I believe you want the functions in mozInlineSpellWordUtil.cpp.
From the header:
/**
* This class extracts text from the DOM and builds it into a single string.
* The string includes whitespace breaks whereever non-inline elements begin
* and end. This string is broken into "real words", following somewhat
* complex rules; for example substrings that look like URLs or
* email addresses are treated as single words, but otherwise many kinds of
* punctuation are treated as word separators. GetNextWord provides a way
* to iterate over these "real words".
*
* The basic operation is:
*
* 1. Call Init with the weak pointer to the editor that you're using.
* 2. Call SetPositionAndEnd to to initialize the current position inside the
* previously given range and set where you want to stop spellchecking.
* We'll stop at the word boundary after that. If SetEnd is not called,
* we'll stop at the end of the root element.
* 3. Call GetNextWord over and over until it returns false.
*/
You can find the complete source here, but it is fairly complex. For example, here is the method used to classify part of the text as an email address or a URL, and handling just that takes more than 50 lines.
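To make the header comment a bit more concrete, here is a minimal JavaScript sketch of the same idea: split on whitespace first, treat anything that looks like a URL or an email address as a single unit and skip it, and only then split the rest on punctuation. The names `looksLikeUrlOrEmail` and `extractWords` and all of the patterns are my own assumptions for illustration. They are far simpler than what mozInlineSpellWordUtil.cpp does and will not reproduce Firefox's behavior exactly.

```javascript
// Hypothetical helper: a rough guess at "looks like a URL or email address".
// Firefox's real classification is considerably more involved.
function looksLikeUrlOrEmail (token) {
  return /^[a-z][a-z0-9+.-]*:\/\//i.test(token) || // scheme://...
    /^www\./i.test(token) ||                       // www.something
    /^[^\s@]+@[^\s@]+$/.test(token)                // something@something
}

// Hypothetical helper: yields candidate words from plain text.
function * extractWords (text) {
  // Assumption: everything except letters, digits and apostrophes separates words.
  const rxSeparators = /[^\p{L}\p{N}'’]+/u
  const rxEdgeQuotes = /^['’]+|['’]+$/g

  for (const token of text.split(/\s+/)) {
    // URLs and emails are handled as one unit, before any punctuation splitting
    if (token === '' || looksLikeUrlOrEmail(token)) continue

    for (const part of token.split(rxSeparators)) {
      const word = part.replace(rxEdgeQuotes, '')
      // skip empty strings and pure numbers
      if (word !== '' && !/^\p{N}+$/u.test(word)) yield word
    }
  }
}

const sample = "Visit https://example.com or mail me@example.com, 'y'all' have 2 choices!"
console.log([...extractWords(sample)])
```

Splitting on whitespace before punctuation is what keeps `https://example.com` from being chopped into `https`, `example` and `com`, which is the problem the ToDo in your `cleanText` points at.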
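Writing a spell checker seems trivial in principle, but as you can see from the source, it is a major endeavor. I'm not saying you shouldn't try, but as you have probably discovered, the devil is in the edge cases.
As just one example, when you decide what does or does not constitute a word boundary, you also have to decide which characters to ignore, including characters outside the ASCII range. For example, here you can see the MONGOLIAN TODO SOFT HYPHEN (U+1806) being handled the same way as the SOFT HYPHEN (U+00AD), i.e. both are ignored in input: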
// IsIgnorableCharacter
//
//    These characters are ones that we should ignore in input.

inline bool IsIgnorableCharacter(char ch) {
  return (ch == static_cast<char>(0xAD));  // SOFT HYPHEN
}

inline bool IsIgnorableCharacter(char16_t ch) {
  return (ch == 0xAD ||   // SOFT HYPHEN
          ch == 0x1806);  // MONGOLIAN TODO SOFT HYPHEN
}
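If you port this to JavaScript, the equivalent of IsIgnorableCharacter could look like the sketch below, which removes both code points from a word before it is handed to the spellchecker. `stripIgnorable` is a hypothetical name of my own, and I'm assuming the characters are simply dropped before the dictionary lookup. Check the Firefox source for where exactly the function is applied.

```javascript
// Code points taken from IsIgnorableCharacter above:
// U+00AD SOFT HYPHEN and U+1806 MONGOLIAN TODO SOFT HYPHEN.
const rxIgnorable = /[\u00AD\u1806]/g

// Hypothetical helper: drop ignorable characters from a word.
function stripIgnorable (word) {
  return word.replace(rxIgnorable, '')
}

console.log(stripIgnorable('co\u00ADoperate')) // cooperate
```

Without a step like this, a word containing an invisible soft hyphen would be looked up as a different string than the one the user sees, and would likely be flagged as misspelled.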
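Again, I'm not trying to discourage you from working on this project, but tokenizing text into discrete words in a way that works in an HTML context and in multilingual environments is a major undertaking.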