带有真实 "Full Text Search" 和拼写错误的 SQLite（FTS+spellfix 一起）

Question

假设我们有 100 万行这样的行：

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')

背景：

我知道如何用 Sqlite 做到这一点：

用单词查询查找一行，最多有几个拼写错误用spellfix module and Levenshtein distance (I have posted a 关于如何编译它，如何使用它，...):

db.enable_load_extension(True)
db.load_extension('./spellfix')
c.execute('SELECT * FROM mytable WHERE editdist3(description, "Riehmand") < 300'); print c.fetchall()

#Query: 'Riehmand'
#Answer: [(1, u'Riemann')]

对于 1M 行，这会非常慢！由于 detailed here，postgresql 可能会使用 trigrams 对此进行优化。 Sqlite 提供的快速解决方案是使用 VIRTUAL TABLE USING spellfix:

c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
c.execute('INSERT INTO mytable3(word) VALUES ("Riemann")')
c.execute('SELECT * FROM mytable3 WHERE word MATCH "Riehmand"'); print c.fetchall()

#Query: 'Riehmand'
#Answer: [(u'Riemann', 1, 76, 0, 107, 7)], working!

查找具有匹配 一个或多个单词 与 FTS ("Full Text Search") 的查询的表达式:

c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
c.execute('INSERT INTO mytable2 VALUES (2, "All the Carmichael numbers")')
c.execute('SELECT * FROM mytable2 WHERE description MATCH "NUMBERS carmichael"'); print c.fetchall()

#Query: 'NUMBERS carmichael'
#Answer: [(2, u'All the Carmichael numbers')]

它不区分大小写，您甚至可以使用顺序错误的两个单词进行查询等：FTS 确实非常强大。但缺点是每个查询关键字必须拼写正确，即FTS本身不允许拼写错误

问题：

如何使用 Sqlite 进行全文搜索 (FTS) 并允许拼写错误? 即 "FTS + spellfix"

示例：

数据库中的行："All the Carmichael numbers"
查询："NUMMBER carmickaeel"应该匹配吧！

如何用 Sqlite 做到这一点？

由于 this page 声明：

，因此 Sqlite 可能是可行的

Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.

相关问题：

Answer 1

spellfix1 文档实际上告诉您如何执行此操作。来自 Overview section:

If you intend to use this virtual table in cooperation with an FTS4 table (for spelling correction of search terms) then you might extract the vocabulary using an fts4aux table:
INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';

SELECT term from search_aux WHERE col='*'语句。

将此与您的示例联系起来，其中 mytable2 是您的 fts4 虚拟 table，您可以创建一个 fts4aux table 并将这些标记插入您的 mytable3 spellfix1 table 与：

CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';

您可能希望进一步限定该查询以跳过任何已插入到 spellfix1 中的术语，否则您最终会得到双重条目：

INSERT INTO mytable3(word)
    SELECT term FROM mytable2_terms
    WHERE col='*' AND 
        term not in (SELECT word from mytable3_vocab);

现在您可以使用 mytable3 将拼写错误的单词映射到更正的标记，然后在 MATCH 查询中使用这些更正的标记对 mytable2.

根据您的需要，这可能意味着您需要进行自己的令牌处理和查询构建；没有公开的 fts4 查询语法解析器。因此，您的双标记搜索字符串需要拆分，每个标记运行通过 spellfix1 table 映射到现有标记，然后将这些标记提供给 fts4 查询。

忽略 SQL 语法来处理这个问题，使用 Python 进行拆分很容易：

def spellcheck_terms(conn, terms):
    cursor = conn.cursor()
    base_spellfix = """
        SELECT :term{0} as term, word FROM spellfix1data
        WHERE word MATCH :term{0} and top=1
    """
    terms = terms.split()
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    query = " UNION ".join([
        base_spellfix.format(i + 1) for i in range(len(params))])
    cursor.execute(query, params)
    correction_map = dict(cursor)
    return " ".join([correction_map.get(t, t) for t in terms])

def spellchecked_search(conn, terms):
    corrected_terms = spellcheck_terms(conn, terms)
    cursor = conn.cursor()
    fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
    cursor.execute(fts_query, (corrected_terms,))
    return cursor.fetchall()

然后 returns [('All the Carmichael numbers',)] for spellchecked_search(db, "NUMMBER carmickaeel").

在 Python 中保留拼写检查处理，这样您就可以根据需要支持更复杂的 FTS 查询；您可能需要 reimplement the expression parser 才能这样做，但至少 Python 为您提供了做到这一点的工具。

一个完整的例子，将上述方法封装在 class 中，它只是将术语提取为字母数字字符序列（根据我对表达式语法规范的阅读，这就足够了）：

import re
import sqlite3
import sys

class FTS4SpellfixSearch(object):
    def __init__(self, conn, spellfix1_path):
        self.conn = conn
        self.conn.enable_load_extension(True)
        self.conn.load_extension(spellfix1_path)

    def create_schema(self):
        self.conn.executescript(
            """
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
                USING fts4(description text);
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
                USING fts4aux(fts4data);
            CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
                USING spellfix1;
            """
        )

    def index_text(self, *text):
        cursor = self.conn.cursor()
        with self.conn:
            params = ((t,) for t in text)
            cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
            cursor.execute(
                """
                INSERT INTO spellfix1data(word)
                SELECT term FROM fts4data_terms
                WHERE col='*' AND
                    term not in (SELECT word from spellfix1data_vocab)
                """
            )

    # fts3 / 4 search expression tokenizer
    # no attempt is made to validate the expression, only
    # to identify valid search terms and extract them.
    # the fts3/4 tokenizer considers any alphanumeric ASCII character
    # and character in the range U+0080 and over to be terms.
    if sys.maxunicode == 0xFFFF:
        # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
    else:
        # UCS4
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")

    def _terms_from_query(self, search_query):
        """Extract search terms from a fts3/4 query

        Returns a list of terms and a template such that
        template.format(*terms) reconstructs the original query.

        terms using partial* syntax are ignored, as you can't distinguish
        between a misspelled prefix search that happens to match existing
        tokens and a valid spelling that happens to have 'near' tokens in
        the spellfix1 database that would not otherwise be matched by fts4

        """
        template, terms, lastpos = [], [], 0
        for match in self._fts4_expr_terms.finditer(search_query):
            token, (start, end) = match.group(), match.span()
            # skip columnname: and partial* terms by checking next character
            ismeta = search_query[end:end + 1] in {":", "*"}
            # skip digits if preceded by "NEAR/"
            ismeta = ismeta or (
                token.isdigit() and template and template[-1] == "NEAR"
                and "/" in search_query[lastpos:start])
            if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
                # full search term, not a keyword, column name or partial*
                terms.append(token)
                token = "{}"
            template += search_query[lastpos:start], token
            lastpos = end
        template.append(search_query[lastpos:])
        return terms, "".join(template)

    def spellcheck_terms(self, search_query):
        cursor = self.conn.cursor()
        base_spellfix = """
            SELECT :term{0} as term, word FROM spellfix1data
            WHERE word MATCH :term{0} and top=1
        """
        terms, template = self._terms_from_query(search_query)
        params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
        query = " UNION ".join(
            [base_spellfix.format(i + 1) for i in range(len(params))]
        )
        cursor.execute(query, params)
        correction_map = dict(cursor)
        return template.format(*(correction_map.get(t, t) for t in terms))

    def search(self, search_query):
        corrected_query = self.spellcheck_terms(search_query)
        cursor = self.conn.cursor()
        fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
        cursor.execute(fts_query, (corrected_query,))
        return {
            "terms": search_query,
            "corrected": corrected_query,
            "results": cursor.fetchall(),
        }

以及使用 class 的交互式演示：

>>> db = sqlite3.connect(":memory:")
>>> fts = FTS4SpellfixSearch(db, './spellfix')
>>> fts.create_schema()
>>> fts.index_text("All the Carmichael numbers")  # your example
>>> from pprint import pprint
>>> pprint(fts.search('NUMMBER carmickaeel'))
{'corrected': 'numbers carmichael',
 'results': [('All the Carmichael numbers',)],
 'terms': 'NUMMBER carmickaeel'}
>>> fts.index_text(
...     "They are great",
...     "Here some other numbers",
... )
>>> pprint(fts.search('here some'))  # edgecase, multiple spellfix matches
{'corrected': 'here some',
 'results': [('Here some other numbers',)],
 'terms': 'here some'}
>>> pprint(fts.search('NUMMBER NOT carmickaeel'))  # using fts4 query syntax 
{'corrected': 'numbers NOT carmichael',
 'results': [('Here some other numbers',)],
 'terms': 'NUMMBER NOT carmickaeel'}

Answer 2

接受的答案很好（完全归功于他），这里有一个细微的变化，虽然对于复杂的情况不如接受的答案那么完整，但有助于理解这个想法：

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')
c = db.cursor()
c.execute("CREATE VIRTUAL TABLE mytable2 USING fts4(description text)")
c.execute("CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2)")
c.execute("CREATE VIRTUAL TABLE mytable3 USING spellfix1")
c.execute("INSERT INTO mytable2 VALUES ('All the Carmichael numbers')")   # populate the table
c.execute("INSERT INTO mytable2 VALUES ('They are great')")
c.execute("INSERT INTO mytable2 VALUES ('Here some other numbers')")
c.execute("INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*'")

def search(query):
    # Correcting each query term with spellfix table
    correctedquery = []
    for t in query.split():
        spellfix_query = "SELECT word FROM mytable3 WHERE word MATCH ? and top=1"
        c.execute(spellfix_query, (t,))
        r = c.fetchone()
        correctedquery.append(r[0] if r is not None else t)  # correct the word if any match in the spellfix table; if no match, keep the word spelled as it is (then the search will give no result!)

    correctedquery = ' '.join(correctedquery)

    # Now do the FTS
    fts_query = 'SELECT * FROM mytable2 WHERE description MATCH ?'
    c.execute(fts_query, (correctedquery,))
    return {'result': c.fetchall(), 'correctedquery': correctedquery, 'query': query}

print(search('NUMBBERS carmickaeel'))
print(search('some HERE'))
print(search('some qsdhiuhsd'))

结果如下：

{'query': 'NUMBBERS carmickaeel', 'correctedquery': u'numbers carmichael', 'result': [(u'All the Carmichael numbers',)]}
{'query': 'some HERE', 'correctedquery': u'some here', 'result': [(u'Here some other numbers',)]}
{'query': 'some qsdhiuhsd', 'correctedquery': u'some qsdhiuhsd', 'result': []}

备注：可以注意到 "Correcting each query term with spellfix table" 部分是用每个术语一个 SQL 查询完成的。研究了此查询与单个 UNION SQL 查询的性能 .

带有真实 "Full Text Search" 和拼写错误的 SQLite（FTS+spellfix 一起）

Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)

python

sqlite

full-text-search

levenshtein-distance

背景：

问题：