String similarity with Python + Sqlite (Levenshtein distance / edit distance)

Is there a string similarity measure available in Python+Sqlite, for example with the sqlite3 module?

Example of use case:

import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')

This query should match the row with ID 1, but not the row with ID 2:

c.execute('SELECT * FROM mytable WHERE dist(description, "He lo wrold gyus") < 6')

How to do this in Sqlite+Python?
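
For reference, one way to get a dist() function without compiling anything is to register a pure-Python function with sqlite3's create_function. The following is only a minimal sketch under a few assumptions: the helper name levenshtein is hypothetical, it implements plain Levenshtein distance (unit cost per insertion, deletion or substitution), and it runs in Python for every row, so it will be slow on large tables. Because plain Levenshtein counts each swapped pair of letters as two substitutions, the threshold used here is 8 rather than 6.

```python
import sqlite3


def levenshtein(a, b):
    """Plain Levenshtein distance (hypothetical helper, unit costs)."""
    if a is None or b is None:
        return None  # a NULL input gives a NULL result
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


conn = sqlite3.connect(':memory:')
# Expose the Python function to SQL as dist(x, y)
conn.create_function('dist', 2, levenshtein)
c = conn.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'hello world, guys')")
c.execute("INSERT INTO mytable VALUES (2, 'hello there everybody')")
c.execute('SELECT * FROM mytable WHERE dist(description, ?) < 8',
          ('He lo wrold gyus',))
rows = c.fetchall()
print(rows)
```

With this threshold only the row with ID 1 is returned. The comparison is case-sensitive, so "H" versus "h" counts as one edit.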

Notes about what I've found so far:

Here is a ready-to-use example test.py:

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')                 # for Linux
#db.load_extension('./spellfix.dll')            # <-- UNCOMMENT HERE FOR WINDOWS
db.enable_load_extension(False)
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute('INSERT INTO mytable VALUES (1, "hello world, guys")')
c.execute('INSERT INTO mytable VALUES (2, "hello there everybody")')
c.execute('SELECT * FROM mytable WHERE editdist3(description, "hel o wrold guy") < 600')
print(c.fetchall())   # parentheses work on both Python 2 and Python 3
# Output: [(1, u'hello world, guys')]   (Python 3 prints it without the u prefix)

Important note: the editdist3 distance is normalized so that

the value of 100 is used for insertion and deletion and 150 is used for substitution
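
To get a feeling for what a threshold like 600 means with these costs, here is a small pure-Python sketch of a weighted edit distance. It is not the editdist3 implementation, and the real function may return different values in some cases. The function name weighted_edit_distance and its parameters are hypothetical; only the two cost values (100 and 150) come from the quote above.

```python
def weighted_edit_distance(a, b, ins_del_cost=100, sub_cost=150):
    """Edit distance with separate costs for insert/delete and substitute.

    Illustrative only: the default costs mirror the values quoted above,
    but this is not the code used by editdist3.
    """
    # previous[j] is the cost of turning the processed prefix of a into b[:j]
    previous = [j * ins_del_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        current = [i * ins_del_cost]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + ins_del_cost,      # delete ca
                current[j - 1] + ins_del_cost,   # insert cb
                previous[j - 1] + (0 if ca == cb else sub_cost),
            ))
        previous = current
    return previous[-1]


# Two substitutions (k->s, e->i) plus one insertion (g): 150 + 150 + 100
print(weighted_edit_distance('kitten', 'sitting'))  # 400
```

Under these assumptions, a limit of 600 allows roughly six insertions or deletions, or four substitutions, or some mix of both.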


Here is what to do first on Windows:

  1. Download https://sqlite.org/2016/sqlite-src-3110100.zip and https://sqlite.org/2016/sqlite-amalgamation-3110100.zip, and unzip them.

  2. Replace C:\Python27\DLLs\sqlite3.dll with the new sqlite3.dll from here. If you skip this, you will later get sqlite3.OperationalError: The specified procedure could not be found.

  3. Run:

    call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat"

    or, for a 64-bit build:

    call "C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\vcvarsall.bat" x64
    cl /I sqlite-amalgamation-3110100/ sqlite-src-3110100/ext/misc/spellfix.c /link /DLL /OUT:spellfix.dll
    python test.py
    

    (With MinGW, it would be: gcc -g -shared spellfix.c -I ~/sqlite-amalgamation-3230100/ -o spellfix.dll)

Here is how to do it on Debian Linux:

apt-get -y install unzip build-essential libsqlite3-dev
wget https://sqlite.org/2016/sqlite-src-3110100.zip
unzip sqlite-src-3110100.zip
gcc -shared -fPIC -Wall -Isqlite-src-3110100 sqlite-src-3110100/ext/misc/spellfix.c -o spellfix.so
python test.py

Here is how to do it on Debian Linux with an older Python version:

If your distribution's Python is a bit old, a different method is needed. Since the sqlite3 module is built into Python, it is not straightforward to upgrade it (pip install --upgrade pysqlite would only upgrade the pysqlite module, not the underlying SQLite library). The following method seems to work, for example if import sqlite3; print sqlite3.sqlite_version reports 3.8.2:

wget https://www.sqlite.org/src/tarball/27392118/SQLite-27392118.tar.gz
tar xvfz SQLite-27392118.tar.gz
cd SQLite-27392118 ; sh configure ; make sqlite3.c ; cd ..
gcc -g -fPIC -shared SQLite-27392118/ext/misc/spellfix.c -I SQLite-27392118/src/ -o spellfix.so
python test.py   # [(1, u'hello world, guys')]
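
Before choosing between the two Linux methods, it can help to check which SQLite library your Python is linked against and whether extension loading is available at all. A quick sketch (the function name sqlite_diagnostics is hypothetical; it assumes that a Python built without loadable-extension support simply lacks the enable_load_extension method on connections):

```python
import sqlite3


def sqlite_diagnostics():
    """Collect the SQLite library version and extension-loading support."""
    conn = sqlite3.connect(':memory:')
    try:
        # The method is missing when Python was built without extension support
        can_load = hasattr(conn, 'enable_load_extension')
    finally:
        conn.close()
    return {
        'sqlite_version': sqlite3.sqlite_version,            # e.g. '3.8.2'
        'sqlite_version_info': sqlite3.sqlite_version_info,  # same, as a tuple
        'can_load_extensions': can_load,
    }


info = sqlite_diagnostics()
print(info)
```

If can_load_extensions is False, compiling spellfix will not help until you use a Python build that enables loadable extensions.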

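I implemented several distance-related functions (Damerau-Levenshtein, Jaro-Winkler, longest common substring and subsequence) as a SQLite run-time loadable extension. Any UTF-8 string is supported.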

https://github.com/schiffma/distlib