Using Full-Text indexing to crawl binary blobs

If I store binary files (e.g. doc, html, xml, xps, docx, pdf) in a varbinary(max) column in SQL Server, how can I use Full-Text indexing to crawl the binary files?

Imagine I create a table to store binary files:

CREATE TABLE Documents (
    DocumentID int IDENTITY,
    Filename nvarchar(4000), --nvarchar(n) allows at most 4000 characters
    Data varbinary(max)
)

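How can I leverage the IFilter system provided by Windows to crawl these binary files and extract useful, searchable information?

The motivation for this, of course, is that Microsoft's Indexing Service is deprecated and has been replaced by Windows Search. Indexing Service provided an OLE DB provider (MSIDX) that SQL Server could use to query the Indexing Service catalog.

Windows Search, on the other hand, offers no way to query its catalog from SQL Server. SQL Server has no access to Windows Search.

Fortunately, the functionality of Windows Search (and of Indexing Service before it) was brought into SQL Server itself. SQL Server Full-Text indexing uses the same IFilter mechanism that has existed for 19 years.

The question is: how do I use it to crawl blobs stored in the database?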

SQL Server Full-Text can index varbinary and image columns.

You can see the list of all file types currently supported by SQL Server:

SELECT * FROM sys.fulltext_document_types

For example:

| document_type | class_id                             | path                                                                             | version           | manufacturer          |
|---------------|--------------------------------------|----------------------------------------------------------------------------------|-------------------|-----------------------|
| .doc          | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll                                                  | 2008.0.9200.16384 | Microsoft Corporation |
| .txt          | C7310720-AC80-11D1-8DF3-00C04FB6EF4F | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\msfte.dll   | 12.0.6828.0       | Microsoft Corporation |
| .xls          | F07F3920-7B8C-11CF-9BE8-00AA004B9986 | C:\Windows\system32\offfilt.dll                                                  | 2008.0.9200.16384 | Microsoft Corporation |
| .xml          | 41B9BE05-B3AF-460C-BF0B-2CDD44A093B1 | c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\xmlfilt.dll | 12.0.9735.0       | Microsoft Corporation |

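When creating a varbinary (or image) column to contain a binary file, you must have another string column that gives the file type through its extension (e.g. ".doc"):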

CREATE TABLE Documents (
    DocumentID int IDENTITY,
    Filename nvarchar(4000), --nvarchar(n) allows at most 4000 characters
    Data varbinary(max),
    DataType varchar(50) --contains the file extension (e.g. ".docx", ".pdf")
)

When adding the binary column to the full-text index, SQL Server needs you to tell it which column contains the data type string:

ALTER FULLTEXT INDEX ON [dbo].[Documents] 
ADD ([Data] TYPE COLUMN [DataType])
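
The ALTER FULLTEXT INDEX statement above presumes that a full-text index already exists on the table. If it does not, a minimal setup sketch might look like the following. The catalog name Scratch is borrowed from the status query later in this answer, the index name UX_Documents_DocumentID is hypothetical, and language 1033 (English) is an assumption; because the Data column is declared when the index is created, the ALTER statement is not needed in that case.

```sql
-- Sketch only: names below are assumptions, adjust them to your database.

-- 1. A catalog to hold the full-text index.
CREATE FULLTEXT CATALOG Scratch;

-- 2. Full-text indexing needs a unique, single-column, non-nullable key index.
--    DocumentID is an IDENTITY column, so it is already NOT NULL.
CREATE UNIQUE INDEX UX_Documents_DocumentID
    ON dbo.Documents (DocumentID);

-- 3. Index the blob column, using DataType to pick the IFilter per row.
CREATE FULLTEXT INDEX ON dbo.Documents
(
    Data TYPE COLUMN DataType LANGUAGE 1033  -- 1033 = English (assumed)
)
KEY INDEX UX_Documents_DocumentID
ON Scratch
WITH (CHANGE_TRACKING = AUTO);               -- re-index rows as they change
```

With automatic change tracking, newly inserted rows should be picked up by the crawler without an explicit population command.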

You can test this by importing a binary file from the file system on the server:

INSERT INTO Documents(filename, DataType, data) 
SELECT 
   'Managing Storage Spaces with PowerShell.doc' AS Filename, 
   '.doc', * 
FROM OPENROWSET(BULK N'C:\Managing Storage Spaces with PowerShell.doc', SINGLE_BLOB) AS Data

You can view the catalog status using:

DECLARE @CatalogName varchar(50);
SET @CatalogName = 'Scratch';

SELECT
    CASE FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus')
    WHEN 0 THEN 'Idle'
    WHEN 1 THEN 'Full population in progress'
    WHEN 2 THEN 'Paused'
    WHEN 3 THEN 'Throttled'
    WHEN 4 THEN 'Recovering'
    WHEN 5 THEN 'Shutdown'
    WHEN 6 THEN 'Incremental population in progress'
    WHEN 7 THEN 'Building index'
    WHEN 8 THEN 'Disk is full. Paused.'
    WHEN 9 THEN 'Change tracking'
    ELSE 'Unknown'
    END+' ('+CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateStatus') AS varchar(50))+')' AS PopulateStatus,
    FULLTEXTCATALOGPROPERTY(@CatalogName, 'ItemCount') AS ItemCount,
    CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'IndexSize') AS varchar(50))+ ' MiB' AS IndexSize,
    CAST(FULLTEXTCATALOGPROPERTY(@CatalogName, 'UniqueKeyCount') AS varchar(50))+' words' AS UniqueKeyCount,
    FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAge') AS PopulateCompletionAge,
    --PopulateCompletionAge is measured in seconds since 1990-01-01 00:00:00
    DATEADD(second, FULLTEXTCATALOGPROPERTY(@CatalogName, 'PopulateCompletionAge'), '19900101') AS PopulateCompletionDate

And you can query the catalog:

SELECT * FROM Documents
WHERE FREETEXT(Data, 'Bruce')
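
FREETEXT matches on meaning rather than exact wording. As a hedged sketch of two other ways to look at the same index, the queries below use CONTAINS for exact phrase matching, CONTAINSTABLE for ranked results, and sys.dm_fts_index_keywords to inspect what the IFilter actually extracted. The search terms are only illustrative, and the join assumes DocumentID is the full-text key column.

```sql
-- Exact phrase plus a second required term (search terms are illustrative).
SELECT DocumentID, Filename
FROM dbo.Documents
WHERE CONTAINS(Data, '"Storage Spaces" AND PowerShell');

-- Ranked matches; [KEY] is the full-text key, assumed here to be DocumentID.
SELECT d.DocumentID, d.Filename, k.[RANK]
FROM CONTAINSTABLE(dbo.Documents, Data, '"Storage Spaces"') AS k
JOIN dbo.Documents AS d
    ON d.DocumentID = k.[KEY]
ORDER BY k.[RANK] DESC;

-- Peek at the terms that were extracted from the blobs.
SELECT TOP (20) display_term, document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID(N'dbo.Documents'))
ORDER BY document_count DESC;
```

If the last query returns no rows for a document you just inserted, the population may still be running, or no filter may be registered for that row's extension.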

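Additional IFilters

SQL Server has a limited set of built-in filters. It can also use IFilter implementations registered on the system (e.g. the Microsoft Office 2010 Filter Pack, which provides docx, msg, one, pub, vsx, xlsx and zip support).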

You must enable the use of OS-level filters by enabling the option:

--Enable OS-level filters
EXEC sp_fulltext_service 'load_os_resources', 1

and restart the SQL Server service.

load_os_resources int

Indicates whether operating system word breakers, stemmers, and filters are registered and used with this instance of SQL Server. One of:

0: Use only filters and word breakers specific to this instance of SQL Server.
1: Load operating system filters and word breakers.

By default, this property is disabled to prevent inadvertent behavior changes by updates made to the operating system. Enabling use of operating system resources provides access to resources for languages and document types registered with Microsoft Indexing Service that do not have an instance-specific resource installed. If you enable the loading of operating system resources, ensure that the operating system resources are trusted signed binaries; otherwise, they cannot be loaded when verify_signature is set to 1.

If you are using a version of SQL Server prior to SQL Server 2008, you must also restart the Full-Text Indexing service after enabling this option:

net stop msftesql
net start msftesql
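
Once the option is enabled and the services have restarted, it can be worth confirming that the OS-level filters were actually picked up. The following is a sketch under assumptions: the extensions listed are only examples of types a filter pack might add, and the final statement re-crawls the whole table so that rows skipped earlier for lack of a filter get another chance.

```sql
-- Confirm the service-level settings (1 = enabled).
SELECT FULLTEXTSERVICEPROPERTY('LoadOSResources') AS LoadOSResources,
       FULLTEXTSERVICEPROPERTY('VerifySignature') AS VerifySignature;

-- Check that filters for the extensions you care about are now registered.
-- The extensions here are examples; an empty result means no filter was loaded.
SELECT document_type, path, manufacturer
FROM sys.fulltext_document_types
WHERE document_type IN ('.docx', '.xlsx', '.pdf');

-- Re-crawl existing rows so previously unfiltered documents are indexed.
ALTER FULLTEXT INDEX ON dbo.Documents START FULL POPULATION;
```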

Microsoft provides a Filter Pack containing IFilters for the Office 2007 file types.

And Adobe provides an IFilter for indexing PDFs (Foxit provides one, but theirs is not free).

Bonus Reading