Can Locality Sensitive Hashing be used on dynamic data?
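Can Locality Sensitive Hashing be used on dynamic data? For example, suppose I first run LSH on 1,000,000 documents and store the results in an index, and then I want to add another document to the index I created. Can I do that with LSH?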
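Yes, you can. You just need to compute the Jaccard similarity (estimated from the stored MinHash signatures) between the added document and each of the existing documents, and add the results to your index.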
CREATE TABLE Documents (
    ID INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
    MinHashes BINARY(512), -- serialized Min Hash results
    Name NVARCHAR(255) UNIQUE NOT NULL,
    Content VARBINARY(MAX)
)
CREATE TABLE SimilarDocumentIndex (
    DocumentAID INT NOT NULL REFERENCES Documents(ID),
    DocumentBID INT NOT NULL REFERENCES Documents(ID),
    Similarity FLOAT, -- Jaccard Similarity 0.0...1.0
    PRIMARY KEY CLUSTERED (DocumentAID, DocumentBID)
)
--
-- Find similar documents
--
SELECT TOP 20 DocumentBID AS DocumentID
FROM SimilarDocumentIndex
WHERE DocumentAID = @DocumentID
ORDER BY Similarity DESC
--
-- Compare two documents
--
SELECT Similarity
FROM SimilarDocumentIndex
WHERE DocumentAID = @DocumentAID AND DocumentBID = @DocumentBID
--
-- Adding a new document
--
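-- @name and @content are supplied by the caller
SET @MinHashes = dbo.CalcMinHashes(@content)
INSERT INTO Documents (MinHashes, Name, Content)
VALUES (@MinHashes, @name, @content)
SET @DocumentID = SCOPE_IDENTITY()
-- Store the similarity of the new document to every existing document
INSERT INTO SimilarDocumentIndex (DocumentAID, DocumentBID, Similarity)
SELECT @DocumentID, ID, dbo.JaccardSimilarity(@MinHashes, MinHashes)
FROM Documents
WHERE ID <> @DocumentID
-- Mirror the new pairs so lookups work in either direction
INSERT INTO SimilarDocumentIndex (DocumentAID, DocumentBID, Similarity)
SELECT DocumentBID, @DocumentID, Similarity
FROM SimilarDocumentIndex
WHERE DocumentAID = @DocumentID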
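The SQL above calls dbo.CalcMinHashes and dbo.JaccardSimilarity without showing them. As a rough illustration of what such functions could do, here is a minimal Python sketch. The names `calc_min_hashes` and `jaccard_similarity`, the 3-word shingles, the 128-value signature, and the seeded BLAKE2b hashing are all assumptions made for this example, not the answer's actual implementation.

```python
import hashlib

NUM_HASHES = 128  # assumed signature length, not taken from the answer


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (assumed tokenization)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _seeded_hash(seed, item):
    """Deterministic 64-bit hash of an item; each seed acts as one hash function."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def calc_min_hashes(text, num_hashes=NUM_HASHES):
    """MinHash signature: for each hash function, the minimum hash over all shingles."""
    items = shingles(text)
    return [min(_seeded_hash(seed, s) for s in items) for seed in range(num_hashes)]


def jaccard_similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    old_doc = calc_min_hashes("the quick brown fox jumps over the lazy dog")
    new_doc = calc_min_hashes("the quick brown fox jumps over the lazy cat")
    print(jaccard_similarity(old_doc, new_doc))
```

Because the hash functions are fixed by their seeds, a document added later gets a signature that is directly comparable with the ones already stored, which is what makes the incremental insert possible.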
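Yes.
LSH uses multiple hash functions to produce a signature with multiple values, and that signature is split into bands to build the index. If you store the random hash functions and the banding scheme, you can reuse them to generate the index entries for a new insert. Every newly inserted document then gets its corresponding index entries.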
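The banding idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `MinHashLSHIndex`, the 32 bands of 4 rows, and the seeded BLAKE2b hash functions are hypothetical choices, not something the answer specifies. The point it shows is that the hash functions and band layout are fixed when the index is created, so a later `add` touches only the buckets for the new document.

```python
import hashlib
from collections import defaultdict


class MinHashLSHIndex:
    """Banded MinHash index that accepts documents one at a time."""

    def __init__(self, num_bands=32, rows_per_band=4):
        # The "stored" hash functions are the seeds 0..num_hashes-1;
        # the "stored" banding scheme is (num_bands, rows_per_band).
        self.num_bands = num_bands
        self.rows_per_band = rows_per_band
        self.num_hashes = num_bands * rows_per_band
        self.buckets = [defaultdict(set) for _ in range(num_bands)]

    @staticmethod
    def _seeded_hash(seed, token):
        data = f"{seed}:{token}".encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

    def _signature(self, tokens):
        tokens = set(tokens)
        if not tokens:
            raise ValueError("document has no tokens")
        return [min(self._seeded_hash(seed, t) for t in tokens)
                for seed in range(self.num_hashes)]

    def _band_keys(self, signature):
        r = self.rows_per_band
        for band in range(self.num_bands):
            yield band, tuple(signature[band * r:(band + 1) * r])

    def add(self, doc_id, tokens):
        """Index one document; existing entries are left untouched."""
        for band, key in self._band_keys(self._signature(tokens)):
            self.buckets[band][key].add(doc_id)

    def candidates(self, tokens):
        """Return ids of documents sharing at least one band with the query."""
        found = set()
        for band, key in self._band_keys(self._signature(tokens)):
            found |= self.buckets[band].get(key, set())
        return found


if __name__ == "__main__":
    index = MinHashLSHIndex()
    index.add("d1", "the quick brown fox jumps over the lazy dog".split())
    index.add("d2", "stock markets closed higher on friday afternoon".split())
    # A document that arrives after the index was built:
    index.add("d3", "the quick brown fox jumps over the lazy cat".split())
    print(index.candidates("the quick brown fox jumps over the lazy cat".split()))
```

Candidates returned this way are usually verified afterwards by comparing full signatures or the documents themselves, since sharing one band only suggests similarity.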