Best approach to index a PostgreSQL database with plain HTML content?

Question

I have a database with millions of posts, each of which has a "content" column containing the post content in plain HTML.

<div class="quoteheader"><a href="http://website.com/message?id=52501">
Quote from: X on October 22, 2013, 02:07:08 PM</a>
</div>
<div class="quote">
Hi, how are you all?
<br></div>
<br>I'm good, how about you?

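I want to build a full search tool that lets people search the posts. In this case, someone could search for "how are you" and the result would be this post.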

I thought about creating a tsvector index using GIN:

CREATE INDEX posts_content_search ON posts using gin(to_tsvector('simple', content));

so that searches like this are possible:

SELECT * FROM posts WHERE to_tsvector('simple', content) @@ phraseto_tsquery('simple', 'how are you');

However, while the index is being created, not only does it keep printing many messages like these:

DETAIL:  Words longer than 2047 characters are ignored.
NOTICE:  word is too long to be indexed

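but it would also store the HTML tags in the index (e.g. div, b, a, br...), whereas the best thing would be to strip the tags and index only the actual post content ("Hi, how are you all?" and "I'm good, how about you?").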

What is the best approach to create an index that allows this kind of search?

Answer 1

'simple' already excludes the contents of HTML tags, as you can see from the output of strip(to_tsvector('simple', content)):

                                                         strip                                                         
-----------------------------------------------------------------------------------------------------------------------
 '02' '07' '08' '2013' '22' 'about' 'all' 'are' 'from' 'good' 'hi' 'how' 'i' 'm' 'october' 'on' 'pm' 'quote' 'x' 'you'

Note the absence of 'br', 'div', etc.
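
If you want to check this yourself, ts_debug lists every token the parser finds, along with its type and the dictionaries the configuration applies to it. A minimal sketch, using a fragment of the sample post as input:

```sql
-- Show how the default parser classifies each token of a small HTML fragment
-- under the 'simple' configuration.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('simple', '<div class="quote">Hi, how are you all?<br></div>');
```

The rows whose alias is tag should show no dictionary and no lexemes, which is why the tags never make it into the tsvector.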

The "Quote from: X" part is included, because it is not inside a tag. If you want to exclude it, what logic would you use to do so?
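
As one possible answer to that question, here is a rough sketch that cuts out the whole quoteheader block with a regular expression before the text is indexed. The function name post_search_text and the index name are made up for illustration, and the pattern is only an assumption based on the single sample post; a regex will break on markup that differs from it (for example a nested div inside the header).

```sql
-- Hypothetical helper: replace each <div class="quoteheader">...</div> block
-- with a space. The pattern is an assumption taken from the sample post.
-- It has to be IMMUTABLE so it can be used in an index expression.
CREATE FUNCTION post_search_text(html text) RETURNS text
LANGUAGE sql IMMUTABLE
AS $$
    SELECT regexp_replace(html, '<div class="quoteheader">.*?</div>', ' ', 'g');
$$;

-- Index the cleaned text instead of the raw column.
CREATE INDEX posts_content_search_noquote ON posts
    USING gin (to_tsvector('simple', post_search_text(content)));

-- The query has to repeat the same expression for the index to be usable.
SELECT *
FROM posts
WHERE to_tsvector('simple', post_search_text(content))
      @@ phraseto_tsquery('simple', 'how are you');
```

The remaining tags are still left to the text search parser, so only the quote header needs special handling here.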

The notices about long words can be ignored. If you want advice on fixing them, you should show us an example that produces them.
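
If the notices are just noise during the index build, one option is to raise client_min_messages for the session so that NOTICE-level messages are not sent to the client. A small sketch, reusing the index definition from the question:

```sql
-- Hide NOTICE-level messages for this session while the index is built.
SET client_min_messages = warning;

CREATE INDEX posts_content_search ON posts
    USING gin (to_tsvector('simple', content));

-- Go back to the default message level afterwards.
RESET client_min_messages;
```

This only hides the messages; the overlong words are still skipped by the indexer.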

