How to efficiently store repetitive data in Solr without affecting performance

The data I'm storing in Solr is structured roughly like this:

[{
    "Product": "Boomerang",
    "Price": 42,
    "Stores": ["Sport Shack", "Joe's Sport Supplies", "Sports and More", "The Outdoor Shop"]
},
{
    "Product": "Juggling Chainsaws",
    "Price": 94,
    "Stores": ["Sport Shack", "Joe's Sport Supplies", "Sports and More", "The Outdoor Shop"]
},
{
    "Product": "Chainsaw",
    "Price": 5,
    "Stores": ["Labor Store", "The Outdoor Shop", "Fish n Woodchips"]
}]

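There are thousands of different products that have the same values in the "Stores" field.

Is there a way to avoid storing these identical values over and over without hurting search performance for queries such as 'Find a chainsaw from Labor Store'?

This is what I had in mind: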

[{
    "Product": "Boomerang",
    "Price": 42,
    "StoreGroup": "NoveltySportsStores"
},
{
    "Product": "Juggling Chainsaws",
    "Price": 94,
    "StoreGroup": "NoveltySportsStores"
},
{
    "Product": "Chainsaw",
    "Price": 5,
    "StoreGroup": "OutdoorsStores"
},
{
    "NoveltySportsStores": ["Sport Shack", "Joe's Sport Supplies", "Sports and More", "The Outdoor Shop"]
},
{
    "OutdoorsStores": ["Labor Store", "The Outdoor Shop", "Fish n Woodchips"]
}]

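Edit: This example is entirely made up. In my real use case the groups stay constant; each group is repeated about 5,000 times, and there are about 50,000 groups in total.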

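You are treating Solr/Lucene as if it were an RDBMS, which it is not. The first approach may look like too much repetition and a waste of resources, but it is not. It is the natural and best way to index the data.

You could use the second approach as well, but the first is better and simpler.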
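To illustrate why the first approach is simpler, here is a minimal sketch of how the 'Find a chainsaw from Labor Store' query could be expressed against the denormalized documents. It only builds the request URL for Solr's standard `/select` handler using the `q` and `fq` parameters. The base URL, the core name `products`, and the helper `build_select_url` are assumptions made for the example, and it assumes `Stores` is indexed as a multi-valued field.

```python
from urllib.parse import urlencode


def build_select_url(base_url, product_term, store):
    """Build a Solr /select URL that matches a product and filters by store.

    base_url is hypothetical, e.g. http://localhost:8983/solr/products.
    """
    params = {
        # Main query: match the term against the Product field.
        "q": f"Product:{product_term}",
        # Filter query: restrict to documents whose Stores field
        # contains this store. Quoted so the name is treated as a phrase.
        "fq": f'Stores:"{store}"',
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Assumed local Solr instance and core name, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/products", "chainsaw", "Labor Store"
)
print(url)
```

With the first layout the store filter applies directly to each product document. With the second layout the store list lives in a separate group document, so the same question needs an extra lookup or a join to get from a store name to the products.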