最适合前缀匹配搜索的数据结构

Question

我必须创建一个客户列表系统（可以有 1000 万客户），每个客户都有一个唯一的 ID，一个唯一的 ID 由 10 个字母组成，前 3 个是大写字母，最后一个7 位是数字（例如：LQK0333208、HCK1646129、...）。系统必须以最快的方式执行两种搜索操作（完全匹配搜索和部分匹配搜索）：

对于完全匹配搜索，用户输入完整的客户ID，系统会显示匹配客户的详细信息，如果没有匹配客户，则会显示错误消息。
对于部分匹配搜索，用户输入多个（最少5个，最多8个）Customer ID的首字母，系统会显示匹配客户的详细信息，如果没有匹配客户则提示错误信息。如果匹配的客户数大于10，则只显示其中的10个。

那么适合这个系统的数据结构是什么？目前，我正在使用AVL树来处理问题：

对于精确匹配搜索，我将进行对数搜索（左右子树）：O(log(n))。
对于部分匹配搜索，我将对 AVL 树执行顺序搜索并检查每个客户是否具有所需的前缀。这是一个线性搜索：O(n).

但我想进行部分匹配搜索，系统会在时间复杂度方面进行更好的搜索。那么关于数据结构的任何建议是否适合系统的要求？

编辑 1： 我已经使用 Trie 树和三元搜索树测试了该程序，但针对的是更大的数据集，例如（1000 万客户）。我无法将内存中的数据结构存储在具有更大数据集的内存中。那么有什么建议吗？
编辑 2: 我已经测试了排序数组数据结构，它在 1000 万用户的数据集上运行良好。实际上，当我对 Trie 树或三元树一无所知时，这是我的第一个方法。据我了解，首先我们将所有客户存储在一个数组中，然后使用一些排序算法（如快速排序）对数组进行排序。然后进行二分查找查找key，也就是O(log(n))进行查找操作，相当不错！但是从长远来看，当我们需要向数组添加额外的数据（不是创建新数据，而是添加到数组）时，例如只需要多一个客户，那么添加新元素最坏情况下会花费 O(n)案例，因为我们需要找到添加和移动元素的位置。但是对于像 Trie 或 Ternary 树这样的数据结构，在添加新元素时，它可能只需要 O(1)，因为我们只需要遍历树来找到字符串。如果我们不介意 space 的复杂性，我认为 trie 或三叉树最适合这个项目。

Answer 1

一个合适的数据结构是trie。这是一棵所有前缀的树，其中每个节点（根节点除外）代表一个字符，从根节点到叶子节点的每条可能路径都是一个字符序列，对应一个有效的ID。

部分匹配意味着存在从根到内部节点的路径。

如果使用高效的 child 查找实现，则可以在 10 个步骤中找到此特定用例中的匹配项。因此，如果我们将 10 视为常数，则可以在常数时间内完成匹配，而不管树有多大（即宽）。这假设可以在恒定时间内（平均）通过字符查找 child。

因为在这个特定的用例中，字母表是有限的（仅限大写或仅限数字），一个节点最多可以有 26 个 child 条目，这些条目可以存储在该大小的数组中，其中索引映射到相应的字符。这将确保从 parent 节点步进到相关 child 节点的时间恒定。或者，也可以使用哈希系统（而不是具有 26 个槽的数组）。

这是 JavaScript 中的演示实现（使用 object 表示 children，即“字典”）：

class TrieNode {
    constructor(data=null) {
        this.children = {}; // Dictionary, <character, TrieNode>
        this.data = data; // Non-null when this node represents the end of a valid word
    }
    addWord(word, data) {
        let node = this; // the root of the tree
        for (let ch of word) {
            if (!(ch in node.children)) {
                node.children[ch] = new TrieNode(); 
            }
            node = node.children[ch]; // Walk down the tree
        }
        node.data = data;
    }
    *getAllData() { // This method returns an iterator over all data in this subtree
        if (this.data != null) yield this.data;
        // Recursively yield all data in the children's subtrees
        for (const child in this.children) yield* this.children[child].getAllData();
    }
    *find(prefix) { // This method returns an iterator over matches
        let node = this;
        // Find the node where this prefix ends:
        for (let ch of prefix) {
            if (!(ch in node.children)) return; // No matches
            node = node.children[ch];
        }
        // Yield all data in this subtree
        yield* node.getAllData();
    }
}

class Customer {
    constructor(id, name) {
        this.id = id;
        this.name = name;
    }
    toString() {
        return this.name + " (" + this.id + ")";
    }
}

// Demo
// Create some Customer data:
const database = [
    new Customer('LQK0333208', 'Hanna'),
    new Customer('LQK0333311', 'Bert'),
    new Customer('LQK0339999', 'Joline'),
    new Customer('HCK1646129', 'Sarah'),
    new Customer('HCK1646130', 'Pete'),
    new Customer('HCK1700012', 'Cristine')
];

// Build a trie for the database of customers
const trie = new TrieNode(); // The root node of the trie.
for (const customer of database) {
    trie.addWord(customer.id, customer);
}
// Make a few queries
console.log("query: LQK0333");
for (const customer of trie.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of trie.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of trie.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of trie.find("LQK09")) console.log("found: " + customer);

排序数组

另一种方法是将客户记录存储在排序数组中。 JavaScript 没有这样的数据结构，但是 splice 在 JavaScript 中出奇地快，因此您可以通过在其排序位置插入新条目来维护排序顺序。二分查找可用于定位查找或插入条目的索引：

class SortedArray {
    constructor(keyField) {
        this.arr = [];
        this.keyField = keyField;
    }
    addObject(obj) {
        const i = this.indexOf(obj[this.keyField]);
        if (this.arr[i]?.[this.keyField] === obj[this.keyField]) throw "Duplicate not added";
        this.arr.splice(i, 0, obj);
    }
    *find(prefix) { // This method returns an iterator over matches
        for (let i = this.indexOf(prefix); i < this.arr.length; i++) {
            const obj = this.arr[i];
            if (!obj[this.keyField].startsWith(prefix)) return;
            yield obj;
        }
    }
    indexOf(key) {
        let low = 0, high = this.arr.length;
        while (low < high) {
            const mid = (low + high) >> 1;
            if (key === this.arr[mid][this.keyField]) return mid;
            if (key > this.arr[mid][this.keyField]) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }
}

class Customer {
    constructor(id, name) {
        this.id = id;
        this.name = name;
    }
    toString() {
        return this.name + " (" + this.id + ")";
    }
}

const database = [
    new Customer('LQK0333208', 'Hanna'),
    new Customer('LQK0333311', 'Bert'),
    new Customer('LQK0339999', 'Joline'),
    new Customer('HCK1646129', 'Sarah'),
    new Customer('HCK1646130', 'Pete'),
    new Customer('HCK1700012', 'Cristine')
];

const arr = new SortedArray("id");
for (const customer of database) {
    arr.addObject(customer);
}
console.log("query: LQK0333");
for (const customer of arr.find("LQK0333")) console.log("found: " + customer);
console.log("query: HCK16461");
for (const customer of arr.find("HCK16461")) console.log("found: " + customer);
console.log("query: LQK0339999");
for (const customer of arr.find("LQK0339999")) console.log("found: " + customer);
console.log("query: LQK09 should not yield results");
for (const customer of arr.find("LQK09")) console.log("found: " + customer);

最适合前缀匹配搜索的数据结构

Best suited data structure for prefix matching search

search

partial

data-structures

排序数组