有什么办法可以加快 anagram 求解器的排列创建速度吗？

Question

我目前正在尝试制作一个非常快速的 Anagram 求解器，但现在它因排列的创建而遇到瓶颈。还有另一种方法来完成整个程序或优化排列创建吗？这是我的代码：

#include <string>
#include <vector>
#include <algorithm>
#include <iostream>
#include <fstream>
#include <unordered_set>
#include <vector>
#include <boost/asio/thread_pool.hpp>
#include <boost/asio/post.hpp>



void get_permutations(std::string s, std::vector<std::string> &permutations)
{
    std::sort(s.begin(), s.end());
    do
    {
        permutations.push_back(s);
    } while (std::next_permutation(s.begin(), s.end()));
}


void load_file(std::unordered_set<std::string> &dictionary, std::string filename)
{
    std::ifstream words(filename);
    std::string element;
    while (words >> element)
    {
        std::transform(element.begin(), element.end(), element.begin(), ::tolower);
        dictionary.insert(element);
    }
}

void print_valid(const std::unordered_set<std::string>& dictionary, const std::vector<std::string>::const_iterator start, const std::vector<std::string>::const_iterator stop)
{
    for (auto iter = start; iter != stop; iter++)
    {

        if (dictionary.contains(*iter) == true)
        {
            std::cout << *iter << "\n";
        }
    }
}

int main()
{
    const std::string s = "asdfghjklq";
    std::vector<std::string> permutations;
    boost::asio::thread_pool pool(2);
    std::cout << "Loading english dictionary\n";
    std::unordered_set<std::string> dictionary;
    load_file(dictionary, "words");
    std::cout << "Done\n";

    //std::cout << "Enter the anagram: ";
    //getline(std::cin, s);

    clock_t start = clock();


    get_permutations(s, permutations);
    //std::cout << permutations.size() << std::endl;

    std::cout << "finished permutations\n";

    
    if (permutations.size() > 500000)
    {
        std::cout << "making new\n";
        for (size_t pos = 0; pos < permutations.size(); pos += (permutations.size() / 3))
        {
            boost::asio::post(pool, [&dictionary, &permutations, pos] { print_valid(dictionary, (permutations.begin() + pos), (permutations.begin() + pos + (permutations.size() /3) ) ); });
        }
        pool.join();
    } 
    else
    {
        print_valid(dictionary, permutations.begin(), permutations.end());
    }

    

    clock_t finish = clock();
    double time_elapsed = (finish - start) / static_cast<double>(CLOCKS_PER_SEC);
    std::cout << time_elapsed << "\n";
    std::cout << permutations.size() << std::endl;

    return 0;
}

排列的创建在 get_permutations 线程池可以测试非常大的排列组合

Answer 1

您的排列方法太慢了，特别是因为 n 个不同字符的字符串的排列数量呈超指数级增长。尝试类似散列和相等谓词的方法，其中散列基于排序的字符串，而相等谓词仅测试 2 个字符串的排序版本是否相等。您可以使用 boost::unordered_map 创建自定义哈希函数并将符合字谜的单词添加到键集中。

Answer 2

想一想您将如何手动解决这个问题 - 您如何检查两个词是否是彼此的字谜？

例如：banana <-> aaannb

你会如何在一张纸上解决这个问题？你会创建所有 720 个排列并检查是否有一个匹配吗？或者有更简单、更直观的方法吗？

那么是什么让一个词成为另一个词的变位词，即需要满足什么条件？

一切都与字母数有关。如果两个单词包含的所有字母数量相等，则它们是彼此的变位词。

例如：

banana -> 3x a, 2x n, 1x b
aaannb -> 3x a, 2x n, 1x b
相同的字母很重要，所以它们必须是变位词！

有了这些知识，您能否构建一个不需要迭代所有可能排列的算法？

解决方案

我只建议您在尝试提出自己的优化算法后阅读此书

您只需要构建一个查找-table 字母计数到字典单词，例如：

1x a, 1x n -> ["an"]
3x a, 1x b, 2x n -> ["banana", "nanaba"]
1x a, 1x p, 1x r, 1x t -> ["part", "trap"]
... etc ...

然后您也可以将搜索词分解为字母数，例如banana -> 3x a, 1x b, 2x n 并在查找 table.

中搜索分解

结果将是您可以使用给定字母集合构建的词典中的单词列表 - 即给定字符串的所有可能的字谜。

aussuming 某种名为 letter_counts 的结构，其中包含算法可能看起来像的字母组合：

std::vector<std::string> find_anagrams(std::vector<std::string> const& dictionary, std::string const& wordToCheck) {
    // build a lookup map for letter composition -> word
    std::unordered_map<letter_counts, std::vector<std::string>> compositionMap;
    for(auto& str : dictionary)
        compositionMap[letter_counts{str}].push_back(str);

    // get all words that are anagrams of the given one
    auto it = compositionMap.find(letter_counts{wordToCheck});
    // no matches in dictionary
    if(it == compositionMap.end())
        return {};

    // list of all anagrams
    auto result = it->second;

    // remove workToCheck from result if it is present
    result.erase(std::remove_if(result.begin(), result.end(), [&wordToCheck](std::string const& str) { return str == wordToCheck; }), result.end());

    return result;
}

这将在 O(n) 时间内运行并且 space 复杂度为 O(n)，n 是词典中的单词数。

（如果您不将 compositionMap 的构建作为算法的一部分，它将被保护 O(1) 时间）

与基于排列的方法相比，它具有 O(n!) 时间复杂度（或者我喜欢这样称呼它 O(scary)）。

这是一个完整的代码示例，它只处理字母 a-z，但您可以轻松地修改 letter_counts 以使其也适用于其他字符：

godbolt example

#include <string_view>
#include <cctype>
#include <vector>
#include <string>
#include <unordered_map>
#include <iostream>

struct letter_counts {
    static const int num_letters = 26;
    int counts[num_letters];

    explicit letter_counts(std::string_view str) : counts{0} {
        for(char c : str) {
            c = std::tolower(c);
            if(c >= 'a' && c <= 'z')
                counts[c - 'a']++;
        }
    }
};

bool operator==(letter_counts const& lhs, letter_counts const& rhs) {
    for(int i = 0; i < letter_counts::num_letters; i++) {
        if(lhs.counts[i] != rhs.counts[i]) return false;
    }

    return true;
}

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

namespace std {
    template<>
    struct hash<letter_counts> {
        size_t operator()(const letter_counts& letterCounts) const
        {
            size_t result = 0;
            auto hasher = std::hash<int>{};
            for(int i : letterCounts.counts)
                hash_combine(result, hasher(i));

            return result;
        }
    };
}



std::vector<std::string> find_anagrams(std::vector<std::string> const& dictionary, std::string const& wordToCheck) {
    // build a lookup map for letter composition -> word
    std::unordered_map<letter_counts, std::vector<std::string>> compositionMap;
    for(auto& str : dictionary)
        compositionMap[letter_counts{str}].push_back(str);

    // get all words that are anagrams of the given one
    auto it = compositionMap.find(letter_counts{wordToCheck});
    // no matches in dictionary
    if(it == compositionMap.end())
        return {};

    // list of all anagrams
    auto result = it->second;

    // remove workToCheck from result if it is present
    result.erase(std::remove_if(result.begin(), result.end(), [&wordToCheck](std::string const& str) { return str == wordToCheck; }), result.end());

    return result;
}

int main() {
    std::vector<std::string> dict = {
        "banana",
        "nanaba",
        "foobar",
        "bazinga"
    };

    std::string word = "aaannb";

    for(auto& str : find_anagrams(dict, word)) {
        std::cout << str << std::endl;
    }
}

Answer 3

请注意，组合的数量往往会很快变得非常大。如果您按字母顺序对两个单词的字符进行排序，然后排序的字符串匹配，那么这两个单词就是变位词。基于这一事实，我制作了以下示例，将字典放入多重映射中，从而可以快速找到单词的所有字谜。它通过使用按字母顺序排序的输入字符串作为地图的键来实现这一点。

现场演示：https://onlinegdb.com/fXUVZruwq

#include <algorithm>
#include <iostream>
#include <locale>
#include <map>
#include <vector>
#include <set>

// create a class to hold anagram information
class anagram_dictionary_t
{
public:
    
    // create a dictionary based on an input list of words.
    template<typename std::size_t N>
    explicit anagram_dictionary_t(const std::string (&words)[N])
    {
        for (std::string word : words)
        {
            auto key = make_key(word);
            std::string lower{ word };
            to_lower(lower);
            m_anagrams.insert({ key, lower});
        }
    }

    // find all the words that match the anagram
    auto find_words(const std::string& anagram)
    {
        // get the unique key for input word
        // this is done by sorting all the characters in the input word alphabetically
        auto key = make_key(anagram);

        // lookup all the words with the same key in the dictionary
        auto range = m_anagrams.equal_range(key);

        // create a set of found words
        std::set<std::string> words;
        for (auto it = range.first; it != range.second; ++it)
        {
            words.insert(it->second);
        }

        // return the words
        return words;
    }

    // function to check if two words are an anagram
    bool is_anagram(const std::string& anagram, const std::string& word)
    {
        auto words = find_words(anagram);
        return (words.find(word) != words.end());
    }

private:
    // make a unique key out of an input word
    // all anagrams should map to the same key value
    static std::string make_key(const std::string& word)
    {
        std::string key{ word };
        to_lower(key);

        // two words are anagrams if they sort to the same key
        std::sort(key.begin(), key.end());
        return key;
    }

    static void to_lower(std::string& word)
    {
        for (char& c : word)
        {
            c = std::tolower(c, std::locale());
        }
    }

    std::multimap<std::string, std::string> m_anagrams;
};

int main()
{
    anagram_dictionary_t anagram_dictionary{ {"Apple", "Apricot", "Avocado", "Banana", "Bilberry", "Blackberry", "Blueberry" } };
    
    std::string anagram{ "aaannb"};
    auto words = anagram_dictionary.find_words(anagram);
    
    std::cout << "input word = " << anagram << "\n found words : ";
    for (const auto& word : words)
    {
        std::cout << word << "\n";
    }

    return 0;
}

有什么办法可以加快 anagram 求解器的排列创建速度吗？

is there any way to speed up the permuation creation for an anagramsolver?

c++

permutation

anagram

解决方案