How can I easily and quickly store a big word database?

Question

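I am currently developing a spell checker in C++ for a school project. For the part that checks whether a word exists, I currently do the following:

1) I found a .txt file online that contains all existing English words.
2) My script first goes through this text file and places each of its entries in a map object, for easy access.

The problem with this approach is that step 2) takes about 20 seconds when the program starts. That is not a big deal in itself, but I was wondering whether any of you had another approach that would make my word database available quickly. For instance, is there a way to store the map object in a file, so that I don't need to build it from the text file every time?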

Answer 1

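If your file of English words is not dynamic, you can simply store it in a static map. To do so, you need to parse your .txt file, which looks something like: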

alpha
beta
gamma
...

and convert it into something like:

static std::map<std::string,int> wordDictionary = {
  { "alpha", 0 },
  { "beta", 0 },
  { "gamma", 0 },
  ...
};

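You can do this programmatically, or simply with find and replace in your favorite text editor.

Your .exe will be much heavier than before, but it will also start much faster than when it reads this information from a file.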
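If you go the programmatic route, the conversion could look roughly like the sketch below. It reads one word per line and returns the source text of the static map definition. The helper names escape_literal and make_static_map_source are made up for this illustration, and the sketch assumes the word list has one entry per line with no surrounding whitespace.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Escape the two characters that would break a C++ string literal.
std::string escape_literal(const std::string& word)
{
  std::string out;
  for (char c : word) {
    if (c == '"' || c == '\\') { out += '\\'; }
    out += c;
  }
  return out;
}

// Read one word per line and return the source text of a static map
// definition in the same shape as the hand-written example.
std::string make_static_map_source(std::istream& words)
{
  std::ostringstream src;
  src << "#include <map>\n#include <string>\n\n"
      << "static std::map<std::string,int> wordDictionary = {\n";

  std::string line;
  while (std::getline(words, line)) {
    // Tolerate Windows line endings and skip blank lines.
    if (!line.empty() && line.back() == '\r') { line.pop_back(); }
    if (line.empty()) { continue; }
    src << "  { \"" << escape_literal(line) << "\", 0 },\n";
  }

  src << "};\n";
  return src.str();
}
```

To use it, pass a std::ifstream opened on the word list and write the returned string to a header that the spell checker includes. A trailing comma after the last entry is valid in a braced initializer list, so the generator does not need to special-case the final word.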

Answer 2

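I am a bit surprised that no one has suggested serialization yet. Boost provides great support for such a solution. If I understood correctly, the problem is that reading in your word list (and putting it into a data structure that hopefully provides fast lookups) takes too long every time you start your application. Building such a structure once and then saving it to a binary file for later reuse should improve the performance of your application (based on the results shown below).

Here is a piece of code (which is also a minimal working example) that might help you with this.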

#include <chrono>
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/set.hpp> 

#include "prettyprint.hpp"

class Dictionary {
public:
  Dictionary() = default;
  Dictionary(std::string const& file_)
    : _file(file_)
  {}

  inline size_t size() const { return _words.size(); }

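  // Read one word per line from _file into the set.
  void build_wordset()
  {
    if (_file.empty()) { throw std::runtime_error("No file to read!"); }

    std::ifstream infile(_file);
    if (!infile) { throw std::runtime_error("Cannot open " + _file); }

    std::string line;
    while (std::getline(infile, line)) {
      _words.insert(line);
    }
  }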

  friend std::ostream& operator<<(std::ostream& os, Dictionary const& d)
  {
    os << d._words;  // cxx-prettyprint used here
    return os;
  }

  int save(std::string const& out_file) 
  { 
    std::ofstream ofs(out_file.c_str(), std::ios::binary);
    if (ofs.fail()) { return -1; }

    boost::archive::binary_oarchive oa(ofs); 
    oa << _words;
    return 0;
  }

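  // Restore the word set from a binary archive written by save().
  int load(std::string const& in_file)
  {
    _words.clear();

    // Open in binary mode to match save(); text mode can corrupt the
    // archive on platforms that translate line endings.
    std::ifstream ifs(in_file, std::ios::binary);
    if (ifs.fail()) { return -1; }

    boost::archive::binary_iarchive ia(ifs);
    ia >> _words;
    return 0;
  }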

private:
  friend class boost::serialization::access;

  template <typename Archive>
  void serialize(Archive& ar, const unsigned int version)
  {
    ar & _words;
  }

private:
  std::string           _file;
  std::set<std::string> _words;
};

void create_new_dict()
{
  std::string const in_file("words.txt");
  std::string const ser_dict("words.set");

  Dictionary d(in_file);

  auto start = std::chrono::steady_clock::now();  // monotonic clock for timing
  d.build_wordset();
  auto end = std::chrono::steady_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Building up the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;

  d.save(ser_dict);
}

void use_existing_dict()
{
  std::string const ser_dict("words.set");

  Dictionary d;

  auto start = std::chrono::steady_clock::now();  // monotonic clock for timing
  d.load(ser_dict);
  auto end = std::chrono::steady_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Loading in the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;
}

int main()
{
  create_new_dict();
  use_existing_dict();
  return 0;
}

Sorry for not putting the code into separate files and for the poor design; for demonstration purposes, though, it should be enough.

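Note that I did not use a map: I just don't see the point of storing a lot of zeros or anything else unnecessary. AFAIK, a std::set is backed by the same powerful RB-tree as a std::map.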

For the data set available here (it contains around 466k words), I got the following results:

Building up the dictionary took: 810 (ms)
Size of the dictionary: 466544
Loading in the dictionary took: 271 (ms)
Size of the dictionary: 466544

Dependencies:

Boost's Serialization component (I used version 1.58, though)
louisdx/cxx-prettyprint

Hope this helps. :) Cheers.

Answer 3

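First things first. Do not use a map (or a set) to store a word list. Use a vector of strings, make sure its contents are sorted (I would expect your word list to be sorted already), and then use std::binary_search from the <algorithm> header to check whether a word is in the dictionary.

Although this may still be highly suboptimal (depending on whether your standard library implements the small string optimization), your load time should improve by at least an order of magnitude. Run a benchmark, and if you want to make it faster still, post another question about the vector of strings.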

Tags: c++, database, dictionary, words