How to save memory when many unordered_map<string, double>s have exactly the same string set as keys

Question

I'm doing classification using features.

Each feature group is an unordered_map<string, double>. The string is the feature name and the double is the feature value.

class FeatureGroup {
  private:
    unordered_map<string, double> features_ = unordered_map<string, double>{
        { "c_n_a", 0 },
        { "c_n_b", 0 },
        { "l_1_a_1mm", 0 },
        { "l_2_a_1mm", 0 },
        { "l_3_a_1mm", 0 },
        ...
    };
};

Each instance has one feature group, and I have a lot of instances (say 8,000,000).

My question is: I want to save memory without much effort. As you can see, I'm already using short feature names.

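Since the feature names are the same for every instance in an experiment, I don't want feature name strings such as "c_n_a" and "c_n_b" to be stored 8,000,000 times.

I searched around (e.g. using char* as the key type, or std::reference_wrapper<>), but I'm still confused. So please help: what should I do to avoid storing the feature names 8,000,000 times and save memory?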

PS:

I read about flyweight and found nothing suggesting it shouldn't work. However, after I changed my code as follows, my program became very slow.

using flyweight_string = boost::flyweight<std::string>;

class FeatureGroup {
  private:
    unordered_map<flyweight_string, double> features_ = unordered_map<flyweight_string, double>{
        { flyweight_string("c_n_a"), 0 },
        { flyweight_string("c_n_b"), 0 },
        { flyweight_string("l_1_a_1mm"), 0 },
        { flyweight_string("l_2_a_1mm"), 0 },
        { flyweight_string("l_3_a_1mm"), 0 },
        { flyweight_string("l_1_b_1mm"), 0 },
        ...
    };
};

When setting and getting features, I use the following form:

features_[flyweight_string(feature_name)] // feature_name is of string type

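When setting a feature value, I also use the following statement to check whether the feature name is defined. If it is not, the program calls exit(1).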

if(features_.find(flyweight_string(feature_name)) != features_.end())

My program structure is as follows. I hope someone can find the reason for the slowdown after switching to boost::flyweight.

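In my program, each Instance (a class) has an ID, a FeatureGroup, and a class label. I have another class named InstanceManager, which maintains a container of Instances (i.e. unordered_set<Instance>). I compute each feature for all instances, e.g. compute "c_n_a" for all instances at once, and then update the corresponding feature value stored in the container. After all feature values have been computed, I get the feature values of each instance and use a trained model to predict the class label.

Setting and getting the instances' feature values is parallelized over the instance container with OpenMP.

In Windows Performance Monitor, before the change to boost::flyweight<std::string>, utilization of all CPU cores was almost 100%. After switching to flyweight, CPU utilization dropped to 6~7%. As a result, my program became very slow.

I don't know why the parallelization stopped working properly after changing from string to flyweight_string. And how can I fix it?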

Answer 1

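You can create an intermediate lookup that converts your string keys into numbers, and then store those numbers as the keys.

The lookup function can keep a vector in which the index of a string key is the generated numeric key. If the string key is not in the vector yet, it is appended to the end and its index is returned. The problem with this approach is that each lookup takes O(n). Alternatively, you can store the numbers in a map keyed by the string key.

Vector approach: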

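int StringKeyToNumber(vector<string>& lookup, const string& strKey) {
    // Linear search: the position in the vector is the numeric key.
    auto it = find(begin(lookup), end(lookup), strKey);
    if (it != end(lookup)) {
        return static_cast<int>(distance(begin(lookup), it));
    }
    // Unknown key: append it and return its new index.
    lookup.push_back(strKey);
    return static_cast<int>(lookup.size() - 1);
}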

Map approach:

int StringKeyToNumber(map<string, int>& lookup, const string& strKey) {
    auto it = lookup.find(strKey);
    if (it != end(lookup)) {
        return it->second;
    }
    int newIndex = lookup.size();
    lookup[strKey] = newIndex;
    return newIndex;
}

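I'm not convinced that using char* as the key type would be a good solution, even though it would reduce the memory requirements. It is easy to end up with two strings that have the same content but live at different memory locations.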

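Effectively, you want a value that you can assert represents exactly one string, so that you only need to store that value, like a hash. The solutions above give you that guarantee (at least for the first 2147483647 strings) :)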
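
As a rough sketch of how the map approach could be wired into the question's FeatureGroup: all instances share one name-to-id table, and each instance stores only int keys. The names FeatureIds, idOf, and nameOf, and the set/get accessors, are illustrative assumptions and not part of the question's code.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical shared table: each distinct feature name is stored once.
class FeatureIds {
  public:
    // Returns the id for name, assigning the next free id on first sight.
    int idOf(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) {
            return it->second;
        }
        int newId = static_cast<int>(names_.size());
        ids_.emplace(name, newId);
        names_.push_back(name);
        return newId;
    }

    // Reverse lookup, e.g. for printing; id must have come from idOf.
    const std::string& nameOf(int id) const {
        return names_.at(static_cast<std::size_t>(id));
    }

  private:
    std::unordered_map<std::string, int> ids_;
    std::vector<std::string> names_;
};

// Each instance keys its values by int instead of by string.
class FeatureGroup {
  public:
    void set(int id, double value) { features_[id] = value; }
    double get(int id) const { return features_.at(id); }  // throws if id is unset

  private:
    std::unordered_map<int, double> features_;
};
```

Note that idOf modifies the shared table, so in an OpenMP setting like the one in the question it would be safer to resolve all ids once, before the parallel region, and pass only the ints to the worker threads.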

Answer 2

It seems you can afford to hard-code the feature names into the source code. If so, you shouldn't use strings at all; use an enum instead:

enum class FeatureName { c_n_a, c_n_b, l_1_a_1mm, l_2_a_1mm, l_3_a_1mm, ... };

class FeatureGroup {
  private:
    std::unordered_map<FeatureName, double> features_ =
        std::unordered_map<FeatureName, double>{
            { FeatureName::c_n_a, 0 },
            { FeatureName::c_n_b, 0 },
            { FeatureName::l_1_a_1mm, 0 },
            { FeatureName::l_2_a_1mm, 0 },
            { FeatureName::l_3_a_1mm, 0 },
            ...
        };
};

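You will probably need functions to convert between FeatureName and strings. There are plenty of examples of how to do that. Note that the length of the enumerator names has no effect on the program's memory consumption, so you can make them as long as you like for readability.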
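
One possible shape for those conversion functions is sketched below. It assumes a shortened enumerator list and adds a `count` sentinel that is not in the answer's enum; toString, fromString, and kFeatureNames are hypothetical names chosen for this sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Shortened, hypothetical enumerator list; `count` is a sentinel added for this sketch.
enum class FeatureName { c_n_a, c_n_b, l_1_a_1mm, count };

// One name per enumerator, in declaration order; stored once for the whole program.
const char* const kFeatureNames[] = {"c_n_a", "c_n_b", "l_1_a_1mm"};
static_assert(sizeof(kFeatureNames) / sizeof(kFeatureNames[0]) ==
                  static_cast<std::size_t>(FeatureName::count),
              "one name per enumerator");

// f must be a real enumerator, not FeatureName::count.
const char* toString(FeatureName f) {
    return kFeatureNames[static_cast<std::size_t>(f)];
}

// Linear scan over the name table; throws for an undefined feature name.
FeatureName fromString(const std::string& s) {
    for (std::size_t i = 0; i < static_cast<std::size_t>(FeatureName::count); ++i) {
        if (s == kFeatureNames[i]) {
            return static_cast<FeatureName>(i);
        }
    }
    throw std::invalid_argument("unknown feature name: " + s);
}
```

The exception in fromString plays the role of the question's "exit(1) if the feature name is not defined" check; the caller can decide whether to terminate.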

Answer 3

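Edit

The original answer is kept at the bottom; since the question was updated, I have revised my answer thoroughly. You can change your code to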

class FeatureGroup {
  private:
    enum {
        c_n_a = 0,
        c_n_b,
        ...
        num_features
    };
    std::vector<double> features_;
};

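You should initialize features_ with features_(num_features), e.g. in the constructor's initializer list. To access the feature corresponding to c_n_b, for example, just use features_[c_n_b].

This is already about as efficient as you can get. In fact, you don't even need to try to shorten the feature names.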
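
A minimal complete version of this idea might look like the following. The feature list is shortened, and the public set/get accessors are assumptions added so the sketch can be exercised; they are not part of the answer's code.

```cpp
#include <vector>

class FeatureGroup {
  public:
    // Plain (unscoped) enum so enumerators convert implicitly to an index.
    // Shortened, hypothetical list; num_features must stay last.
    enum Feature { c_n_a = 0, c_n_b, l_1_a_1mm, num_features };

    // One slot per feature, all initialized to 0.
    FeatureGroup() : features_(num_features, 0.0) {}

    void set(Feature f, double value) { features_[f] = value; }
    double get(Feature f) const { return features_[f]; }

  private:
    std::vector<double> features_;  // no strings stored per instance
};
```

With this layout, an access such as g.get(FeatureGroup::c_n_b) is a plain array index with no hashing and no string comparison. Since num_features is a compile-time constant, a std::array<double, num_features> would probably also work and would avoid the separate heap allocation per instance.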

The flyweight design pattern is described as:

In computer programming, flyweight is a software design pattern. A flyweight is an object that minimizes memory use by sharing as much data as possible with other similar objects; it is a way to use objects in large numbers when a simple repeated representation would use an unacceptable amount of memory.

It seems easy to use boost::flyweight here:

#include <iostream>
#include <string>
#include <unordered_map>

#include <boost/flyweight.hpp>

using fly_str = boost::flyweight<std::string>;

int main()
{
    // Equal strings share one underlying std::string object.
    std::unordered_map<fly_str, int> m;
    m[fly_str("hello")] = 2;
}

Answer 4

Given that the title says 'exactly the same string set as keys', you can first create a single map:

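// Positions are 0-based so they can be used directly as vector indices.
map<string, int> myKeyToPositionMap = {
    { "c_n_a", 0 }, { "c_n_b", 1 }, { "l_1_a_1mm", 2 }, { "l_2_a_1mm", 3 }, { "l_3_a_1mm", 4 } };

and replace the map in FeatureGroup with a vector:

class FeatureGroup {
  private:
    // One value per key, in map-position order (example values).
    vector<double> features_ = { 0.2, 0.1, 0.3, 0.5, 0.0 };
};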

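This way you have only one map, from which you get the position of the corresponding value in the vector. Say you want to get the value for c_n_b: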

// at() returns the mapped position (find() would return an iterator, not an int).
int keyForCNB = myKeyToPositionMap.at("c_n_b");
double valueForCNB = featuresGroupInstance->getFeaturesVector().at(keyForCNB);
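
Put together, the single-map-plus-vector idea could be sketched as below. The keyToPosition helper and the name-based get/set accessors are assumptions made for illustration, and the feature list is limited to the five names shown above.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Single shared map from feature name to 0-based position (hypothetical helper).
const std::map<std::string, std::size_t>& keyToPosition() {
    static const std::map<std::string, std::size_t> positions = {
        {"c_n_a", 0}, {"c_n_b", 1}, {"l_1_a_1mm", 2}, {"l_2_a_1mm", 3}, {"l_3_a_1mm", 4}};
    return positions;
}

class FeatureGroup {
  public:
    // One slot per known feature name, all initialized to 0.
    FeatureGroup() : features_(keyToPosition().size(), 0.0) {}

    // Both accessors throw std::out_of_range for an undefined feature name.
    double get(const std::string& name) const {
        return features_.at(keyToPosition().at(name));
    }
    void set(const std::string& name, double value) {
        features_.at(keyToPosition().at(name)) = value;
    }

  private:
    std::vector<double> features_;  // values only; names live in the shared map
};
```

Because the shared map is only read after it has been initialized, concurrent lookups from several threads should not need a lock, which matters for the OpenMP loop described in the question.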


c++

memory

string

unordered-map