如何在 Boost MultiIndex 中找到最常出现的非唯一键?
How to find most commonly occurring non-unique keys in Boost MultiIndex?
Boost MultiIndex 容器,当定义为具有 hashed_non_unique
个键时,可以将等效键组合在一起,并且 return 它们全部针对 equal_range
查询,如前所述 here。但我看不到查询集合中最大范围(或 n 个最大范围)的方法。如果不比较不同哈希的范围大小,这在计算上会变得非常昂贵,有没有办法查询最大的相等范围?
如果我们考虑一个简单的例子,比如this one,我想按频率查询,第一个结果是Tom
,然后是Jack
和Leo
排名不分先后。
看起来你可能会在这里得到一个类似 std::map<K, std::vector<V> >
的界面。
你仍然需要数数。
To have the counting done "magically" you might consider making the "bucket key" a refcounting type.
This would be more magical than I'd be comfortable with for my code-bases. In particular, copied elements could easily cause overcounting.
方法一:BMI + RangeV3 语法糖
警告:我认为这是“高级”,因为学习曲线可能很陡峭。但是,当您轻松使用 Ranges 时,这可以极大地提高工作效率。
另请注意,这绝不保证会提高性能。但您应该注意,没有元素被复制,向量 (groups
) 仅包含子范围,这些子范围是多索引容器中的迭代器范围。
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index_container.hpp>
#include <iostream>
#include <iomanip>
#include <range/v3/all.hpp>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
namespace bmi = boost::multi_index;
namespace vw = ranges::views;
namespace act = ranges::actions;
struct Person {
int m_id;
std::string m_name;
friend std::ostream& operator<<(std::ostream& os, Person const& p) {
return os << "[" << p.m_id << ", " << std::quoted(p.m_name) << "]";
}
};
typedef bmi::multi_index_container<
Person,
bmi::indexed_by<
bmi::ordered_unique<bmi::member<Person, int, &Person::m_id>>,
bmi::ordered_unique<
bmi::tag<struct by_name_id>,
bmi::composite_key<Person,
bmi::member<Person, std::string, &Person::m_name>,
bmi::member<Person, int, &Person::m_id>>
>
> >
Roster;
template <typename Index, typename KeyExtractor>
std::size_t distinct(const Index& i, KeyExtractor key) {
std::size_t res = 0;
for (auto it = i.begin(), it_end = i.end(); it != it_end;) {
++res;
it = i.upper_bound(key(*it));
}
return res;
}
int main() {
Roster const r {
{1, "Tom"},
{2, "Jack"},
{3, "Tom"},
{4, "Leo"}
};
fmt::print("Roster: {}\n", r);
static constexpr auto eq_ = std::equal_to<>{};
static constexpr auto name_ = std::mem_fn(&Person::m_name);
static constexpr auto size_ = [](auto const& r) constexpr { return std::distance(begin(r), end(r)); };
auto& idx = r.get<by_name_id>();
fmt::print("Distinct: {}, Index: {}\n", distinct(idx, name_), idx);
auto by_name_ = vw::group_by([](auto const&... arg) { return eq_(name_(arg)...); });
auto by_size_ = [](auto const&... subrange) { return (size_(subrange) > ...); };
auto groups = idx | by_name_ | ranges::to_vector;
for (auto&& x : groups |= act::sort(by_size_)) {
fmt::print("#{} persons in group {}: {}\n",
size_(x),
name_(ranges::front(x)),
x);
}
}
打印:
Roster: {[1, "Tom"], [2, "Jack"], [3, "Tom"], [4, "Leo"]}
Distinct: 3, Index: {[2, "Jack"], [4, "Leo"], [1, "Tom"], [3, "Tom"]}
#2 persons in group Tom: {[1, "Tom"], [3, "Tom"]}
#1 persons in group Jack: {[2, "Jack"]}
#1 persons in group Leo: {[4, "Leo"]}
请注意,我只是保留了 original link. You could drop it 中的 distinct()
函数以消除一些噪音。
方法 2:相同,但 w/o 提升
Multi-index现在好像只提供ordered container,所以我们简化一下:
#include <set>
#include <iostream>
#include <iomanip>
#include <range/v3/all.hpp>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
namespace vw = ranges::views;
namespace act = ranges::actions;
struct Person {
int m_id;
std::string m_name;
friend std::ostream& operator<<(std::ostream& os, Person const& p) {
return os << "[" << p.m_id << ", " << std::quoted(p.m_name) << "]";
}
bool operator<(Person const& o) const { return m_name < o.m_name; }
};
int main() {
std::multiset<Person> const r {
{1, "Tom"},
{2, "Jack"},
{3, "Tom"},
{4, "Leo"}
};
fmt::print("Roster: {}\n", r);
static constexpr auto eq_ = std::equal_to<>{};
static constexpr auto name_ = std::mem_fn(&Person::m_name);
static constexpr auto size_ = [](auto const& r) constexpr { return std::distance(begin(r), end(r)); };
auto by_name_ = vw::group_by([](auto const&... arg) { return eq_(name_(arg)...); });
auto by_size_ = [](auto const&... subrange) { return (size_(subrange) > ...); };
auto groups = r | by_name_ | ranges::to_vector;
for (auto&& x : groups |= act::sort(by_size_)) {
fmt::print("#{} persons in group {}: {}\n",
size_(x),
name_(ranges::front(x)),
x);
}
}
版画
Roster: {[2, "Jack"], [4, "Leo"], [1, "Tom"], [3, "Tom"]}
#2 persons in group Tom: {[1, "Tom"], [3, "Tom"]}
#1 persons in group Jack: {[2, "Jack"]}
#1 persons in group Leo: {[4, "Leo"]}
Bonus: Slightly more simplified assuming equality operator on Person suffices: https://godbolt.org/z/58xsTK
好的,如果您使用的是非唯一散列索引,结果是 equal_range
确实 而不是 对返回范围内的所有元素调用相等比较(不同于std::unordered_multimap
, BTW) 的常见实现,因此以下内容可能非常有效:
template<typename HashIndex>
std::multimap<
std::size_t,
std::reference_wrapper<const typename HashIndex::value_type>,
std::greater<std::size_t>
> group_sizes(const HashIndex& i)
{
decltype(group_sizes(i)) res;
for(auto it=i.begin(),end=i.end();it!=end;){
auto next=i.equal_range(*it).second;
res.emplace((std::size_t)std::distance(it,next),*it);
it=next;
}
return res;
}
要检查这实际上有多有效,让我们尝试检测元素类型:
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <cstring>
#include <functional>
#include <iostream>
#include <string>
#include <tuple>
#include <map>
template<typename HashIndex>
std::multimap<
std::size_t,
std::reference_wrapper<const typename HashIndex::value_type>,
std::greater<std::size_t>
> group_sizes(const HashIndex& i)
{
decltype(group_sizes(i)) res;
for(auto it=i.begin(),end=i.end();it!=end;){
auto next=i.equal_range(*it).second;
res.emplace((std::size_t)std::distance(it,next),*it);
it=next;
}
return res;
}
struct instrumented_string:std::string
{
using std::string::string;
static void reset_nums()
{
num_hashes=0;
num_eqs=0;
}
static std::size_t num_hashes,num_eqs;
};
std::size_t instrumented_string::num_hashes=0;
std::size_t instrumented_string::num_eqs=0;
bool operator==(const instrumented_string& x,const instrumented_string& y)
{
++instrumented_string::num_eqs;
return static_cast<std::string>(x)==y;
}
std::size_t hash_value(const instrumented_string& x)
{
++instrumented_string::num_hashes;
return boost::hash<std::string>{}(x);
}
using namespace boost::multi_index;
using container=multi_index_container<
instrumented_string,
indexed_by<
hashed_non_unique<identity<instrumented_string>>
>
>;
int main()
{
auto values={"Tom","Jack","Leo","Bjarne","Subhamoy"};
container c;
for(auto& v:values){
for(auto i=100*std::strlen(v);i--;)c.insert(v);
}
instrumented_string::reset_nums();
auto gs=group_sizes(c);
for(const auto& g:gs){
std::cout<<g.first<<": "<<g.second.get()<<"\n";
}
std::cout<<"# hashes: "<<instrumented_string::num_hashes<<"\n";
std::cout<<"# eqs: "<<instrumented_string::num_eqs<<"\n";
}
输出
800: Subhamoy
600: Bjarne
400: Jack
300: Tom
300: Leo
# hashes: 5
# eqs: 5
因此,对于一个包含 2,400 个元素的容器,调用 group_sizes
只进行了 5 次哈希计算和 5 次相等性比较(当然还有约 2,400 次迭代器增量)。
如果你真的想摆脱哈希,可以这样做:
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <cstring>
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <map>
template<typename HashIndex>
struct internal_reference
{
const HashIndex& i;
const typename HashIndex::value_type& r;
std::size_t buc;
};
template<typename HashIndex>
struct internal_reference_equal_to
{
bool operator()(
const typename HashIndex::value_type& x,
const internal_reference<HashIndex>& y)const
{
return
std::addressof(x)==std::addressof(y.r)||
y.i.key_eq()(y.i.key_extractor()(x),y.i.key_extractor()(y.r));
}
bool operator()(
const internal_reference<HashIndex>& x,
const typename HashIndex::value_type& y)const
{
return (*this)(y,x);
}
};
template<typename HashIndex>
struct internal_reference_hash
{
std::size_t operator()(const internal_reference<HashIndex>& x)const
{
return x.buc;
}
};
template<typename HashIndex>
std::multimap<
std::size_t,
std::reference_wrapper<const typename HashIndex::value_type>,
std::greater<std::size_t>
> group_sizes(const HashIndex& i)
{
decltype(group_sizes(i)) res;
for(std::size_t buc=0,buc_count=i.bucket_count();buc<buc_count;++buc){
for(auto it=i.begin(buc),end=i.end(buc);it!=end;){
auto p=i.equal_range(
internal_reference<HashIndex>{i,*it,buc},
internal_reference_hash<HashIndex>{},
internal_reference_equal_to<HashIndex>{});
std::size_t dist=0;
auto next=it;
while(p.first!=p.second){
++p.first;
++dist;
++next;
}
res.emplace(dist,*it);
it=next;
}
}
return res;
}
struct instrumented_string:std::string
{
using std::string::string;
static void reset_nums()
{
num_hashes=0;
num_eqs=0;
}
static std::size_t num_hashes,num_eqs;
};
std::size_t instrumented_string::num_hashes=0;
std::size_t instrumented_string::num_eqs=0;
bool operator==(const instrumented_string& x,const instrumented_string& y)
{
++instrumented_string::num_eqs;
return static_cast<std::string>(x)==y;
}
std::size_t hash_value(const instrumented_string& x)
{
++instrumented_string::num_hashes;
return boost::hash<std::string>{}(x);
}
using namespace boost::multi_index;
using container=multi_index_container<
instrumented_string,
indexed_by<
hashed_non_unique<identity<instrumented_string>>
>
>;
int main()
{
auto values={"Tom","Jack","Leo","Bjarne","Subhamoy"};
container c;
for(auto& v:values){
for(auto i=100*std::strlen(v);i--;)c.insert(v);
}
instrumented_string::reset_nums();
auto gs=group_sizes(c);
for(const auto& g:gs){
std::cout<<g.first<<": "<<g.second.get()<<"\n";
}
std::cout<<"# hashes: "<<instrumented_string::num_hashes<<"\n";
std::cout<<"# eqs: "<<instrumented_string::num_eqs<<"\n";
}
输出
800: Subhamoy
600: Bjarne
400: Jack
300: Tom
300: Leo
# hashes: 0
# eqs: 0
但请记住,此版本的 group_sizes
利用了 未记录的 事实,即哈希值 h
的元素被放置在存储桶 [=20] 中=](或者,换句话说,internal_reference<HashIndex>
哈希在技术上不符合索引哈希函数的 compatible extension)。
Boost MultiIndex 容器,当定义为具有 hashed_non_unique
个键时,可以将等效键组合在一起,并且 return 它们全部针对 equal_range
查询,如前所述 here。但我看不到查询集合中最大范围(或 n 个最大范围)的方法。如果不比较不同哈希的范围大小,这在计算上会变得非常昂贵,有没有办法查询最大的相等范围?
如果我们考虑一个简单的例子,比如this one,我想按频率查询,第一个结果是Tom
,然后是Jack
和Leo
排名不分先后。
看起来你可能会在这里得到一个类似 std::map<K, std::vector<V> >
的界面。
你仍然需要数数。
To have the counting done "magically" you might consider making the "bucket key" a refcounting type.
This would be more magical than I'd be comfortable with for my code-bases. In particular, copied elements could easily cause overcounting.
方法一:BMI + RangeV3 语法糖
警告:我认为这是“高级”,因为学习曲线可能很陡峭。但是,当您轻松使用 Ranges 时,这可以极大地提高工作效率。
另请注意,这绝不保证会提高性能。但您应该注意,没有元素被复制,向量 (groups
) 仅包含子范围,这些子范围是多索引容器中的迭代器范围。
#include <boost/multi_index/composite_key.hpp>
#include <boost/multi_index/member.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index_container.hpp>
#include <iostream>
#include <iomanip>
#include <range/v3/all.hpp>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
namespace bmi = boost::multi_index;
namespace vw = ranges::views;
namespace act = ranges::actions;
struct Person {
int m_id;
std::string m_name;
friend std::ostream& operator<<(std::ostream& os, Person const& p) {
return os << "[" << p.m_id << ", " << std::quoted(p.m_name) << "]";
}
};
typedef bmi::multi_index_container<
Person,
bmi::indexed_by<
bmi::ordered_unique<bmi::member<Person, int, &Person::m_id>>,
bmi::ordered_unique<
bmi::tag<struct by_name_id>,
bmi::composite_key<Person,
bmi::member<Person, std::string, &Person::m_name>,
bmi::member<Person, int, &Person::m_id>>
>
> >
Roster;
template <typename Index, typename KeyExtractor>
std::size_t distinct(const Index& i, KeyExtractor key) {
std::size_t res = 0;
for (auto it = i.begin(), it_end = i.end(); it != it_end;) {
++res;
it = i.upper_bound(key(*it));
}
return res;
}
int main() {
Roster const r {
{1, "Tom"},
{2, "Jack"},
{3, "Tom"},
{4, "Leo"}
};
fmt::print("Roster: {}\n", r);
static constexpr auto eq_ = std::equal_to<>{};
static constexpr auto name_ = std::mem_fn(&Person::m_name);
static constexpr auto size_ = [](auto const& r) constexpr { return std::distance(begin(r), end(r)); };
auto& idx = r.get<by_name_id>();
fmt::print("Distinct: {}, Index: {}\n", distinct(idx, name_), idx);
auto by_name_ = vw::group_by([](auto const&... arg) { return eq_(name_(arg)...); });
auto by_size_ = [](auto const&... subrange) { return (size_(subrange) > ...); };
auto groups = idx | by_name_ | ranges::to_vector;
for (auto&& x : groups |= act::sort(by_size_)) {
fmt::print("#{} persons in group {}: {}\n",
size_(x),
name_(ranges::front(x)),
x);
}
}
打印:
Roster: {[1, "Tom"], [2, "Jack"], [3, "Tom"], [4, "Leo"]}
Distinct: 3, Index: {[2, "Jack"], [4, "Leo"], [1, "Tom"], [3, "Tom"]}
#2 persons in group Tom: {[1, "Tom"], [3, "Tom"]}
#1 persons in group Jack: {[2, "Jack"]}
#1 persons in group Leo: {[4, "Leo"]}
请注意,我只是保留了 original link. You could drop it 中的 distinct()
函数以消除一些噪音。
方法 2:相同,但 w/o 提升
Multi-index现在好像只提供ordered container,所以我们简化一下:
#include <set>
#include <iostream>
#include <iomanip>
#include <range/v3/all.hpp>
#include <fmt/ranges.h>
#include <fmt/ostream.h>
namespace vw = ranges::views;
namespace act = ranges::actions;
struct Person {
int m_id;
std::string m_name;
friend std::ostream& operator<<(std::ostream& os, Person const& p) {
return os << "[" << p.m_id << ", " << std::quoted(p.m_name) << "]";
}
bool operator<(Person const& o) const { return m_name < o.m_name; }
};
int main() {
std::multiset<Person> const r {
{1, "Tom"},
{2, "Jack"},
{3, "Tom"},
{4, "Leo"}
};
fmt::print("Roster: {}\n", r);
static constexpr auto eq_ = std::equal_to<>{};
static constexpr auto name_ = std::mem_fn(&Person::m_name);
static constexpr auto size_ = [](auto const& r) constexpr { return std::distance(begin(r), end(r)); };
auto by_name_ = vw::group_by([](auto const&... arg) { return eq_(name_(arg)...); });
auto by_size_ = [](auto const&... subrange) { return (size_(subrange) > ...); };
auto groups = r | by_name_ | ranges::to_vector;
for (auto&& x : groups |= act::sort(by_size_)) {
fmt::print("#{} persons in group {}: {}\n",
size_(x),
name_(ranges::front(x)),
x);
}
}
版画
Roster: {[2, "Jack"], [4, "Leo"], [1, "Tom"], [3, "Tom"]}
#2 persons in group Tom: {[1, "Tom"], [3, "Tom"]}
#1 persons in group Jack: {[2, "Jack"]}
#1 persons in group Leo: {[4, "Leo"]}
Bonus: Slightly more simplified assuming equality operator on Person suffices: https://godbolt.org/z/58xsTK
好的,如果您使用的是非唯一散列索引,结果是 equal_range
确实 而不是 对返回范围内的所有元素调用相等比较(不同于std::unordered_multimap
, BTW) 的常见实现,因此以下内容可能非常有效:
template<typename HashIndex>
std::multimap<
std::size_t,
std::reference_wrapper<const typename HashIndex::value_type>,
std::greater<std::size_t>
> group_sizes(const HashIndex& i)
{
decltype(group_sizes(i)) res;
for(auto it=i.begin(),end=i.end();it!=end;){
auto next=i.equal_range(*it).second;
res.emplace((std::size_t)std::distance(it,next),*it);
it=next;
}
return res;
}
要检查这实际上有多有效,让我们尝试检测元素类型:
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <cstring>
#include <functional>
#include <iostream>
#include <string>
#include <tuple>
#include <map>
template<typename HashIndex>
std::multimap<
std::size_t,
std::reference_wrapper<const typename HashIndex::value_type>,
std::greater<std::size_t>
> group_sizes(const HashIndex& i)
{
decltype(group_sizes(i)) res;
for(auto it=i.begin(),end=i.end();it!=end;){
auto next=i.equal_range(*it).second;
res.emplace((std::size_t)std::distance(it,next),*it);
it=next;
}
return res;
}
struct instrumented_string:std::string
{
using std::string::string;
static void reset_nums()
{
num_hashes=0;
num_eqs=0;
}
static std::size_t num_hashes,num_eqs;
};
std::size_t instrumented_string::num_hashes=0;
std::size_t instrumented_string::num_eqs=0;
bool operator==(const instrumented_string& x,const instrumented_string& y)
{
++instrumented_string::num_eqs;
return static_cast<std::string>(x)==y;
}
std::size_t hash_value(const instrumented_string& x)
{
++instrumented_string::num_hashes;
return boost::hash<std::string>{}(x);
}
using namespace boost::multi_index;
using container=multi_index_container<
instrumented_string,
indexed_by<
hashed_non_unique<identity<instrumented_string>>
>
>;
int main()
{
auto values={"Tom","Jack","Leo","Bjarne","Subhamoy"};
container c;
for(auto& v:values){
for(auto i=100*std::strlen(v);i--;)c.insert(v);
}
instrumented_string::reset_nums();
auto gs=group_sizes(c);
for(const auto& g:gs){
std::cout<<g.first<<": "<<g.second.get()<<"\n";
}
std::cout<<"# hashes: "<<instrumented_string::num_hashes<<"\n";
std::cout<<"# eqs: "<<instrumented_string::num_eqs<<"\n";
}
输出
800: Subhamoy
600: Bjarne
400: Jack
300: Tom
300: Leo
# hashes: 5
# eqs: 5
因此,对于一个包含 2,400 个元素的容器,调用 group_sizes
只进行了 5 次哈希计算和 5 次相等性比较(当然还有约 2,400 次迭代器增量)。
如果你真的想摆脱哈希,可以这样做:
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <cstring>
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <map>
template<typename HashIndex>
struct internal_reference
{
const HashIndex& i;
const typename HashIndex::value_type& r;
std::size_t buc;
};
template<typename HashIndex>
struct internal_reference_equal_to
{
bool operator()(
const typename HashIndex::value_type& x,
const internal_reference<HashIndex>& y)const
{
return
std::addressof(x)==std::addressof(y.r)||
y.i.key_eq()(y.i.key_extractor()(x),y.i.key_extractor()(y.r));
}
bool operator()(
const internal_reference<HashIndex>& x,
const typename HashIndex::value_type& y)const
{
return (*this)(y,x);
}
};
template<typename HashIndex>
struct internal_reference_hash
{
std::size_t operator()(const internal_reference<HashIndex>& x)const
{
return x.buc;
}
};
template<typename HashIndex>
std::multimap<
std::size_t,
std::reference_wrapper<const typename HashIndex::value_type>,
std::greater<std::size_t>
> group_sizes(const HashIndex& i)
{
decltype(group_sizes(i)) res;
for(std::size_t buc=0,buc_count=i.bucket_count();buc<buc_count;++buc){
for(auto it=i.begin(buc),end=i.end(buc);it!=end;){
auto p=i.equal_range(
internal_reference<HashIndex>{i,*it,buc},
internal_reference_hash<HashIndex>{},
internal_reference_equal_to<HashIndex>{});
std::size_t dist=0;
auto next=it;
while(p.first!=p.second){
++p.first;
++dist;
++next;
}
res.emplace(dist,*it);
it=next;
}
}
return res;
}
struct instrumented_string:std::string
{
using std::string::string;
static void reset_nums()
{
num_hashes=0;
num_eqs=0;
}
static std::size_t num_hashes,num_eqs;
};
std::size_t instrumented_string::num_hashes=0;
std::size_t instrumented_string::num_eqs=0;
bool operator==(const instrumented_string& x,const instrumented_string& y)
{
++instrumented_string::num_eqs;
return static_cast<std::string>(x)==y;
}
std::size_t hash_value(const instrumented_string& x)
{
++instrumented_string::num_hashes;
return boost::hash<std::string>{}(x);
}
using namespace boost::multi_index;
using container=multi_index_container<
instrumented_string,
indexed_by<
hashed_non_unique<identity<instrumented_string>>
>
>;
int main()
{
auto values={"Tom","Jack","Leo","Bjarne","Subhamoy"};
container c;
for(auto& v:values){
for(auto i=100*std::strlen(v);i--;)c.insert(v);
}
instrumented_string::reset_nums();
auto gs=group_sizes(c);
for(const auto& g:gs){
std::cout<<g.first<<": "<<g.second.get()<<"\n";
}
std::cout<<"# hashes: "<<instrumented_string::num_hashes<<"\n";
std::cout<<"# eqs: "<<instrumented_string::num_eqs<<"\n";
}
输出
800: Subhamoy
600: Bjarne
400: Jack
300: Tom
300: Leo
# hashes: 0
# eqs: 0
但请记住,此版本的 group_sizes
利用了 未记录的 事实,即哈希值 h
的元素被放置在存储桶 [=20] 中=](或者,换句话说,internal_reference<HashIndex>
哈希在技术上不符合索引哈希函数的 compatible extension)。