Is it possible in boost::ireplace to treat special characters like basic characters? (e.g. 'ź' as 'z')
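So, I am making a word filter that replaces bad words with an asterisk, but once special characters such as ąężźć are allowed, the number of possible spellings of each word gets very large.
How can I make boost::ireplace_all treat them as the basic characters aezzc?
So that
boost::ireplace_all("żąć", "a", "*");
and boost::ireplace_all("zac", "a", "*");
would result in ż*ć
and z*c
respectively?
Edit / extended example: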
const std::set<std::string> badwords =
{
"<not nice word>",
"<another not nice word>"
};
void FilterBadWords(std::string& s)
{
for (auto &badword : badwords)
boost::ireplace_all(s, badword, "*");
}
int main()
{
std::string a("hello you <not nice word> person");
std::string b("hęlló you <nót Nićę wórd> person");
FilterBadWords(a);
FilterBadWords(b);
//a equals "hello you * person"
//b equals "hęlló you * person"
//or as many * as the length of the replaced string, both are fine
}
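Boost Locale supports primary-strength collation (which ignores accents and case) through ICU.
It turned out to be pretty tricky to get this to work. Basically, with char strings you are out of luck, because Boost String Algorithm knows nothing about code points and simply iterates over (and compares) the input sequences byte by byte (well, char by char, but that wording is a bit confusing here).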
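To make the byte-versus-code-point problem concrete, here is a minimal sketch in plain standard C++ (not part of Boost; `count_code_points` is a name made up for this illustration). It counts code points in a UTF-8 `std::string` by skipping continuation bytes, and it assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by ignoring continuation
// bytes (bit pattern 10xxxxxx). Assumes `s` holds valid UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++n;  // lead byte or plain ASCII
    }
    return n;
}
```

For "żąć" encoded as UTF-8, `size()` reports 6 while this function reports 3, which is why a char-by-char comparison cannot line 'ż' up with 'z'.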
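So the solution lies in converting to UTF-32 strings (with GCC this is possible using std::wstring, because wchar_t is 32 bits there). UTF-16 should usually "work" too, but it still suffers from the traversal pitfall I just outlined, only much more rarely.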
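The reason UTF-16 only "usually" works can be shown with a small sketch (standard C++ only; `encode_utf16` is a hypothetical helper, not a Boost function): code points above U+FFFF take two 16-bit units, so element-by-element traversal splits them.

```cpp
#include <cassert>
#include <string>

// Encodes a single Unicode code point as UTF-16 code units.
// Code points above U+FFFF become a surrogate pair (two units).
std::u16string encode_utf16(char32_t cp) {
    // Precondition: a valid scalar value (in range, not itself a surrogate).
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

All the Polish letters in the question fit in one unit, so the problem only shows up for characters outside the Basic Multilingual Plane.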
Now, I created a quick-and-dirty custom Finder predicate:
template <typename CharT>
struct is_primcoll_equal
{
is_primcoll_equal(const std::locale& Loc=std::locale()) :
m_Loc(Loc), comp(Loc, boost::locale::collator_base::primary) {}
template< typename T1, typename T2 >
bool operator()(const T1& Arg1, const T2& Arg2) const {
// TODO use `do_compare` methods on the collation itself that
// don't construct basic_string<> instances
return 0 == comp(std::basic_string<CharT>(1, Arg1), std::basic_string<CharT>(1, Arg2));
}
private:
std::locale m_Loc;
boost::locale::comparator<CharT> comp;
};
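It is hugely inefficient, because it constructs single-character strings on every call. That is because the do_compare methods are not part of the public API of collator<>. I leave deriving a custom collator<> and using that as an exercise for the reader.
Next, we mimic the replace_all interface by wrapping find_format_all: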
template<typename SequenceT, typename Range1T, typename Range2T>
inline void collate_replace_all(
SequenceT& Input,
const Range1T& Search,
const Range2T& Format,
const std::locale& Loc=std::locale() )
{
::boost::algorithm::find_format_all(
Input,
::boost::algorithm::first_finder(Search, is_primcoll_equal<typename SequenceT::value_type>(Loc)),
::boost::algorithm::const_formatter(Format) );
}
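For readers who want to see the shape of "replace all matches under a custom element predicate" without Boost, here is a rough standard-library sketch. It is not how find_format_all is implemented; `replace_all_if_equal` and `ascii_iequal` are names invented for this example, and the predicate shown only handles ASCII case, not accents.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Example predicate: ASCII-only, case-insensitive character equality.
struct ascii_iequal {
    bool operator()(char a, char b) const {
        return std::tolower(static_cast<unsigned char>(a)) ==
               std::tolower(static_cast<unsigned char>(b));
    }
};

// Replaces every non-overlapping match of `search` in `input` with
// `replacement`; two characters match when eq(a, b) returns true.
template <typename CharT, typename Pred>
void replace_all_if_equal(std::basic_string<CharT>& input,
                          const std::basic_string<CharT>& search,
                          const std::basic_string<CharT>& replacement,
                          Pred eq) {
    if (search.empty()) return;  // an empty pattern would match everywhere
    using diff_t = typename std::basic_string<CharT>::difference_type;
    std::basic_string<CharT> result;
    auto pos = input.cbegin();
    for (;;) {
        auto hit = std::search(pos, input.cend(),
                               search.cbegin(), search.cend(), eq);
        result.append(pos, hit);             // text before the match (or the tail)
        if (hit == input.cend()) break;      // no further match
        result.append(replacement);
        pos = hit + static_cast<diff_t>(search.size());  // resume after the match
    }
    input.swap(result);
}
```

Swapping `ascii_iequal` for a collation-aware predicate on a wide string gives roughly the behaviour of the Boost-based wrapper above.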
Now we only need the string widening conversions and we are good to go:
void FilterBadWords(std::string& s) {
using namespace boost::locale::conv;
std::wstring widened = utf_to_utf<wchar_t>(s, stop);
for (auto& badword : badwords) {
detail::collate_replace_all(widened, badword, L"*"/*, loc*/);
}
s = utf_to_utf<char>(widened, stop);
}
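If Boost.Locale's conversion helpers are not available, the widening step can be approximated by hand. The following is only a sketch with a made-up name (`utf8_to_utf32`); like the `stop` policy used above it throws on malformed input, but it does not reject overlong encodings or surrogate code points, so treat it as illustrative rather than a validated decoder.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-32, one char32_t per code point.
// Throws std::invalid_argument on bad lead/continuation bytes or truncation.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append six payload bits
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

After decoding, every element is a whole code point, which is what the per-character predicate needs.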
Full program
#include <boost/algorithm/string/replace.hpp>
#include <boost/locale.hpp>
#include <cstdint> // uint32_t, used in the static_assert below
#include <iostream>
#include <locale>
#include <set>
#include <string>
const std::set<std::string> badwords =
{
"<not nice word>",
"<another not nice word>"
};
namespace detail {
template <typename CharT>
struct is_primcoll_equal
{
is_primcoll_equal(const std::locale& Loc=std::locale()) :
m_Loc(Loc), comp(Loc, boost::locale::collator_base::primary) {}
template< typename T1, typename T2 >
bool operator()(const T1& Arg1, const T2& Arg2) const {
// assert(0 == comp(L"<not nice word>", L"<nót Nićę wórd>"));
// TODO use `do_compare` methods on the collation itself that
// don't construct basic_string<> instances
return 0 == comp(std::basic_string<CharT>(1, Arg1), std::basic_string<CharT>(1, Arg2));
}
private:
std::locale m_Loc;
boost::locale::comparator<CharT> comp;
};
template<typename SequenceT, typename Range1T, typename Range2T>
inline void collate_replace_all(
SequenceT& Input,
const Range1T& Search,
const Range2T& Format,
const std::locale& Loc=std::locale() )
{
::boost::algorithm::find_format_all(
Input,
::boost::algorithm::first_finder(Search, is_primcoll_equal<typename SequenceT::value_type>(Loc)),
::boost::algorithm::const_formatter(Format) );
}
}
void FilterBadWords(std::string& s) {
using namespace boost::locale::conv;
std::wstring widened = utf_to_utf<wchar_t>(s, stop);
for (auto& badword : badwords) {
detail::collate_replace_all(widened, badword, L"*"/*, loc*/);
}
s = utf_to_utf<char>(widened, stop);
}
static_assert(sizeof(wchar_t) == sizeof(uint32_t), "Required for robustness (surrogate pairs, anyone?)");
int main()
{
auto loc = boost::locale::generator().generate("");
std::locale::global(loc);
std::string a("hello you <not nice word> person");
std::string b("hęlló you <nót Nićę wórd> person");
FilterBadWords(a);
FilterBadWords(b);
std::cout << a << "\n";
std::cout << b << "\n";
}
Output
On my system:
hello you * person
hęlló you * person
¹ Apparently the locale support in the Coliru execution environment is incomplete.
As an additional solution that uses less Boost (well, you could edit it to remove Boost completely...):
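#include <boost/algorithm/string/case_conv.hpp> // boost::to_lower_copy
#include <cstddef>
#include <map>
#include <string>
#include <vector>
// Entries must be lowercase base letters: the text is folded to that form before searching.
const std::vector<std::string> badwords =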
{
"badword1",
"badword2",
"badword3",
"badword4"
};
// One slot per possible byte value (0x00..0xFF).
char PolishReplacement[0x100];
// Keys only match when each letter is a single byte in the execution charset
// (e.g. Windows-1250); in UTF-8 they are two bytes long and are never found.
const std::map<std::string, std::string> PolishReplacementMap =
{
    { "ł","l" },
    { "ą","a" },
    { "ę","e" },
    { "ć","c" },
    { "ż","z" },
    { "ź","z" },
    { "ó","o" },
    { "ś","s" },
    { "ń","n" },
    { "Ł","L" },
    { "Ą","A" },
    { "Ę","E" },
    { "Ć","C" },
    { "Ż","Z" },
    { "Ź","Z" },
    { "Ó","O" },
    { "Ś","S" },
    { "Ń","N" }
};
//preconstruct our array, we love speed gain by paying startup time
struct CPolishReplacementInitHack
{
    CPolishReplacementInitHack()
    {
        // int counter: an unsigned char could never reach 0x100 to end the loop
        for (int c = 0; c < 0x100; ++c)
        {
            char tmpstr[2] = { static_cast<char>(c), 0 };
            std::string tmpstdstr(tmpstr);
            auto replacement = PolishReplacementMap.find(tmpstdstr);
            if (replacement == PolishReplacementMap.end())
                PolishReplacement[c] = boost::to_lower_copy(tmpstdstr)[0];
            else
                PolishReplacement[c] = boost::to_lower_copy(replacement->second)[0];
        }
    }
} g_PolishReplacementInit;
//actual filtering
void FilterBadWords(std::string& s)
{
std::string sc(s);
for (auto& character : sc)
character = PolishReplacement[(unsigned char)character];
for (auto &badword : badwords)
{
size_t pos = sc.find(badword);
size_t size = badword.size();
size_t possize;
while (pos != std::string::npos)
{
possize = pos + size;
s.replace ( s.begin() + pos, s.begin() + possize, "*");
sc.replace(sc.begin() + pos, sc.begin() + possize, "*");
pos = sc.find(badword);
}
}
}
This is probably not portable (Windows + locale + encoding dependent?), but it is very fast (200 ms for 25000 random sentences, i7, debug build, no optimizations).
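Since the table approach above depends on a single-byte encoding, here is a sketch of how the same folding idea might look for UTF-8 input. This is a different technique from the code above, not a drop-in replacement: `fold_code_point` and `filter_bad_words` are hypothetical names, only the Polish letters from the table are handled, the input is assumed to be valid UTF-8, and the bad words are assumed to be lowercase ASCII.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Maps the UTF-8 bytes of one code point to a lowercase ASCII base letter.
// Unknown non-ASCII code points map to '\x01' so they never match a word.
char fold_code_point(const std::string& bytes) {
    if (bytes.size() == 1)
        return static_cast<char>(std::tolower(static_cast<unsigned char>(bytes[0])));
    static const struct { const char* utf8; char base; } table[] = {
        {"\xC4\x85", 'a'}, {"\xC4\x84", 'a'},  // ą Ą
        {"\xC4\x87", 'c'}, {"\xC4\x86", 'c'},  // ć Ć
        {"\xC4\x99", 'e'}, {"\xC4\x98", 'e'},  // ę Ę
        {"\xC5\x82", 'l'}, {"\xC5\x81", 'l'},  // ł Ł
        {"\xC5\x84", 'n'}, {"\xC5\x83", 'n'},  // ń Ń
        {"\xC3\xB3", 'o'}, {"\xC3\x93", 'o'},  // ó Ó
        {"\xC5\x9B", 's'}, {"\xC5\x9A", 's'},  // ś Ś
        {"\xC5\xBA", 'z'}, {"\xC5\xB9", 'z'},  // ź Ź
        {"\xC5\xBC", 'z'}, {"\xC5\xBB", 'z'},  // ż Ż
    };
    for (const auto& entry : table)
        if (bytes == entry.utf8) return entry.base;
    return '\x01';
}

// Returns a copy of `s` with each bad word replaced by "*". Matching is done
// on the folded text; the untouched parts keep their original UTF-8 bytes.
std::string filter_bad_words(const std::string& s,
                             const std::vector<std::string>& words) {
    std::string folded;              // one char per code point
    std::vector<std::size_t> start;  // byte offset of each code point in s
    for (std::size_t i = 0; i < s.size();) {
        std::size_t len = 1;
        while (i + len < s.size() &&
               (static_cast<unsigned char>(s[i + len]) & 0xC0) == 0x80)
            ++len;                   // absorb continuation bytes
        start.push_back(i);
        folded.push_back(fold_code_point(s.substr(i, len)));
        i += len;
    }
    start.push_back(s.size());       // sentinel: end of the last code point

    std::string out;
    std::size_t i = 0;               // index into folded / start
    while (i < folded.size()) {
        bool matched = false;
        for (const auto& w : words) {
            if (!w.empty() && folded.compare(i, w.size(), w) == 0) {
                out += '*';
                i += w.size();
                matched = true;
                break;
            }
        }
        if (!matched) {
            out.append(s, start[i], start[i + 1] - start[i]);  // copy original bytes
            ++i;
        }
    }
    return out;
}
```

Keeping a per-code-point offset table is what lets the search run on the folded text while the replacement is applied to the original bytes. I have not benchmarked this against the byte-table version.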