Boost locale normalize strips characters but not accents
I'm trying to use the Boost.Locale library to remove the accents from a string. The normalize function strips out the whole accented character, but I only want to remove the accent, e.g. è -> e.
Here is my code:
std::string hello(u8"élève");
boost::locale::generator gen;
std::string str = boost::locale::normalize(hello,boost::locale::norm_nfd,gen(""));
Expected output: eleve
Actual output: lve
Please help.
Normalization alone doesn't do that. With NFD it performs "canonical decomposition". You then need to remove the combining-character code points yourself.
UPDATE Adding a loose implementation; gathering from the UTF-8 tables, it seems most combining characters start with a 0xcc or 0xcd lead byte:
// also liable to strip some Greek characters that lead with 0xcd
template <typename Str>
static Str try_strip_diacritics(
    Str const& input,
    std::locale const& loc = std::locale())
{
    using Ch = typename Str::value_type;
    using T  = boost::locale::utf::utf_traits<Ch>;

    auto tmp = boost::locale::normalize(
        input, boost::locale::norm_nfd, loc);

    auto f = tmp.begin(), l = tmp.end(), out = f;
    while (f != l) {
        switch (*f) {
        case '\xcc':
        case '\xcd': // TODO find more
            T::decode(f, l); // consume the combining mark, emit nothing
            break;           // skip
        default:
            out = T::encode(T::decode(f, l), out); // copy the code point
            break;
        }
    }
    tmp.erase(out, l); // trim the tail left over after compaction
    return tmp;
}
Prints (on my box!):
Before: "élève" 0xc3 0xa9 0x6c 0xc3 0xa8 0x76 0x65
all-in-one: "eleve" 0x65 0x6c 0x65 0x76 0x65
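For completeness, the output above came from a small driver along these lines (a sketch only; it reuses the dump() helper and the includes from the earlier answer below, and assumes the environment provides a UTF-8 locale):

int main() {
    boost::locale::generator gen;
    std::string const pupil(u8"élève");

    // NFD-normalize and drop the combining marks in one call
    std::string const stripped = try_strip_diacritics(pupil, gen(""));

    std::cout << "Before: \""     << pupil    << "\""; dump(pupil);
    std::cout << "all-in-one: \"" << stripped << "\""; dump(stripped);
}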
Earlier answer text/analysis:
#include <boost/locale.hpp>
#include <cstdint>
#include <iomanip>
#include <iostream>

static void dump(std::string const& s) {
    std::cout << std::hex << std::showbase << std::setfill('0');
    for (uint8_t ch : s)
        std::cout << " " << std::setw(4) << int(ch);
    std::cout << std::endl;
}

int main() {
    boost::locale::generator gen;
    std::string const pupil(u8"élève");

    std::string const str =
        boost::locale::normalize(
            pupil,
            boost::locale::norm_nfd,
            gen(""));

    std::cout << "Before: "; dump(pupil);
    std::cout << "After: ";  dump(str);
}
Prints, on my box:
Before: 0xc3 0xa9 0x6c 0xc3 0xa8 0x76 0x65
After: 0x65 0xcc 0x81 0x6c 0x65 0xcc 0x80 0x76 0x65
However, on Coliru it makes no difference. This indicates that it depends on the available/system locales.
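If you suspect that on your own setup, one thing worth checking is which backend Boost.Locale was built with: as far as I know only the ICU backend actually performs Unicode normalization, and with the std/posix backends the string can come back unchanged. A quick diagnostic sketch (assuming the localization_backend_manager API is available in your Boost version):

#include <boost/locale.hpp>
#include <iostream>

int main() {
    // List the backends this Boost.Locale build knows about;
    // "icu" should be among them for normalize() to do real work.
    auto mgr = boost::locale::localization_backend_manager::global();
    for (auto const& name : mgr.get_all_backends())
        std::cout << name << "\n";
}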
Unicode normalization is the process of converting strings to a standard form, suitable for text processing and comparison. For example, character "ü" can be represented by a single code point or a combination of the character "u" and the diaeresis "¨". Normalization is an important part of Unicode text processing.
Unicode defines four normalization forms. Each specific form is selected by a flag passed to normalize function:
- NFD - Canonical decomposition - boost::locale::norm_nfd
- NFC - Canonical decomposition followed by canonical composition - boost::locale::norm_nfc or boost::locale::norm_default
- NFKD - Compatibility decomposition - boost::locale::norm_nfkd
- NFKC - Compatibility decomposition followed by canonical composition - boost::locale::norm_nfkc
For more details on normalization forms, read [this article][1].
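To make the composed/decomposed point concrete, a minimal sketch (my own illustration, assuming an ICU-backed UTF-8 locale): the precomposed "ü" and the sequence "u" + combining diaeresis are different byte strings, but compare equal after NFC normalization.

#include <boost/locale.hpp>
#include <cassert>
#include <string>

int main() {
    boost::locale::generator gen;
    std::locale loc = gen(""); // assumes an ICU-backed UTF-8 locale

    std::string const composed   = u8"\u00FC";  // "ü" as one code point
    std::string const decomposed = u8"u\u0308"; // "u" + combining diaeresis

    assert(composed != decomposed); // different byte sequences
    assert(boost::locale::normalize(composed,   boost::locale::norm_nfc, loc)
        == boost::locale::normalize(decomposed, boost::locale::norm_nfc, loc));
}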
What you can do
It looks like you can get some of the way there by doing just the decomposition (hence NFD) and then removing any non-alpha code points.
This is cheating, because it assumes all code points are single units, which is generally not true, but it does work for the sample:
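A byte-level sketch of that cheat (the helper name is my own; it keeps only ASCII-alphabetic bytes after NFD, so it also throws away digits, spaces, punctuation and any genuine multi-byte character):

#include <boost/locale.hpp>
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: decompose, then drop every byte that is not
// ASCII-alphabetic. Good enough for "élève" -> "eleve", not much more.
static std::string strip_accents_naive(std::string const& input,
                                       std::locale const& loc) {
    std::string s = boost::locale::normalize(input, boost::locale::norm_nfd, loc);
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](unsigned char ch) { return !std::isalpha(ch); }),
            s.end());
    return s;
}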
See the improved version above, which iterates over code points rather than bytes.