How to handle multiple locales for ifstream, cout, etc. in C++
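I am trying to read and process multiple files that use different encodings. I am supposed to use only the STL for this.
Suppose we have iso-8859-15 and UTF-8 files.
One answer puts it this way: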
In a nutshell the more interesting part for you:
- std::stream (stringstream, fstream, cin, cout) has an inner locale-object, which matches the value of the global C++ locale at the moment of the creation of the stream object. As std::cin is created long before your code in main is called, it has most probably the classical C locale, no matter what you do afterwards.
- You can make sure that a std::stream object has the desirable locale by invoking std::stream::imbue(std::locale(your_favorite_locale)).
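The problem is that, of the two types, only the files matching the locale that was created first are processed correctly. For example, if locale_DE_ISO885915 precedes locale_DE_UTF8, then the files in UTF-8 are not correctly appended to the string s, and when I cout them I only see a couple of lines from the file.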
void processFiles() {
    //setup locales for file decoding
    std::locale locale_DE_ISO885915("de_DE.iso885915@euro");
    std::locale locale_DE_UTF8("de_DE.UTF-8");
    //std::locale::global(locale_DE_ISO885915);
    //std::cout.imbue(std::locale());
    const std::ctype<wchar_t>& facet_DE_ISO885915 = std::use_facet<std::ctype<wchar_t>>(locale_DE_ISO885915);
    //std::locale::global(locale_DE_UTF8);
    //std::cout.imbue(std::locale());
    const std::ctype<wchar_t>& facet_DE_UTF8 = std::use_facet<std::ctype<wchar_t>>(locale_DE_UTF8);
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
    std::string currFile, fileStr;
    std::wifstream inFile;
    std::wstring s;
    for (std::vector<std::string>::const_iterator fci = files.begin(); fci != files.end(); ++fci) {
        currFile = *fci;
        //check file and set locale
        if (currFile.find("-8.txt") != std::string::npos) {
            std::locale::global(locale_DE_ISO885915);
            std::cout.imbue(locale_DE_ISO885915);
        }
        else {
            std::locale::global(locale_DE_UTF8);
            std::cout.imbue(locale_DE_UTF8);
        }
        inFile.open(path + currFile, std::ios_base::binary);
        if (!inFile) {
            //TODO specific file report
            std::cerr << "Failed to open file " << *fci << std::endl;
            exit(1);
        }
        s.clear();
        //read file content
        std::wstring line;
        while( (inFile.good()) && std::getline(inFile, line) ) {
            s.append(line + L"\n");
        }
        inFile.close();
        //remove punctuation, numbers, tolower...
        for (unsigned int i = 0; i < s.length(); ++i) {
            if (ispunct(s[i]) || isdigit(s[i]))
                s[i] = L' ';
        }
        if (currFile.find("-8.txt") != std::string::npos) {
            facet_DE_ISO885915.tolower(&s[0], &s[0] + s.size());
        }
        else {
            facet_DE_UTF8.tolower(&s[0], &s[0] + s.size());
        }
        fileStr = converter.to_bytes(s);
        std::cout << fileStr << std::endl;
        std::cout << currFile << std::endl;
        std::cout << fileStr.size() << std::endl;
        std::cout << std::setlocale(LC_ALL, NULL) << std::endl;
        std::cout << "========================================================================================" << std::endl;
        // Process...
    }
    return;
}
As you can see in the code, I have tried both the global locale and local locale variables, to no avail.
Moreover, the SO answer to "How can I use std::imbue to set the locale for std::wcout?" states:
So it really looks like there was an underlying C library mechanism that should be first enabled with setlocale to allow imbue conversion to work correctly.
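Is this a problem with that "obscure" mechanism?
Is it possible to alternate between the two locales while processing the files? What should I imbue (cout, ifstream, getline?) and how?
Any suggestions?
PS: Why is everything related to locale so chaotic? :|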
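This works for me on my Linux machine, but not on my Windows machine under Cygwin (the set of available locales is apparently the same on both machines, but std::locale::locale fails with every imaginable locale string).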
#include <iostream>
#include <fstream>
#include <locale>
#include <string>

void printFile(const char* name, const char* loc)
{
    try {
        std::wifstream inFile;
        // imbue before open, so the file is decoded with this locale
        inFile.imbue(std::locale(loc));
        inFile.open(name);
        std::wstring line;
        while (getline(inFile, line))
            std::wcout << line << '\n';
    } catch (std::exception& e) {
        // std::locale(loc) throws if the locale name is not available
        std::cerr << e.what() << std::endl;
    }
}

int main()
{
    std::locale::global(std::locale("en_US.utf8"));
    printFile ("gtext-u8.txt", "de_DE.utf8"); // utf-8 text: grüßen
    printFile ("gtext-legacy.txt", "de_DE@euro"); // iso8859-15 text: grüßen
}
Output:
grüßen
grüßen