Automatic Assignment of Character Encoding?
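I'm confused about how to set the encoding of an output file.
I have a test file with the content "qwe" (one character per line). I have tested a few ISO-x encodings. I read the file and produce an output file, but the output file is always encoded in UTF-8. That is confusing in itself, because I never explicitly wrote code to make the output file UTF-8. Even more confusing, in another program I take UTF-8 as input and get some ISO encoding as output... again, without my telling it to change the encoding.
Here is my test code: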
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
int main(){
    string in_file = "in.txt"; // some ISO encoding (e.g.)
    ifstream in(in_file.c_str());
    ofstream out;
    out.open("out.txt");
    // Loop on getline() itself: testing in.good() before the read would
    // write one extra empty line after the last successful read.
    string line;
    while (getline(in, line)) {
        out << line << endl;
    }
    out.close(); // output file is in UTF-8
}
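The code of the other program, the one that produces some ISO encoding from UTF-8 input, is very long, and I couldn't find the difference between the test program and my actual program. But maybe understanding why the test program behaves the way it does will already let me figure out the other problem.
So, basically my question is: why is the output file set to UTF-8, or what determines the encoding of ofstream objects?
EDIT:
Okay, I have made my actual code a bit handier, so I can now show it to you more easily.
I have two functions that run at the top level. They construct a trie from an input list, and they also contain the code that generates DOT code for graphviz.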
/*
 *
 * name: make_trie
 * @param trie Trie to build
 * @param type Type of tokens to build trie for
 * @param gen_gv_code Determines whether to generate graphviz code
 * (for debug and maintenance purposes)
 * @return
 *
 */
bool make_trie(Trie* trie, std::string type, bool gen_gv_code=false){
    if (gen_gv_code){
        gv_file
            << "digraph {\n\t"
            << "rankdir=LR;\n\t"
            << "node [shape = circle];\n\t"
            << "1 [label=1]\n\t"
            << "node [shape = point ]; Start\n\t"
            << "Start -> 1\n\t\t";
    }
    Element* current = wp_list->iterate();
    state_symbol_tuple* sst;
    std::string token = "<UNDEF>"; // token to add to trie
    // once the last entry in the input list is encountered, make_trie()
    // needs to run for as many times as that entry has letters +1 - the
    // number of letters of that string already encoded into the trie to
    // fully encode it into it.
    bool last_token = false;
    bool incr = false;
    while (true){
        if (type == "tag") { token = current->get_WPTuple_tag(); }
        else if (type == "word") { token = current->get_WPTuple_word(); }
        else {
            cerr
                << "Error (trainer.h):"
                << "Unknown type '"
                << type
                << "'. Token has not been assigned."
                << endl;
            abort();
        }
        // last_state is pointer to state the last transition in the trie
        // that matched the string led to
        sst = trie->find_state(token);
        incr = trie->add(current, sst, gv_file, gen_gv_code);
        // as soon as the last token has been encoded into the trie, break
        if (last_token && sst->existing) { break; }
        // go to the next list item only once the current one is represented
        // in the trie
        if (incr) {
            // Once a word has been coded into the trie, go to the next word.
            // Only iterate if you are not at the last element, otherwise
            // you start at the front of the list again.
            if (current->next != 0){
                current = wp_list->iterate(); incr = false;
            }
        }
        // enable the condition for the last token, as this is a boundary
        // case
        if (current->next == 0) { last_token = true;}
        // free up memory allocated for current sst
        delete sst;
    }
    if (gen_gv_code){
        gv_file << "}";
        gv_file.close();
    }
    return true;
}
/*
 *
 * name: Trie::add
 * @details Encodes a given string into the trie. If the string is not
 * in the trie yet, it needs to be passed to this function as many
 * times as it has letters +1.
 * @param current list element
 * @param sst state_symbol_tuple containing information on the last
 * state that represents the string to be encoded up to some point.
 * Also contains the string itself.
 * @return returns boolean, true if token is already represented
 * in trie, false else
 *
 */
bool Trie::add(Element* current, state_symbol_tuple* sst,
               std::ofstream &gv_file_local, bool gen_gv_code){
    if (current != 0){
        // if the word is represented in the trie, increment its counter
        // and go to the next word in the list
        if (sst->existing){
            (((sst->state)->find_transition(sst->symbol))->get_successor())->increment_occurance();
            if (gen_gv_code){
                gv_file_local
                    << (((sst->state)->find_transition(sst->symbol))->get_successor())->get_id()
                    << "[shape = ellipse label = \""
                    << (((sst->state)->find_transition(sst->symbol))->get_successor())->get_id()
                    << "\nocc: "
                    << (((sst->state)->find_transition(sst->symbol))->get_successor())->get_occurance()
                    //~ << "\naddr: "
                    //~ << ((sst->state)->find_transition(sst->symbol))->get_successor()
                    << "\" peripheries=2]\n\t\t";
            }
            return true;
        }
        // if the current string is a substring of one already encoded into
        // the trie, make the substring an accepted one too
        else if (sst->is_substring){
            (((sst->state)->find_transition(sst->symbol))->get_successor())
                ->make_accepting();
        }
        // if the word isn't represented in the trie, make a transition
        // for the first character of the word that wasn't represented
        // and then look for the word anew, until it *is* represented.
        else {
            (sst->state)->append_back(sst->symbol);
            // as the new transition has been appended at the back
            // "last" is that new transition
            // make an empty successor state that the new transition
            // points to
            ((sst->state)->get_last())->make_successor();
            // increment total state count
            increment_states_total();
            // give the newly added state a unique ID, by making its ID
            // the current number of states
            (((sst->state)->get_last())->get_successor())->set_id(get_states_total());
            if (gen_gv_code){
                gv_file_local << (sst->state)->get_id() << " -> " << get_states_total()
                    << "[label=\"";
                if (sst->symbol == '"') {
                    gv_file_local << "#";
                }
                else{
                    gv_file_local << sst->symbol;
                }
                gv_file_local << "\"]\n\t\t";
            }
            get_states_total();
            // if the length of the input string -1 is equal to the
            // index of the last symbol that was processed, then that
            // was the last symbol of the string and the new state needs
            // to become an accepting one
            if (sst->index == (sst->str_len-1)){
                // access the newly created successor state
                // define it as an accepting state
                (((sst->state)->get_last())->get_successor())->make_accepting();
            }
            else if (gen_gv_code){
                gv_file_local
                    << get_states_total()
                    << "[shape = circle label = \""
                    << (((sst->state)->find_transition(sst->symbol))->get_successor())->get_id()
                    //~ << "\naddr: "
                    //~ << ((sst->state)->find_transition(sst->symbol))->get_successor()
                    << "\"]\n\t\t";
            }
        }
    } else { cerr << "list to build trie from is empty" << endl; abort();}
    return false;
}
The output file is opened like this:
gv_file.open("gv_file");
and then passed on like this:
make_trie(trie_words, "word", true);
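Since this is a question about encoding, the details of the implementation don't matter; what matters are the bits where the DOT code is written to the output file.
My test input is this (in UTF-8):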
ascii-range
ütf-8-ränge
My output is this (in ISO-8859):
digraph {
rankdir=LR;
node [shape = circle];
1 [label=1]
node [shape = point ]; Start
Start -> 1
1 -> 2[label="a"]
2[shape = circle label = "2"]
2 -> 3[label="s"]
3[shape = circle label = "3"]
3 -> 4[label="c"]
4[shape = circle label = "4"]
4 -> 5[label="i"]
5[shape = circle label = "5"]
5 -> 6[label="i"]
6[shape = circle label = "6"]
6 -> 7[label="-"]
7[shape = circle label = "7"]
7 -> 8[label="r"]
8[shape = circle label = "8"]
8 -> 9[label="a"]
9[shape = circle label = "9"]
9 -> 10[label="n"]
10[shape = circle label = "10"]
10 -> 11[label="g"]
11[shape = circle label = "11"]
11 -> 12[label="e"]
12[shape = ellipse label = "12
occ: 1" peripheries=2]
1 -> 13[label="Ã"]
13[shape = circle label = "13"]
13 -> 14[label="Œ"]
14[shape = circle label = "14"]
14 -> 15[label="t"]
15[shape = circle label = "15"]
15 -> 16[label="f"]
16[shape = circle label = "16"]
16 -> 17[label="-"]
17[shape = circle label = "17"]
17 -> 18[label="8"]
18[shape = circle label = "18"]
18 -> 19[label="-"]
19[shape = circle label = "19"]
19 -> 20[label="r"]
20[shape = circle label = "20"]
20 -> 21[label="Ã"]
21[shape = circle label = "21"]
21 -> 22[label="€"]
22[shape = circle label = "22"]
22 -> 23[label="n"]
23[shape = circle label = "23"]
23 -> 24[label="g"]
24[shape = circle label = "24"]
24 -> 25[label="e"]
25[shape = ellipse label = "25
occ: 1" peripheries=2]
}
So yeah... how can I make sure that my output is encoded in UTF-8 as well?
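In UTF-8, some characters are encoded as more than one byte. For example, ä takes two bytes to encode. Your code for reading the string completely ignores this and assumes one byte per character. You then output the bytes separately; that is not legal UTF-8, so whatever you are using to work out the character set deduces that it must be ISO-8859.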
Specifically, the two characters Ã and € in ISO-8859-15 are exactly the two bytes (0xC3 0xA4) of the UTF-8 encoding of ä.
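The byte-level correspondence can be checked with a small hex dump. This is only a sketch: `to_hex` is a hypothetical helper name, not something from the question's code.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: renders every byte of `data` as two uppercase hex
// digits, with single spaces between bytes.
std::string to_hex(const std::string& data) {
    std::string out;
    char buf[3];
    for (std::size_t i = 0; i < data.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(data[i]);
        std::snprintf(buf, sizeof buf, "%02X", value);
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}
```

For example, `to_hex("\xC3\xA4")` (the UTF-8 bytes of ä) gives `C3 A4`. Written out one byte per label, those are the bytes a viewer that assumes ISO-8859-15 displays as Ã and €.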
This would be more obvious if you looked at the raw bytes, as I suggested earlier.
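One possible way to keep the output valid UTF-8 is to treat each multi-byte sequence as a single symbol, so that a whole character ends up in one label. The following is a minimal sketch under the assumption that the input is well-formed UTF-8; `utf8_sequence_length` and `split_utf8` are hypothetical names, and the trie in the question would have to store a `std::string` per transition instead of a single `char` for this to apply.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Returns 1 for ASCII and also for stray continuation or invalid bytes,
// so malformed input still advances one byte at a time.
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // continuation or invalid byte
}

// Hypothetical helper: splits a UTF-8 string into one std::string per
// encoded character. Continuation bytes are not validated.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> symbols;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(text[i]));
        if (i + len > text.size()) len = text.size() - i;  // truncated tail
        symbols.push_back(text.substr(i, len));
        i += len;
    }
    return symbols;
}
```

With this, the word "rän" given as UTF-8 bytes splits into three symbols (r, ä, n) rather than four single bytes, and writing each symbol unchanged to the `ofstream` keeps the two bytes of ä together.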