C++ memcpy copy of object appears corrupted
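As part of a class project to implement a compiler, I am also implementing a hash table to serve as the compiler's symbol table.
The hash table is intended to be very low level: a hand-packed block of raw memory that only ever stores Token objects. To make the hash table easy to serialize, I decided to inline the tokens in the table, that is, to memcpy the Token object into the table's memory when the token is first inserted.
I know that one should not memcpy a class that has virtual functions or pointers, and that in general using memcpy on objects of a class is bad practice. However, as the declaration below shows, the Token class has no virtual functions or pointers, and I would not use memcpy if this were not an exercise in low-level programming.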
class Token {
public:
Token() : tag(BAD) {}
Token(Tag t) : tag(t) {}
Tag tag;
size_t getSize(){ return sizeof(Token); }
};
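The claim that Token is safe to memcpy can be checked at compile time. The following is a minimal sketch, not the project's code: the Tag enum is not shown in the question, so the one here (including the value 46 taken from the log) is a hypothetical stand-in, and storeToken and loadToken are illustrative helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical stand-in for the question's Tag enum (the real one is not shown).
enum Tag { BAD = 0, PROGRAM = 46 };

class Token {
public:
    Token() : tag(BAD) {}
    Token(Tag t) : tag(t) {}
    Tag tag;
    std::size_t getSize() const { return sizeof(Token); }
};

// memcpy is only well-defined for trivially copyable types, so fail the
// build if Token ever gains a virtual function or a non-trivial member.
static_assert(std::is_trivially_copyable<Token>::value,
              "Token must be trivially copyable to be stored with memcpy");

// Copy a Token into raw table storage at an arbitrary byte offset.
inline void storeToken(unsigned char* dst, const Token& value) {
    std::memcpy(dst, &value, sizeof(Token));
}

// Copy a Token back out instead of dereferencing a Token* that may point
// at a misaligned address inside the packed table.
inline Token loadToken(const unsigned char* src) {
    Token out;
    std::memcpy(&out, src, sizeof(Token));
    return out;
}
```

Reading through memcpy rather than casting to Token* also sidesteps alignment concerns, since a packed table places tokens at odd addresses (14287653 in the log below is one).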
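The problem I am running into is that the hash table inserts the tokens correctly and finds the same memory location when the same key is looked up; however, when I try to access the members of the returned Token pointer, the data appears to have been corrupted.
I wrote a program to test the symbol table on a simple input. The program does the following:
- Reads the input file into a buffer.
- Initializes the Lexer by inserting all built-in tokens into the Lexer's symbol table.
- Calls the Lexer's getToken method on the input and prints the token's tag until the EOF token is read.
Although the symbol table returns the same memory location where the token was inserted, the token's tag attribute no longer matches the tag that was originally inserted. Below is the log output from when the program inserts the keyword program into the symbol table: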
[debug] Entering SymbolTable.insert(const char* key)
[debug] bin: 48 Searching for the last element in this bin's linked list.
[debug] Last entry found... writing to entry's next location field.
[debug] Writing the new entry's identifier to the table.
[debug] The identifier: program has been written to the table.
[debug] The memory blocks are not equal prior to assignment.
[debug] The memory blocks are equal.
[debug] nextLoc: 571 tag: 46
[debug] Location of Token: 14287688
Below is the log output from when the program subsequently looks up the identifier program in the symbol table:
[debug] Entering Lexer.getToken()
[debug] Entering SymbolTable.contains(const char* key)
[debug] Entering SymbolTable.find(const char* key) key: program
[debug] bin: 48
[debug] Search location: 541
[debug] Comparing key char: p to table char: p
[debug] Comparing key char: r to table char: a
[debug] Tag of entry: 1684368227
[debug] Location of Token: 14287653
[debug] Search location: 557
[debug] Comparing key char: p to table char: p
[debug] Comparing key char: r to table char: r
[debug] Comparing key char: o to table char: o
[debug] Comparing key char: g to table char: c
[debug] Tag of entry: 1920296037
[debug] Location of Token: 14287668
[debug] Search location: 0
[debug] Comparing key char: p to table char: p
[debug] Comparing key char: r to table char: r
[debug] Comparing key char: o to table char: o
[debug] Comparing key char: g to table char: g
[debug] Comparing key char: r to table char: r
[debug] Comparing key char: a to table char: a
[debug] Comparing key char: m to table char: m
[debug] Tag of entry: 1207959598
[debug] Location of Token: 14287688
The 1th token: 60
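As the Location of Token messages show, the symbol table finds the same memory location that it wrote the Token to, but the Token's tag is different. I am stumped as to why this happens.
For completeness, here is the definition of the SymbolTable class.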
template<class sizeType>
class SymbolTable{
public:
SymbolTable();
~SymbolTable();
Token* operator[](const char* key);
bool contains(const char* key);
Token* insert(const char* key, Token value);
private:
void* find(const char* key);
static const size_t nbins = 64;
sizeType nextLoc;
void* table;
size_t hash(const char* key){
return (size_t)key[0] % nbins;
}
};
Here is the source code for the symbol table's insert, find, and operator[] functions.
template<class sizeType> Token* SymbolTable<sizeType>::insert(const char* key, Token value){
BOOST_LOG_TRIVIAL(debug) << "Entering SymbolTable.insert(const char* key)";
size_t bin = hash(key);
void *sizeType_ptr,
*tbl_char_ptr,
*idSize_ptr;
unsigned char idSize = 0;
const char *key_ptr = key;
Token *token_ptr = NULL;
// Find the last entry in this bin's linked list.
BOOST_LOG_TRIVIAL(debug) << "bin: " << bin
<< " Searching for the last element in this bin's linked list.";
sizeType_ptr = table + sizeof(sizeType)*bin;
while(*((sizeType*)sizeType_ptr) != 0){
sizeType_ptr = table + *((sizeType*)sizeType_ptr);
}
BOOST_LOG_TRIVIAL(debug) << "Last entry found... writing to entry's next location field.";
// Write the location of the new entry to this entry's next field.
*((sizeType*)sizeType_ptr) = nextLoc;
// Move to new entry's location.
sizeType_ptr = table + nextLoc;
// Write identifier
BOOST_LOG_TRIVIAL(debug) << "Writing the new entry's identifier to the table.";
tbl_char_ptr = sizeType_ptr + sizeof(sizeType) + sizeof(unsigned char);
while(*key_ptr != '\0'){
*((char*)tbl_char_ptr) = *key_ptr;
tbl_char_ptr = tbl_char_ptr + sizeof(char);
++key_ptr;
++idSize;
}
BOOST_LOG_TRIVIAL(debug) << "The identifier: " << key << " has been written to the table.";
// Write length of identifier.
idSize_ptr = sizeType_ptr + sizeof(sizeType);
*((unsigned char*)idSize_ptr) = idSize;
token_ptr = (Token*)(tbl_char_ptr + sizeof(char));
void *tk = &value,
*tb = token_ptr;
for(int i = 0; i < value.getSize(); ++i){
if(*((char*)tk) != *((char*)tb)){
BOOST_LOG_TRIVIAL(debug) << "The memory blocks are not equal prior to assignment.";
break;
}
}
memcpy((void*)token_ptr, &value, value.getSize());
bool areEqual = true;
for(int i = 0; i < value.getSize(); ++i){
if(*((char*)tk) != *((char*)tb)){
areEqual = false;
BOOST_LOG_TRIVIAL(debug) << "The memory blocks are not after assignment.";
break;
}
}
if(areEqual){
BOOST_LOG_TRIVIAL(debug) << "The memory blocks are equal.";
}
nextLoc = nextLoc + sizeof(sizeType) + sizeof(unsigned char)
+ idSize*sizeof(char) + value.getSize();
BOOST_LOG_TRIVIAL(debug) << "nextLoc: " << nextLoc
<< " tag: " << token_ptr->tag;
BOOST_LOG_TRIVIAL(debug) << "Location of Token: " << (size_t)token_ptr;
return token_ptr;
}
template<class sizeType>
void* SymbolTable<sizeType>::find(const char* key){
BOOST_LOG_TRIVIAL(debug) << "Entering SymbolTable.find(const char* key) "
<< "key: " << key;
bool found = false;
size_t bin = hash(key);
void *idSize,
*sizeType_ptr,
*tbl_char_ptr,
*result_ptr = NULL;
const char* key_ptr = key;
BOOST_LOG_TRIVIAL(debug) << "bin: " << bin;
sizeType_ptr = table + sizeof(sizeType)*bin;
while(!found){
found = true;
// Is this the last element in this bin?
if(*((sizeType*)sizeType_ptr) == 0){
result_ptr = NULL;
return result_ptr;
}
// Advance to the next element in this bin's linked list.
sizeType_ptr = table + *((sizeType*)sizeType_ptr);
idSize = sizeType_ptr + sizeof(sizeType);
tbl_char_ptr = idSize + sizeof(unsigned char);
BOOST_LOG_TRIVIAL(debug) << "Search location: " << *((sizeType*)sizeType_ptr);
// Check if the passed key matches the current key in the table.
for(int i = 0; i < *((unsigned char*)idSize); ++i){
BOOST_LOG_TRIVIAL(debug) << "Comparing key char: " << *key_ptr
<< "to table char: " << *((const char*)tbl_char_ptr);
// Check if the key is too short or if the chars do not match.
if(*key_ptr != *((const char*)tbl_char_ptr)){
found = false;
break;
}
++key_ptr;
tbl_char_ptr = tbl_char_ptr + sizeof(char);
BOOST_LOG_TRIVIAL(debug) << "*(const char*)tbl_char_ptr: "
<< *((const char*)tbl_char_ptr);
}
result_ptr = tbl_char_ptr + sizeof(char);
BOOST_LOG_TRIVIAL(debug) << "Tag of entry: " << ((Token*)result_ptr)->tag;
BOOST_LOG_TRIVIAL(debug) << "Location of Token: " << (size_t)result_ptr;
key_ptr = key;
}
return result_ptr;
}
template<class sizeType>
Token* SymbolTable<sizeType>::operator[](const char* key){
BOOST_LOG_TRIVIAL(debug) << "Entering SymbolTable.operator[](const char* key)";
void* void_ptr = find(key);
Token* token_ptr = (Token*)void_ptr;
return token_ptr;
}
Here is the source code of the test program.
int main(){
cout << "Executing testLexer.cpp" << endl;
ifstream file("./pascalPrograms/helloworld.pas");
string program((istreambuf_iterator<char>(file)), istreambuf_iterator<char>());
cout << "program:\n\n" << program << endl;
int fileSize = program.length();
const char* buffer = program.c_str();
const char* scanp = buffer;
cout << "Instantiating Lexer" << endl << endl;
Lexer lexer = Lexer(scanp);
Token* tok;
int i = 0;
cout << "Entering while loop to fetch tags." << endl << endl;
do{
i++;
tok = lexer.getToken();
cout << "The " << i << "th token: " << tok->tag << endl;
} while(tok->tag != END_OF_FILE);
return 0;
}
Thanks in advance for your help! :D
Edit:
Here is the input Pascal program:
program Hello;
begin
writeln ('Hello, world.');
readln
end.
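And to clarify the question:
Why does the Token read back from the symbol table differ from the one that was inserted, when the Token in the symbol table is an exact copy of the original?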
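Found it. You are overwriting one byte of the tag, the most significant one, with 'H'. The other bytes are fine. Now to find out where the H comes from...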
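The observation can be verified from the two tag values in the log: 46 was written and 1207959598 was read back. The sketch below assumes the tag is stored as a 32-bit value (the question does not show the Tag type); highByte and lowBytes are illustrative helper names.

```cpp
#include <cassert>
#include <cstdint>

// Most significant byte of a 32-bit value.
inline unsigned char highByte(std::uint32_t v) {
    return static_cast<unsigned char>(v >> 24);
}

// The remaining three low-order bytes.
inline std::uint32_t lowBytes(std::uint32_t v) {
    return v & 0x00FFFFFFu;
}
```

With these, highByte(1207959598u) is 0x48, the character 'H', and lowBytes(1207959598u) is 46, the tag that was originally inserted. In other words, 1207959598 is 0x4800002E.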
nextLoc = nextLoc + sizeof(sizeType) + sizeof(unsigned char)
+ idSize*sizeof(char) + value.getSize();
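You need to add one more here. You account for skipping the link (sizeType), the length byte (unsigned char), the id itself (idSize * sizeof(char)), and the value (value.getSize()), but you also leave one byte between the id and the value that you have not accounted for. That is why the last byte of your tag gets overwritten, and because you are testing on a little-endian machine, the last byte in memory is the highest byte of the value, so that is the one that ends up corrupted.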
for(int i = 0; i < *((unsigned char*)idSize); ++i){
...
tbl_char_ptr = tbl_char_ptr + sizeof(char);
...
}
result_ptr = tbl_char_ptr + sizeof(char);
That advances the pointer by one more than idSize.
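The off-by-one can be made concrete by writing the entry layout down as offsets. This is a sketch only: EntryLayout and its member names are hypothetical and do not appear in the question. The sizes used in the note after it (2-byte sizeType, 4-byte Token) are an assumption, chosen because they are consistent with the log if the entry for program starts at 557.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of one packed table entry as insert() writes it:
// [next link: sizeType][idSize: 1 byte][id: idSize bytes][1 unused byte][Token]
struct EntryLayout {
    std::size_t linkSize;   // sizeof(sizeType)
    std::size_t idSize;     // identifier length in bytes
    std::size_t tokenSize;  // sizeof(Token)

    // The id starts after the link and the length byte.
    std::size_t idOffset() const { return linkSize + sizeof(unsigned char); }

    // insert() and find() both step one byte past the end of the id.
    std::size_t tokenOffset() const { return idOffset() + idSize + 1; }

    // What the original nextLoc update adds: one byte short.
    std::size_t originalAdvance() const {
        return linkSize + sizeof(unsigned char) + idSize + tokenSize;
    }

    // What it should add so the next entry starts after the Token.
    std::size_t correctedAdvance() const { return tokenOffset() + tokenSize; }
};
```

For program, EntryLayout{2, 7, 4} gives a token offset of 11 and an original advance of 14, so 557 + 14 = 571 matches the logged nextLoc while the Token occupies bytes 568 through 571. The next entry therefore begins on the Token's last byte. The corrected advance is 15. An alternative fix is to drop the extra + sizeof(char) in both insert and find so that no gap byte exists at all.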