Function call to c_str() vs const char* in hash function

Question

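I was looking at hash functions on Stack Overflow and found a very interesting one. It casts a const char* to a size_t* and then dereferences it to get a size_t. The result is then bit-shifted to a certain precision. This works for a const char*, producing the same value every time. However, when I use an actual string type and call c_str() instead, the two values produced do not match. In addition, the value obtained from the string is different on every run of the program. Does anyone know why this happens?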

const string l = "BA";
const char* k = l.c_str();
const char* p = "BA";
cout << k << " " << *((size_t*)k) << endl;
cout << p << " " << *((size_t*)p) << endl;

Run 1:

BA 140736766951746
BA 7162260525311607106

Run 2:

BA 140736985055554
BA 7162260525311607106

Original question: Have a good hash function for a C++ hash table?

Answer 1

// A simple null-terminated character string, represented in memory as:
//
// ['B', 'A', '\0']
const char* p = "BA";

// On the other hand, `std::string` isn't so simple.
//
// c_str() returns a pointer to some kind of buffer.
//
// ['B', 'A', '\0', ... reserved_memory]
//
const std::string l = "BA";
const char* k = l.c_str();

// Then you do a C-style cast.
//
// (size_t*)k reinterprets the address of the std::string's underlying
// character data (on the heap, or inside the string object itself, here on
// the stack, when the short string optimization (SSO) applies). Dereferencing
// it reads sizeof(size_t) bytes from that address. This is undefined
// behavior: you read 8 bytes (depending on the size of size_t), but the
// string itself may occupy fewer, here only 3. The extra bytes are
// garbage.
std::cout << k << " " << *((size_t*)k) << std::endl;

// Two strings created as
//
// const char* foo = "foo";
// const char* bar = "foo";
//
// are typically stored in the read-only data segment of your executable.
// Compilers commonly merge identical literals, so both pointers may point
// to the same string in that segment, although this is not guaranteed. The
// same undefined behavior mentioned earlier applies here as well.
std::cout << p << " " << *((size_t*)p) << std::endl;
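Whether c_str() points into the string object or to the heap can be checked empirically. The following is a minimal sketch; the helper name buffer_is_inline is hypothetical, and the result is an implementation detail of your standard library, not something the C++ standard guarantees.

```cpp
#include <functional>
#include <string>

// Returns true when the buffer returned by c_str() lies inside the bytes of
// the std::string object itself, which is what a short string optimization
// looks like from the outside. Diagnostic only: the answer depends on the
// standard library implementation.
bool buffer_is_inline(const std::string& s) {
    const char* obj_begin = reinterpret_cast<const char*>(&s);
    const char* obj_end = obj_begin + sizeof(std::string);
    const char* buf = s.c_str();
    // std::less provides a total order even for pointers into unrelated objects.
    const std::less<const char*> before{};
    return !before(buf, obj_begin) && before(buf, obj_end);
}
```

Calling buffer_is_inline(std::string("BA")) will likely return true on common implementations that use SSO, while a string of 1000 characters cannot fit inside the object and has to live elsewhere.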

Answer 2

I'll start with this:

const string l = "BA";
const char* k = l.c_str();
const char* p = "BA";
cout << k << " " << *((size_t*)k) << endl;
cout << p << " " << *((size_t*)p) << endl;

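Both *((size_t*)k) and *((size_t*)p) invoke undefined behavior. This is because, on most systems, they access data beyond the bounds of the char array. Note that sizeof(size_t) > 3 * sizeof(char) on both 32-bit and 64-bit systems, so *((size_t*)k) reads at least one byte out of bounds.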
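The size arithmetic can be written down as compile-time constants. This is only an illustrative sketch; the names kLiteralBytes and kBytesPastEnd are made up for this example.

```cpp
#include <cstddef>

// The literal "BA" has type const char[3]: 'B', 'A', and the terminating '\0'.
constexpr std::size_t kLiteralBytes = sizeof("BA");

// Number of bytes that reading one std::size_t would touch past the end of
// that array (0 if the array were at least as large as std::size_t).
constexpr std::size_t kBytesPastEnd =
    sizeof(std::size_t) > kLiteralBytes ? sizeof(std::size_t) - kLiteralBytes : 0;
```

On a platform where size_t is 8 bytes, kBytesPastEnd is 5. Where size_t is 4 bytes, it is 1.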

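In your example, the string literal (on your system) lives in read-only data, so the bytes that follow "BA" and its NUL terminator are fixed when the program is built (don't count on it, but that is how it looks). They need not be zero padding: the printed value 7162260525311607106 is 0x6365760020004142, which on a little-endian machine is the byte sequence 'B', 'A', NUL, ' ', NUL, 'v', 'e', 'c', so the extra bytes appear to be neighboring literals. Either way, the result is consistent across runs.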

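If k comes from a std::string, you are not so lucky. The string is short, so most implementations will use the short string optimization (SSO). This means the char buffer lives inside the std::string object itself. In your case the string is so short that the bytes read past it are still inside the buffer reserved for the short string optimization. It looks like the rest of that buffer is uninitialized and contains garbage left over from before the function was called. As a result, apart from the first 3 bytes, "BA\0", everything else is random garbage.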
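The values printed in the question can be taken apart byte by byte to see this. Below is a small sketch; byte_at is a hypothetical helper, and reading the result as memory order assumes a little-endian machine, which the question does not state.

```cpp
#include <cstdint>

// Returns byte number `index` of `value`, counting from the least significant
// byte. On a little-endian machine, index 0 is the byte at the lowest address.
constexpr unsigned char byte_at(std::uint64_t value, unsigned index) {
    return static_cast<unsigned char>((value >> (8 * index)) & 0xFF);
}

// Values copied from the output shown in the question.
constexpr std::uint64_t kLiteralValue = 7162260525311607106ULL;  // const char* p
constexpr std::uint64_t kStringRun1 = 140736766951746ULL;        // c_str(), run 1
constexpr std::uint64_t kStringRun2 = 140736985055554ULL;        // c_str(), run 2
```

Bytes 0 to 2 are 0x42, 0x41, 0x00 ("BA" plus the terminator) in all three values. For the two c_str() values the only byte that differs between runs is byte 3 (0xD5 versus 0xE2), and bytes 4 and 5 are 0xFF and 0x7F. That pattern resembles part of a stack address, which would change from run to run under address space layout randomization, though that interpretation is a guess.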

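You are lucky that this instance of undefined behavior only produced some extra garbage rather than something more confusing (such as always returning zero, or calling an unrelated function). Never rely on UB.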

Answer 3

*((size_t*)k) causes undefined behavior by violating the strict aliasing rule. This code is only valid if k actually points to an object of type size_t.

Because the behavior is undefined, seeing strange numbers is one possible outcome (as is anything else).

I guess your intent was something like:

size_t x;
memcpy(&x, k, sizeof x);
cout << k << " " << x << '\n';

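Now it should be clear what the problem is. Your string contains only 3 characters (2 plus the null terminator), but you attempt to read more than 3 characters, which also causes undefined behavior.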
