Writing bits to file?

Question

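I'm trying to implement a Huffman tree.

The contents of the simple .txt file I want to use for a simple test:

aaaaabbbbccd

Character frequencies: a:5, b:4, c:2, d:1

Code table (the 1s and 0s are stored as strings):

a:0
d:100
c:101
b:11

The result I want to write as binary (22 bits):

0000011111111101101100

How can I write each character of this result bit by bit, as binary, to a ".dat" file (not as a string)?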

Answer 1

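Answer: you can't.

The smallest amount you can write to (or read from) a file is a char or unsigned char. For all practical purposes, a char has exactly eight bits.

You will need a one-char buffer, plus a count of how many bits it currently holds. When that count reaches 8, you write the buffer out and reset the count to 0. You also need a way to flush the buffer at the end. (Note that you cannot write 22 bits to a file; you can only write 16 or 24. You need some way to mark which bits at the end are unused.)

Something like: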

#include <cstdio>

struct BitBuffer {
    FILE* file; // Initialization skipped.
    unsigned char buffer = 0; // Bits collected so far, most significant first.
    unsigned count = 0;       // Number of bits currently held in buffer.

    void outputBit(unsigned char bit) {
        buffer <<= 1;         // Make room for next bit.
        if (bit) buffer |= 1; // Set if necessary.
        count++;              // Remember we have added a bit.
        if (count == 8) {
            fwrite(&buffer, sizeof(buffer), 1, file); // Error handling elided.
            buffer = 0;
            count = 0;
        }
    }
};
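The answer mentions that the buffer also has to be flushed at the end, which the struct above leaves out. The following is a minimal sketch of one way to do that, not the answerer's own code: `BitWriter`, `writeCode`, `flush`, and `encodeExample` are hypothetical names, and it assumes 8-bit bytes, bits packed most significant first, and zero bits used as padding.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: collects bits MSB-first and emits whole bytes.
class BitWriter {
public:
    explicit BitWriter(std::ostream& out) : out_(out) {}

    void writeBit(bool bit) {
        buffer_ = static_cast<std::uint8_t>((buffer_ << 1) | (bit ? 1 : 0));
        if (++count_ == 8) emit();
    }

    // Writes a code given as a string of '0' and '1' characters.
    void writeCode(const std::string& code) {
        for (char c : code) writeBit(c == '1');
    }

    // Pads the last partial byte with zero bits, writes it,
    // and returns how many padding bits were added.
    unsigned flush() {
        if (count_ == 0) return 0;
        unsigned padding = 8 - count_;
        buffer_ = static_cast<std::uint8_t>(buffer_ << padding);
        emit();
        return padding;
    }

private:
    void emit() {
        out_.put(static_cast<char>(buffer_)); // Error handling elided.
        buffer_ = 0;
        count_ = 0;
    }

    std::ostream& out_;
    std::uint8_t buffer_ = 0;
    unsigned count_ = 0;
};

// Encodes the 22-bit string from the question into an in-memory stream.
std::string encodeExample(unsigned& padding) {
    std::ostringstream out;
    BitWriter writer(out);
    writer.writeCode("0000011111111101101100");
    padding = writer.flush();
    return out.str();
}
```

For the question's 22 bits this produces three bytes (0x07, 0xFD, 0xB0) with 2 padding bits. To write a real file, the same class could be given a `std::ofstream` opened with `std::ios::binary`; the padding count (or the total bit count) would still have to be stored somewhere so a decoder knows where to stop.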

Answer 2

The OP asked:

How can I write bit-by-bit each character of this result as a binary to ".dat" file? (not as string)

You can't, and here is why...

Memory model

Defines the semantics of a computer memory storage for the purpose of C++ abstract machine.

The memory available to a C++ program is one or more contiguous sequences of bytes. Each byte in memory has a unique address.

Byte

A byte is the smallest addressable unit of memory. It is defined as a contiguous sequence of bits, large enough to hold the value of any UTF-8 code unit (256 distinct values) and of (since C++14) any member of the basic execution character set (the 96 characters that are required to be single-byte). Similar to C, C++ supports bytes of sizes 8 bits and greater.

The types char, unsigned char, and signed char use one byte for both storage and value representation. The number of bits in a byte is accessible as CHAR_BIT or std::numeric_limits<unsigned char>::digits.

Courtesy of cppreference.com.

You can find this page here: cppreference: memory model
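The two ways of querying the byte width named in the cppreference excerpt can be checked at compile time. This is only a small illustrative sketch; `bitsPerByte` is a hypothetical helper name.

```cpp
#include <climits>
#include <limits>

// The standard guarantees at least 8 bits per byte.
static_assert(CHAR_BIT >= 8, "a byte has at least 8 bits");

// Both spellings report the same number.
static_assert(std::numeric_limits<unsigned char>::digits == CHAR_BIT,
              "digits of unsigned char equals CHAR_BIT");

// Hypothetical helper returning the number of bits in a byte.
constexpr int bitsPerByte() { return CHAR_BIT; }
```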

The following is from the 2017-03-21 standard draft:

©ISO/IEC N4659

4.4 The C++ memory model [intro.memory]
The fundamental storage unit in the C++ memory model is the byte. A byte is at least large enough to contain any member of the basic execution character set (5.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits,⁴ the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit. The memory available to a C++ program consists of one or more sequences of contiguous bytes. Every byte has a unique address.

[ Note: The representation of types is described in 6.9. —end note ]

A memory location is either an object of scalar type or a maximal sequence of adjacent bit-fields all having nonzero width. [ Note: Various features of the language, such as references and virtual functions, might involve additional memory locations that are not accessible to programs but are managed by the implementation. —end note ] Two or more threads of execution (4.7) can access separate memory locations without interfering with each other.

[ Note: Thus a bit-field and an adjacent non-bit-field are in separate memory locations, and therefore can be concurrently updated by two threads of execution without interference. The same applies to two bit-fields, if one is declared inside a nested struct declaration and the other is not, or if the two are separated by a zero-length bit-field declaration, or if they are separated by a non-bit-field declaration. It is not safe to concurrently update two bit-fields in the same struct if all fields between them are also bit-fields of nonzero width. —end note ]
[ Example: A structure declared as
struct {
    char a;
    int b:5,
    c:11,
    :0,
    d:8;
    struct {int ee:8;} e;
}
contains four separate memory locations: The field a and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be. —end example ]
4) The number of bits in a byte is reported by the macro CHAR_BIT in the header <climits>.

That version of the standard can be found here: www.open-std.org, § 4.4, pages 8 and 9.

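The smallest unit of memory a program can write is a byte, which the standard defines as 8 or more contiguous bits. Even with bit-fields, the 1-byte requirement still holds. You can manipulate, toggle, and set individual bits within a byte, but you cannot write individual bits.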
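As a small illustration of manipulating individual bits inside a byte, here is a sketch using the usual mask-and-shift idioms. The helper names are hypothetical, and bit index 0 is assumed to be the least significant bit.

```cpp
#include <cstdint>

// Hypothetical helpers; index 0 is the least significant bit.
constexpr std::uint8_t setBit(std::uint8_t byte, unsigned index) {
    return static_cast<std::uint8_t>(byte | (1u << index));
}

constexpr std::uint8_t clearBit(std::uint8_t byte, unsigned index) {
    return static_cast<std::uint8_t>(byte & ~(1u << index));
}

constexpr std::uint8_t toggleBit(std::uint8_t byte, unsigned index) {
    return static_cast<std::uint8_t>(byte ^ (1u << index));
}

constexpr bool testBit(std::uint8_t byte, unsigned index) {
    return ((byte >> index) & 1u) != 0;
}
```

Each helper still reads and returns a whole byte; only the value inside it changes one bit at a time.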

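What you can do is keep a byte buffer together with a count of the bits written into it. Once the required bits are written, you need to mark the remaining unused bits as padding or unused buffer bits.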
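One possible way to mark the unused bits is to store the number of meaningful bits next to the packed bytes, so a reader can ignore the padding. The sketch below assumes that approach; `PackedBits`, `packBits`, and `unpackBits` are hypothetical names, and bits are assumed to be packed most significant first with zero padding.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct PackedBits {
    std::size_t bitCount = 0;        // Number of meaningful bits.
    std::vector<std::uint8_t> bytes; // MSB-first, zero-padded at the end.
};

// Packs a string of '0'/'1' characters into bytes.
PackedBits packBits(const std::string& bits) {
    PackedBits packed;
    packed.bitCount = bits.size();
    packed.bytes.assign((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') {
            packed.bytes[i / 8] |= static_cast<std::uint8_t>(0x80u >> (i % 8));
        }
    }
    return packed;
}

// Recovers the original bit string, skipping the padding bits.
std::string unpackBits(const PackedBits& packed) {
    std::string bits;
    bits.reserve(packed.bitCount);
    for (std::size_t i = 0; i < packed.bitCount; ++i) {
        bool bit = ((packed.bytes[i / 8] >> (7 - i % 8)) & 1u) != 0;
        bits.push_back(bit ? '1' : '0');
    }
    return bits;
}
```

In a file format, `bitCount` (or just the number of padding bits) would typically go into a small header before the packed bytes; the exact layout is a design choice left open here.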

Edit

[Note:] -- One thing that must be taken into consideration when using bit-fields or unions is the endianness of the specific architecture.

Answer 3

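Answer: to some extent, you can.

Hi, in my experience I have found a simple way to do this. For this task you need to define an array of chars yourself (it can be as small as 1 byte, or larger). After that, you define functions to access a specific bit of any element. For example, this is how you can write a function in C++ that gets the value of a given bit of a char (the 3rd bit would be bit_at(3, byte)):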

/* position is in [1, ..., 8], counted from the least significant bit;
   bytes in the array are indexed from 0 */
int bit_at(int position, unsigned char byte)
{
  /* Shift the requested bit down to bit 0 so the result is 0 or 1. */
  return (byte >> (position - 1)) & 1;
}

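Now you can picture the byte array like this: [b1,...,bn]

What we actually have in memory is 8 * n bits, which we can picture like this. Note: the array is zeroed! |0000 0000|0000 0000|...|0000 0000|

From here, you or anyone who wants to can work out how to manipulate it to get a specific bit from this array. Some index conversion will be involved, of course, but that is not a problem. Finally, for the encoding you provided, namely: a:0 d:100 c:101 b:11

we can encode the message "abcd" and build an array holding those bits, using the array of elements as an array of bits, like this: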

|0111 0110|0000 0000|

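You can write this to memory with at most 7 extra bits. This is a simple example, but it can be extended much further. I hope this gives you some answers to your question.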
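The "abcd" example can be reproduced with a short sketch. It is only an illustration under assumptions: `encodeMessage` and `questionTable` are hypothetical names, the code table is the one from the question, and bits are filled from the most significant bit of each byte, matching the diagram above.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Encodes message with the given code table into a zero-initialized byte
// array used as a bit array. bitCount receives the number of bits written.
std::vector<std::uint8_t> encodeMessage(const std::string& message,
                                        const std::map<char, std::string>& table,
                                        std::size_t& bitCount) {
    std::vector<std::uint8_t> bytes;
    bitCount = 0;
    for (char symbol : message) {
        // at() throws std::out_of_range for symbols missing from the table.
        const std::string& code = table.at(symbol);
        for (char c : code) {
            if (bitCount % 8 == 0) bytes.push_back(0); // Start a new zeroed byte.
            if (c == '1') {
                bytes.back() |= static_cast<std::uint8_t>(0x80u >> (bitCount % 8));
            }
            ++bitCount;
        }
    }
    return bytes;
}

// The code table from the question.
const std::map<char, std::string>& questionTable() {
    static const std::map<char, std::string> table{
        {'a', "0"}, {'d', "100"}, {'c', "101"}, {'b', "11"}};
    return table;
}
```

Calling `encodeMessage("abcd", questionTable(), n)` yields the two bytes 0x76 and 0x00 with `n` equal to 9, which is the |0111 0110|0000 0000| layout shown above with 7 unused bits in the second byte.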
