Bit mask - bitwise operations in C
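Example: I have an integer whose binary representation is 0000010010001110.
How can I mask these bits with the pattern 110..... 0.......?
I need to keep the zeros in the mask and preserve all of the active bits, giving the following integer: 110010001110.
I'm new to bitwise operations, so please give me some ideas or advice. Thanks.
Update: I need to mask a wchar_t and output it in its Unicode (UTF-8) representation.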
Read the UTF-8 specs for more detail, but at a high level:
Code points 0 – 007F are stored as regular, single-byte ASCII. Code
points 0080 and above are converted to binary and stored (encoded) in
a series of bytes. The first “count” byte indicates the number of
bytes for the codepoint, including the count byte. These bytes start
with 11..0:
110xxxxx (The leading “110” indicates 2 bytes in sequence, including
the “count” byte)
1110xxxx (1110 -> 3 bytes in sequence)
11110xxx (11110 -> 4 bytes in sequence)
Bytes starting with 10… are “data” bytes and contain information for
the codepoint. A 2-byte example looks like this:
110xxxxx 10xxxxxx
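The prefixes described above can be tested with one mask and one comparison per pattern. The following is a minimal sketch, not a full validator: the function name utf8_byte_kind and its return convention are assumptions made for illustration, and it only inspects the bit pattern of a single byte (it does not reject overlong or out-of-range sequences).

```c
#include <stdint.h>

/* Hypothetical helper: classifies one byte by its leading bits.
   Returns the sequence length announced by a "count" byte (1 to 4),
   0 for a "data" (continuation) byte, or -1 if the byte matches
   none of the UTF-8 prefixes. */
int utf8_byte_kind(uint8_t b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: single-byte ASCII   */
    if ((b & 0xC0) == 0x80) return 0;  /* 10xxxxxx: data byte           */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2 bytes in sequence */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3 bytes in sequence */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4 bytes in sequence */
    return -1;                         /* 11111xxx: never used by UTF-8 */
}
```

Each mask keeps the prefix bits plus the terminating 0 bit, so the comparison matches exactly one pattern.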
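It isn't clear exactly what you need, but if you need to recognize the type of the "count" byte and the "data" bytes, then for the given example:
1100000110100000
(11000001 10100000)
to recognize the "count" byte you can use:
#define BIT_MASK 0x8000 // binary 1000 0000 0000 0000
Then use the & operator to check whether the top bit is set, use a counter to count how many bits are set, and use the << operator to shift left (at most 8 times). Break as soon as an unset bit appears.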
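#include <stdio.h>
#include <stdint.h>

#define BIT_MASK 0x8000       /* top bit of a 16-bit value */
#define MAX_LEFT_SHIFT 8      /* only inspect the leading byte */

int main(void)
{
    uint16_t exm_num = 49568; /* 11000001 10100000 in binary */
    int i, count = 0;

    /* Count the consecutive 1 bits at the top of the value. */
    for (i = 0; i < MAX_LEFT_SHIFT; ++i) {
        if (exm_num & BIT_MASK)
            ++count;
        else
            break;
        exm_num = (uint16_t)(exm_num << 1);
    }
    printf("%d\n", count);
    return 0;
}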
You can then use the final value of count to identify the type.
For the given example the output is 2.
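"I need to mask wchar_t and output that in Unicode (UTF-8) representation"
Have you read the official specification of UTF-8 in the Unicode standard (Section 3.9 - Unicode Encoding Forms), or RFC 3629, or even the UTF-8 documentation on Wikipedia?
They describe the algorithm needed to split a 21-bit codepoint number into an encoded sequence of bytes. Note that wchar_t is 16-bit (UTF-16) on Windows but is 32-bit (UTF-32) on most other platforms. Converting between UTFs is fairly simple, but you have to take into account which UTF you actually have, because converting UTF-16 to UTF-8 is a bit different from converting UTF-32 to UTF-8.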
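To illustrate why the UTF-16 case differs: when wchar_t is 16 bits, a codepoint above 0xFFFF arrives as two elements (a surrogate pair) that must be combined first. A minimal sketch, assuming the two elements have already been read into 16-bit values; the function name utf16_pair_to_codepoint and the 0xFFFFFFFF error value are assumptions for illustration, not part of any standard API.

```c
#include <stdint.h>

/* Hypothetical helper: combines a UTF-16 surrogate pair into a codepoint.
   Returns 0xFFFFFFFF if hi/lo are not a high surrogate (0xD800-0xDBFF)
   followed by a low surrogate (0xDC00-0xDFFF). */
uint32_t utf16_pair_to_codepoint(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF || lo < 0xDC00 || lo > 0xDFFF)
        return 0xFFFFFFFFu;

    /* Each surrogate carries 10 payload bits; the pair encodes
       (codepoint - 0x10000) as a 20-bit number. */
    return 0x10000u + (((uint32_t)(hi - 0xD800) << 10) |
                       (uint32_t)(lo - 0xDC00));
}
```

With 32-bit wchar_t (UTF-32) no such step is needed, since each element is already the codepoint.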
In short, you need something like this:
uint32_t codepoint = ...;
// This is the actual codepoint number, decoded from 1 or 2 wchar_t
// elements, depending on the UTF encoding of the wchar_t sequence.
// In UTF-32, the characters are the actual codepoint numbers as-is.
// In UTF-16, only the characters <= 0xFFFF are the actual codepoint
// numbers, the rest are encoded using surrogate pairs that you would
// have to decode to get the actual codepoint numbers...

uint8_t bytes[4];
int numBytes = 0;

if (codepoint <= 0x7F)
{
    bytes[0] = (uint8_t) codepoint;
    numBytes = 1;
}
else if (codepoint <= 0x7FF)
{
    bytes[0] = 0xC0 | (uint8_t) ((codepoint >> 6) & 0x1F);
    bytes[1] = 0x80 | (uint8_t) (codepoint & 0x3F);
    numBytes = 2;
}
else if (codepoint <= 0xFFFF)
{
    bytes[0] = 0xE0 | (uint8_t) ((codepoint >> 12) & 0x0F);
    bytes[1] = 0x80 | (uint8_t) ((codepoint >> 6) & 0x3F);
    bytes[2] = 0x80 | (uint8_t) (codepoint & 0x3F);
    numBytes = 3;
}
else if (codepoint <= 0x10FFFF)
{
    bytes[0] = 0xF0 | (uint8_t) ((codepoint >> 18) & 0x07);
    bytes[1] = 0x80 | (uint8_t) ((codepoint >> 12) & 0x3F);
    bytes[2] = 0x80 | (uint8_t) ((codepoint >> 6) & 0x3F);
    bytes[3] = 0x80 | (uint8_t) (codepoint & 0x3F);
    numBytes = 4;
}
else
{
    // illegal!
}

// use bytes[] up to numBytes as needed...
This can be simplified to:
uint32_t codepoint = ...; // decoded from wchar_t sequence...

uint8_t bytes[4];
int numBytes = 0;

// Pick the lead-byte prefix and the sequence length.
if (codepoint <= 0x7F)
{
    bytes[0] = 0x00;
    numBytes = 1;
}
else if (codepoint <= 0x7FF)
{
    bytes[0] = 0xC0;
    numBytes = 2;
}
else if (codepoint <= 0xFFFF)
{
    bytes[0] = 0xE0;
    numBytes = 3;
}
else if (codepoint <= 0x10FFFF)
{
    bytes[0] = 0xF0;
    numBytes = 4;
}
else
{
    // illegal! bail out here: bytes[0] is not initialized in this case
}

// Fill the data bytes from last to first, 6 bits at a time.
for (int i = 1; i < numBytes; ++i)
{
    bytes[numBytes-i] = 0x80 | (uint8_t) (codepoint & 0x3F);
    codepoint >>= 6;
}

// The remaining high bits go into the lead byte.
bytes[0] |= (uint8_t) codepoint;

// use bytes[] up to numBytes as needed...
In your example, 0000010010001110 is decimal 1166, hex 0x48E. Codepoint U+048E is encoded in UTF-8 as the bytes 0xD2 0x8E, like this:
0000010010001110b -> 010010b 001110b
0xC0 OR 010010b -> 0xD2
0x80 OR 001110b -> 0x8E
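The worked example can be checked by wrapping the encoding logic in a function. This is a sketch under stated assumptions: the name utf8_encode, its signature, and the convention of returning 0 for an invalid codepoint are hypothetical, and the rejection of the surrogate range 0xD800 to 0xDFFF is an addition that the snippets above do not perform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one codepoint as UTF-8 into out[0..3].
   Returns the number of bytes written, or 0 if cp cannot be encoded. */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    uint8_t lead;
    size_t n;

    if (cp <= 0x7F)          { lead = 0x00; n = 1; }
    else if (cp <= 0x7FF)    { lead = 0xC0; n = 2; }
    else if (cp <= 0xFFFF)
    {
        /* Assumption: surrogates are rejected, as RFC 3629 requires. */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        lead = 0xE0; n = 3;
    }
    else if (cp <= 0x10FFFF) { lead = 0xF0; n = 4; }
    else
        return 0;

    /* Data bytes, last to first: 10xxxxxx with 6 payload bits each. */
    for (size_t i = n - 1; i > 0; --i)
    {
        out[i] = (uint8_t)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    /* Whatever is left fits in the free bits of the lead byte. */
    out[0] = (uint8_t)(lead | cp);
    return n;
}
```

Calling utf8_encode(0x48E, buf) writes 0xD2 0x8E into buf and returns 2, matching the manual calculation.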