用 Bit Packing 优化 C 中的矩阵乘法
Optimizing Matrix multiplication in C with Bit Packing
我目前正在尝试编写一种算法,用于使用位打包优化 GF(2) 上的矩阵乘法。矩阵 A
和 B
均按列主顺序提供,因此我首先将 A
复制为行主顺序,然后将值打包为 8 位整数并使用奇偶校验来加快速度上操作。我需要能够测试高达 2048x2048 的方阵,但是,我当前的实现提供了高达 24x24 的正确答案,然后无法计算出正确的结果。任何帮助,将不胜感激。
//Method which packs an array of integers into 8 bits
uint8_t pack(int *toPack) {
int i;
uint8_t A;
A = 0;
for (i = 0; i < 8; i++) {
A = (A << 1) | (uint8_t)toPack[i];
}
return A;
}
//Method for doing matrix multiplication over GF(2)
void matmul_optimized(int n, int *A, int *B, int *C) {
int i, j, k;
//Copying values of A into a row major order matrix.
int *A_COPY = malloc(n * n * sizeof(int));
int copy_index = 0;
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
A_COPY[copy_index] = A[i + j * n];
copy_index++;
}
}
//Size of the data data type integers will be packed into
const int portion_size = 8;
int portions = n / portion_size;
//Pointer space reserved to store packed integers in row major order
uint8_t *compressedA = malloc(n * portions * sizeof(uint8_t));
uint8_t *compressedB = malloc(n * portions * sizeof(uint8_t));
int a[portion_size];
int b[portion_size];
for (i = 0; i < n; i++) {
for (j = 0; j < portions; j++) {
for (k = 0; k < portion_size; k++) {
a[k] = A_COPY[i * n + j * portion_size + k];
b[k] = B[i * n + j * portion_size + k];
}
compressedA[i * n + j] = pack(a);
compressedB[i * n + j] = pack(b);
}
}
//Calculating final matrix using parity checking and XOR on A and B
int cij;
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
int cIndex = i + j * n;
cij = C[cIndex];
for (k = 0; k < portions; ++k) {
uint8_t temp = compressedA[k + i * n] & compressedB[k + j * n];
temp ^= temp >> 4;
temp ^= temp >> 2;
temp ^= temp >> 1;
uint8_t parity = temp & (uint8_t)1;
cij = cij ^ parity;
}
C[cIndex] = cij;
}
}
free(compressedA);
free(compressedB);
free(A_COPY);
}
我有两点意见:
您可能应该将 cij
初始化为 0
而不是 cij = C[cIndex];
。更新目标矩阵而不是存储 A * B
的结果似乎是不正确的。您的代码可能巧合地适用于小矩阵,因为目标矩阵 C
恰好是这个大小的全零。
将分配大小计算为 malloc(n * n * sizeof(int));
是有风险的,因为如果 int
小于 [=20],n * n
可能会溢出 int n
=].考虑到您使用的大小,这可能不是问题,但始终使用 sizeof
作为第一个操作数以强制转换为以下操作数的 size_t
是个好主意:
int *A_COPY = malloc(sizeof(*A_COPY) * n * n);
我目前正在尝试编写一种算法,用于使用位打包优化 GF(2) 上的矩阵乘法。矩阵 A
和 B
均按列主顺序提供,因此我首先将 A
复制为行主顺序,然后将值打包为 8 位整数并使用奇偶校验来加快速度上操作。我需要能够测试高达 2048x2048 的方阵,但是,我当前的实现提供了高达 24x24 的正确答案,然后无法计算出正确的结果。任何帮助,将不胜感激。
//Method which packs an array of integers into 8 bits
uint8_t pack(int *toPack) {
int i;
uint8_t A;
A = 0;
for (i = 0; i < 8; i++) {
A = (A << 1) | (uint8_t)toPack[i];
}
return A;
}
//Method for doing matrix multiplication over GF(2)
void matmul_optimized(int n, int *A, int *B, int *C) {
int i, j, k;
//Copying values of A into a row major order matrix.
int *A_COPY = malloc(n * n * sizeof(int));
int copy_index = 0;
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
A_COPY[copy_index] = A[i + j * n];
copy_index++;
}
}
//Size of the data data type integers will be packed into
const int portion_size = 8;
int portions = n / portion_size;
//Pointer space reserved to store packed integers in row major order
uint8_t *compressedA = malloc(n * portions * sizeof(uint8_t));
uint8_t *compressedB = malloc(n * portions * sizeof(uint8_t));
int a[portion_size];
int b[portion_size];
for (i = 0; i < n; i++) {
for (j = 0; j < portions; j++) {
for (k = 0; k < portion_size; k++) {
a[k] = A_COPY[i * n + j * portion_size + k];
b[k] = B[i * n + j * portion_size + k];
}
compressedA[i * n + j] = pack(a);
compressedB[i * n + j] = pack(b);
}
}
//Calculating final matrix using parity checking and XOR on A and B
int cij;
for (i = 0; i < n; ++i) {
for (j = 0; j < n; ++j) {
int cIndex = i + j * n;
cij = C[cIndex];
for (k = 0; k < portions; ++k) {
uint8_t temp = compressedA[k + i * n] & compressedB[k + j * n];
temp ^= temp >> 4;
temp ^= temp >> 2;
temp ^= temp >> 1;
uint8_t parity = temp & (uint8_t)1;
cij = cij ^ parity;
}
C[cIndex] = cij;
}
}
free(compressedA);
free(compressedB);
free(A_COPY);
}
我有两点意见:
您可能应该将
cij
初始化为0
而不是cij = C[cIndex];
。更新目标矩阵而不是存储A * B
的结果似乎是不正确的。您的代码可能巧合地适用于小矩阵,因为目标矩阵C
恰好是这个大小的全零。将分配大小计算为
malloc(n * n * sizeof(int));
是有风险的,因为如果int
小于 [=20],n * n
可能会溢出int n
=].考虑到您使用的大小,这可能不是问题,但始终使用sizeof
作为第一个操作数以强制转换为以下操作数的size_t
是个好主意:int *A_COPY = malloc(sizeof(*A_COPY) * n * n);