汇编中的矩阵乘法
Matrix Multiplication in Assembly
我正在用汇编语言编写一些矩阵乘法代码。我不能使用变量,只能在堆栈上存储我需要的东西。该算法似乎工作正常,但我在最后两个代码块中使用寄存器的 IMUL 和 MOV 有问题。我 post 我的代码在这里:
unsigned int m = 3; // raws of mat1
unsigned int n = 2; // columns of mat1
unsigned int k = 4; // columns of mat2
short int mat1[] = { -1,-2, 4,5, 4,-2 }; // first matrix
short int mat2[] = { 2,0,4,6, 0,2,-1,3 }; // second matrix
int mat3[1024]; // output matrix
__asm {
XOR EAX, EAX //mat1 raws counter
XOR EBX, EBX //mat2 columns counter
XOR EDX, EDX //mat1 columns(equal to mat2 raws) counter
XOR EDI, EDI //will contain sum of multiplications to be copied into output matrix
Loop1 : //determinates the raws of output matrix: mat3
XOR EBX, EBX //at the end of first raw, column counter is resetted
CMP m, EAX //if loopped mat1 m-raws times...
JZ Finale //...algortihm is over
INC EAX //increase mat1 raws counter
JMP Loop2
Loop2 : //determinates the columns of mat3
XOR EDX, EDX //at the end of the n-sums, mat1 column counter is resetted
XOR EDI, EDI //after sum of n-multiplications edi is resetted
CMP k, EBX //if multiplications/sums on this raw have been done...
JZ Loop1 //...go to next output matrix raw
INC EBX //increase mat2 columns counter
JMP Loop3
Loop3 : //determinates elements of mat3
CMP n, EDX //if the n-multiplacations/sums on first n-elements have been done...
JZ Loop2 //...skip to next n-elements
INC EDX //increase counter of the elements that will be multiplicate
JMP Stuffs //go to operations code block
Stuffs : //here code generates mat3 elements
#58 MOV SI, mat1[2 * ((EAX - 1) * 2 + (EDX - 1)] //moves to SI the [m-raws/n-clomumn] element of mat1
#59 IMUL SI, mat2[2 * ((EBX - 1) * 2 + (EDX - 1)] //multiplicates(with sign) SI and [n-raws/k-column] element of mat2
ADD DI, SI //sum the result in edi
CMP n, EDX //check the sums
JZ CopyResult //if n-sums have been done...
JMP Loop3 //...go to copy result into mat3
CopyResult :
#66 MOV mat3[4 * ((EAX - 1) * 4 + (EBX - 1))], EDI //copy value to output matrix mat3
JMP Loop3 //go to next n-elements
Finale :
}
{
unsigned int i, j, h;
printf("Output matrix:\n");
for (i = h = 0; i < m; i++) {
for (j = 0; j < k; j++, h++)
printf("%6d ", mat3[h]);
printf("\n");
}
}
在此代码中,编译器针对 mat1、mat2 和 mat3 报告两种类型的错误引用 IMUL 和 MOV。他们在这里:
- 第 58 行 - 错误 C2404 'EDX':'second operand'
中的非法寄存器
- 第 58 行 - 错误 C2430 'second operand'
中多于一个索引寄存器
第 59 行和第 66 行的错误与 EDX 和 EBX 寄存器相同。
这个算法基本好吗? (我测试了一些手动设置
索引,然后是最后一个,在调试期间很好,但我无法完全测试它)。
我认为第一个错误取决于第二个错误,但如果我
不能以这种方式使用寄存器,我该如何计算输出?
与其尝试在寻址模式 () 中将多个寄存器按两个缩放,不如使用 add eax, 2
而不是 inc eax
。
此外,由于您的输出矩阵使用 32 位 int
,因此您应该进行 32 位数学计算。您在 DI 中生成一个值,然后使用第 66 行将该值加上 EDI 高半部分中的任何垃圾进行存储。
有点像
movsx esi, word ptr [rowstart + column]
/
movsx eax, word ptr [offset_in_column + row]
/
imul eax, esi
可能适用于(部分)内循环主体。我会让你在第一种寻址模式中按列递增,在第二种寻址模式中按行递增。
根据我认为您正在尝试做的事情,我认为您的算法可能是合理的。对于输出矩阵的每个元素,循环遍历一个矩阵中的一列和另一个矩阵中的一行。所以你只对输出矩阵的每个元素存储一次。不管你的循环是否真的这样做了,IDK:分支的丑陋程度让我很伤心。 (查看优化编译器输出的某个循环,然后是双重或三重嵌套循环。例如 http://gcc.godbolt.org/)。
嵌套循环的其他方法对于大型矩阵的性能可能更好或更差,但唯一真正好的方法(对于大型矩阵)涉及转置输入矩阵之一,以便您可以循环连续两个矩阵中的内存元素一次(因为转置花费 O(n^2) 时间,但加快了重复遍历转置数组的 O(n^3) 步骤,因为它提供了更多缓存命中)。
(考虑到浮点 matmul 在科学计算中的普遍性,这是一个已经被广泛研究的主题,在代码的实验性调整中投入了大量精力。请参阅 BLAS 中 DGEMM 函数的各种实现。)
谢谢大家。这是我的三重嵌套循环和矩阵乘积的最终代码。对于我测试的内容,它适用于任何矩阵和 positive/negative 值:
void main()
{
unsigned int m = 3; // numero di righe della prima matrice
unsigned int n = 2; // numero di colonne della prima matrice
unsigned int k = 4; // numero di colonne della seconda matrice
short int mat1[] = { -1,-2, 4,5, 4,-2 }; // prima matrice
short int mat2[] = { 2,0,0,0, 0,2,0,0 }; // seconda matrice
int mat3[1024]; // matrice risultato
__asm {
lea eax, mat1 //load mat1
lea edi, mat3 //load mat result
push m
Loop3 : //extern loop
lea ebx, mat2 //load here mat2 to start from the beginning when new result raw starts
xor edx, edx //sets to zero column counter set to zero when new result row starts
Loop2 : //middle loop, as long as k, mat2 columns
xor ecx, ecx //sets to zero mat1 column counter every n multiplications
Loop1 : //inner loop
call Compute //calls sub program that calulates raw/column products
inc ecx //increase column counter
cmp ecx, n //check column counter
jb Loop1 //if below loop1 again
add ebx, 2 //if equal to n, inner loop is over, move mat2 position of one position
inc edx //increase mat2 column counter
cmp edx, k //chek mat2 column counter
jb Loop2 //if below loop2 again
imul esi, n, 2 //else calculate offset to skip to new raw in mat1
add eax, esi //...skip to new mat1 raw
imul esi, k, 4 //calculate offset to skip to new raw in result matrix(mat3)
add edi, esi //...skip to new raw in mat3
dec m //a raw in mat1 has been done, decrease its counter
cmp m, 0 //check how many raws has been done
ja Loop3 //if more than zero, do extern loop again
jmp Finale //else algorithm is over
Compute : //calulates raw/column products
movsx esi, WORD PTR[eax][ecx * 2]
push edi //stores mat3 address to free edi counter
push ecx //stores the actual value of column counter to free ecx register
mov edi, k //calculates the offset in mat2...
imul edi, ecx //...
movsx ecx, WORD PTR[ebx][edi * 2] //mov the value of mat2 to ecx
imul esi, ecx //multiplicates values of mat1 ad mat2
pop ecx //set back column counter
pop edi //set back mat3 address
cmp ecx, 0 //if ecx is zero...
je First //...is the first multiplication for this result value...
add[edi][edx * 4], esi //if not the first, add the value to current position
Back :
ret //in any case, comes back to loops...
First : //...so move here the first value to which add the others
mov[edi][edx * 4], esi //moving value
jmp Back
Finale : //the end
pop m //restore the original mat1 raw value to print the result matrix below
}
//Output on video
{
unsigned int i, j, h;
printf("Product Matrix:\n");
for (i = h = 0; i < m; i++) {
for (j = 0; j < k; j++, h++)
printf("%6d ", mat3[h]);
printf("\n");
}
}
}
这只是我从 stackoverflaw 中找到的双嵌套循环开始编写的三重嵌套循环的算法:
m = raws of first matrix
k = columns of second matrix
n = column of first matrix and raws of second matrix
x=0
Loop3:
y=0
Loop2:
z=0
Loop1:
compute...
z++
if(z<n)
go to Loop1
y++
if(y < k)
go to Loop2
x++
if(x < m)
go to Loop3
Else go to the End
我正在用汇编语言编写一些矩阵乘法代码。我不能使用变量,只能在堆栈上存储我需要的东西。该算法似乎工作正常,但我在最后两个代码块中使用寄存器的 IMUL 和 MOV 有问题。我 post 我的代码在这里:
unsigned int m = 3; // raws of mat1
unsigned int n = 2; // columns of mat1
unsigned int k = 4; // columns of mat2
short int mat1[] = { -1,-2, 4,5, 4,-2 }; // first matrix
short int mat2[] = { 2,0,4,6, 0,2,-1,3 }; // second matrix
int mat3[1024]; // output matrix
__asm {
XOR EAX, EAX //mat1 raws counter
XOR EBX, EBX //mat2 columns counter
XOR EDX, EDX //mat1 columns(equal to mat2 raws) counter
XOR EDI, EDI //will contain sum of multiplications to be copied into output matrix
Loop1 : //determinates the raws of output matrix: mat3
XOR EBX, EBX //at the end of first raw, column counter is resetted
CMP m, EAX //if loopped mat1 m-raws times...
JZ Finale //...algortihm is over
INC EAX //increase mat1 raws counter
JMP Loop2
Loop2 : //determinates the columns of mat3
XOR EDX, EDX //at the end of the n-sums, mat1 column counter is resetted
XOR EDI, EDI //after sum of n-multiplications edi is resetted
CMP k, EBX //if multiplications/sums on this raw have been done...
JZ Loop1 //...go to next output matrix raw
INC EBX //increase mat2 columns counter
JMP Loop3
Loop3 : //determinates elements of mat3
CMP n, EDX //if the n-multiplacations/sums on first n-elements have been done...
JZ Loop2 //...skip to next n-elements
INC EDX //increase counter of the elements that will be multiplicate
JMP Stuffs //go to operations code block
Stuffs : //here code generates mat3 elements
#58 MOV SI, mat1[2 * ((EAX - 1) * 2 + (EDX - 1)] //moves to SI the [m-raws/n-clomumn] element of mat1
#59 IMUL SI, mat2[2 * ((EBX - 1) * 2 + (EDX - 1)] //multiplicates(with sign) SI and [n-raws/k-column] element of mat2
ADD DI, SI //sum the result in edi
CMP n, EDX //check the sums
JZ CopyResult //if n-sums have been done...
JMP Loop3 //...go to copy result into mat3
CopyResult :
#66 MOV mat3[4 * ((EAX - 1) * 4 + (EBX - 1))], EDI //copy value to output matrix mat3
JMP Loop3 //go to next n-elements
Finale :
}
{
unsigned int i, j, h;
printf("Output matrix:\n");
for (i = h = 0; i < m; i++) {
for (j = 0; j < k; j++, h++)
printf("%6d ", mat3[h]);
printf("\n");
}
}
在此代码中,编译器针对 mat1、mat2 和 mat3 报告两种类型的错误引用 IMUL 和 MOV。他们在这里:
- 第 58 行 - 错误 C2404 'EDX':'second operand' 中的非法寄存器
- 第 58 行 - 错误 C2430 'second operand' 中多于一个索引寄存器
第 59 行和第 66 行的错误与 EDX 和 EBX 寄存器相同。
这个算法基本好吗? (我测试了一些手动设置 索引,然后是最后一个,在调试期间很好,但我无法完全测试它)。
我认为第一个错误取决于第二个错误,但如果我 不能以这种方式使用寄存器,我该如何计算输出?
与其尝试在寻址模式 (add eax, 2
而不是 inc eax
。
此外,由于您的输出矩阵使用 32 位 int
,因此您应该进行 32 位数学计算。您在 DI 中生成一个值,然后使用第 66 行将该值加上 EDI 高半部分中的任何垃圾进行存储。
有点像
movsx esi, word ptr [rowstart + column]
/
movsx eax, word ptr [offset_in_column + row]
/
imul eax, esi
可能适用于(部分)内循环主体。我会让你在第一种寻址模式中按列递增,在第二种寻址模式中按行递增。
根据我认为您正在尝试做的事情,我认为您的算法可能是合理的。对于输出矩阵的每个元素,循环遍历一个矩阵中的一列和另一个矩阵中的一行。所以你只对输出矩阵的每个元素存储一次。不管你的循环是否真的这样做了,IDK:分支的丑陋程度让我很伤心。 (查看优化编译器输出的某个循环,然后是双重或三重嵌套循环。例如 http://gcc.godbolt.org/)。
嵌套循环的其他方法对于大型矩阵的性能可能更好或更差,但唯一真正好的方法(对于大型矩阵)涉及转置输入矩阵之一,以便您可以循环连续两个矩阵中的内存元素一次(因为转置花费 O(n^2) 时间,但加快了重复遍历转置数组的 O(n^3) 步骤,因为它提供了更多缓存命中)。
(考虑到浮点 matmul 在科学计算中的普遍性,这是一个已经被广泛研究的主题,在代码的实验性调整中投入了大量精力。请参阅 BLAS 中 DGEMM 函数的各种实现。)
谢谢大家。这是我的三重嵌套循环和矩阵乘积的最终代码。对于我测试的内容,它适用于任何矩阵和 positive/negative 值:
void main()
{
unsigned int m = 3; // numero di righe della prima matrice
unsigned int n = 2; // numero di colonne della prima matrice
unsigned int k = 4; // numero di colonne della seconda matrice
short int mat1[] = { -1,-2, 4,5, 4,-2 }; // prima matrice
short int mat2[] = { 2,0,0,0, 0,2,0,0 }; // seconda matrice
int mat3[1024]; // matrice risultato
__asm {
lea eax, mat1 //load mat1
lea edi, mat3 //load mat result
push m
Loop3 : //extern loop
lea ebx, mat2 //load here mat2 to start from the beginning when new result raw starts
xor edx, edx //sets to zero column counter set to zero when new result row starts
Loop2 : //middle loop, as long as k, mat2 columns
xor ecx, ecx //sets to zero mat1 column counter every n multiplications
Loop1 : //inner loop
call Compute //calls sub program that calulates raw/column products
inc ecx //increase column counter
cmp ecx, n //check column counter
jb Loop1 //if below loop1 again
add ebx, 2 //if equal to n, inner loop is over, move mat2 position of one position
inc edx //increase mat2 column counter
cmp edx, k //chek mat2 column counter
jb Loop2 //if below loop2 again
imul esi, n, 2 //else calculate offset to skip to new raw in mat1
add eax, esi //...skip to new mat1 raw
imul esi, k, 4 //calculate offset to skip to new raw in result matrix(mat3)
add edi, esi //...skip to new raw in mat3
dec m //a raw in mat1 has been done, decrease its counter
cmp m, 0 //check how many raws has been done
ja Loop3 //if more than zero, do extern loop again
jmp Finale //else algorithm is over
Compute : //calulates raw/column products
movsx esi, WORD PTR[eax][ecx * 2]
push edi //stores mat3 address to free edi counter
push ecx //stores the actual value of column counter to free ecx register
mov edi, k //calculates the offset in mat2...
imul edi, ecx //...
movsx ecx, WORD PTR[ebx][edi * 2] //mov the value of mat2 to ecx
imul esi, ecx //multiplicates values of mat1 ad mat2
pop ecx //set back column counter
pop edi //set back mat3 address
cmp ecx, 0 //if ecx is zero...
je First //...is the first multiplication for this result value...
add[edi][edx * 4], esi //if not the first, add the value to current position
Back :
ret //in any case, comes back to loops...
First : //...so move here the first value to which add the others
mov[edi][edx * 4], esi //moving value
jmp Back
Finale : //the end
pop m //restore the original mat1 raw value to print the result matrix below
}
//Output on video
{
unsigned int i, j, h;
printf("Product Matrix:\n");
for (i = h = 0; i < m; i++) {
for (j = 0; j < k; j++, h++)
printf("%6d ", mat3[h]);
printf("\n");
}
}
}
这只是我从 stackoverflaw 中找到的双嵌套循环开始编写的三重嵌套循环的算法:
m = raws of first matrix
k = columns of second matrix
n = column of first matrix and raws of second matrix
x=0
Loop3:
y=0
Loop2:
z=0
Loop1:
compute...
z++
if(z<n)
go to Loop1
y++
if(y < k)
go to Loop2
x++
if(x < m)
go to Loop3
Else go to the End