Replacing a comparison with a decrement or increment can save you at least one byte.
I came up with this size optimization while trying to reduce Snowdrop OS's 512-byte bootloader by a few tens of bytes. You might find this interesting if you like to write assembly language routines which are very small in byte size.
It is specific to the following scenario:
- the variable (in a register or in memory) is known to hold either 0 or 1
- the variable (in a register or in memory) is discardable: its value will not be needed past this comparison
Essentially:
; (assume it is known that register EAX contains 0 or 1)
DEC EAX ; compare
JZ was_one
; handle case when value was zero
was_one:
; handle case when value was one
In the example above, this optimization relies on DEC reg32, which sets the CPU's Zero flag according to its result, similarly to TEST, CMP, OR, etc.
If the value of EAX was 0, DEC will clear the Zero flag.
If the value of EAX was 1, DEC will set the Zero flag.
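The flag behaviour described above can be modelled in C to check the reasoning. This is only a sketch: zero_flag_after_dec and was_one are hypothetical helper names, not code from the bootloader, and the model covers nothing but the Zero flag of a 32-bit DEC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the Zero flag after DEC reg32:
   the flag is set exactly when the decremented result is 0. */
static bool zero_flag_after_dec(uint32_t value)
{
    uint32_t result = value - 1u;  /* wraps to 0xFFFFFFFF when value is 0 */
    return result == 0u;
}

/* Branch decision of "DEC EAX / JZ was_one".
   Only meaningful when the input is 0 or 1. */
static bool was_one(uint32_t value)
{
    assert(value == 0u || value == 1u);  /* precondition of the trick */
    return zero_flag_after_dec(value);
}
```

For an input of 2 the model also leaves the Zero flag clear, so the value would be mistaken for 0. That is why the 0-or-1 precondition matters.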
Of course, there are other ways of testing for 0 that are more general, but also take up more bytes. The examples below are 32 bit and are routinely generated by compilers:
DEC EAX ; encodes to 0x48
; my optimization, works only if EAX is 0 or 1
TEST EAX, EAX ; encodes to 0x85, 0xC0
; commonly emitted for comparison to 0
OR EAX, EAX ; encodes to 0x09, 0xC0
; commonly emitted for comparison to 0
CMP EAX, 0 ; encodes to 0x83, 0xF8, 0x00
; inefficient, but most human-readable
This can be extended to memory references as well:
DEC BYTE PTR [ESI] ; encodes to 0xFE, 0x0E
TEST BYTE PTR [ESI], 0xFF ; encodes to 0xF6, 0x06, 0xFF
CMP BYTE PTR [ESI], 0 ; encodes to 0x80, 0x3E, 0x00
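To make the size comparison concrete, the encodings listed above can be put in a small C table. This is a sketch for illustration only: the struct and the shortest helper are hypothetical names, and the byte values are the 32-bit encodings quoted in this article.

```c
#include <assert.h>
#include <stddef.h>

/* One candidate instruction and its machine-code encoding. */
struct encoding {
    const char *mnemonic;
    unsigned char bytes[3];
    size_t length;
};

/* Register forms, as listed in the article (32-bit code). */
static const struct encoding register_forms[] = {
    { "DEC EAX",       { 0x48 },             1 },
    { "TEST EAX, EAX", { 0x85, 0xC0 },       2 },
    { "OR EAX, EAX",   { 0x09, 0xC0 },       2 },
    { "CMP EAX, 0",    { 0x83, 0xF8, 0x00 }, 3 },
};

/* Memory forms, as listed in the article (32-bit code). */
static const struct encoding memory_forms[] = {
    { "DEC BYTE PTR [ESI]",    { 0xFE, 0x0E },       2 },
    { "CMP BYTE PTR [ESI], 0", { 0x80, 0x3E, 0x00 }, 3 },
};

/* Returns the entry with the fewest bytes (the first one on ties). */
static const struct encoding *shortest(const struct encoding *table, size_t count)
{
    assert(count > 0);
    const struct encoding *best = &table[0];
    for (size_t i = 1; i < count; i++) {
        if (table[i].length < best->length) {
            best = &table[i];
        }
    }
    return best;
}
```

In both tables the DEC form comes out shortest: one byte less than TEST or OR on a register, and one byte less than CMP on a memory operand.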
Symmetrically, this optimization applies to the 0, -1 case as well. Simply replace DEC with INC to achieve similar results.
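The mirrored case can be modelled in C in the same way. Again, this is a sketch under stated assumptions: zero_flag_after_inc and was_minus_one are hypothetical names, and only the Zero flag of a 32-bit INC is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the Zero flag after INC reg32. */
static bool zero_flag_after_inc(int32_t value)
{
    uint32_t result = (uint32_t)value + 1u;  /* -1 (0xFFFFFFFF) wraps to 0 */
    return result == 0u;
}

/* Branch decision of "INC reg32 / JZ was_minus_one".
   Only meaningful when the input is 0 or -1. */
static bool was_minus_one(int32_t value)
{
    assert(value == 0 || value == -1);  /* precondition of the trick */
    return zero_flag_after_inc(value);
}
```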
I then checked several C compilers to see which performed this same optimization when asked to optimize for smallest size (e.g. -Os for GCC). GCC proved to be the hardest to "beat", that is, to find a C program whose output binary would shrink if a "comparison via DEC" were used instead.
Here are my findings:
CLANG 10.0 (using flag -Oz)
When given the same input C program as for GCC below, it relies on TEST reg32, reg32, which produces larger code in 32 bit (when flag -m32 is used) than DEC reg32.
The following example produces 64 bit output code that is a candidate for my optimization:
static int staticValue1;
void test( int argument ) {
// this is known to be either 0 or 1
int zero_or_one = ( argument + staticValue1 ) % 2;
// this is the last time the value zero_or_one is needed, so this COULD
// be compared via dec to save one or more bytes, but isn't
if( zero_or_one == 0 ) {
staticValue1++;
}
}
This produces:
1 test(int): # @test(int)
2 mov eax, dword ptr [rip + staticValue1]
3 add edi, eax
4 test dil, 1
5 jne .LBB0_2
6 inc eax
7 mov dword ptr [rip + staticValue1], eax
8 .LBB0_2:
9 ret
Replacing TEST DIL, 1 with DEC RDI would save a byte, especially since the compiler already chooses not to preserve RDI in line 3, when preparing for the modulo 2 operation.
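The 0-or-1 assumption in the C program above is worth checking before relying on it. In C99 and later, the % operator truncates toward zero, so a negative sum gives a remainder of -1 rather than 1. The sketch below contrasts the signed remainder with a low-bit mask, which always yields 0 or 1. All helper names are hypothetical and not part of the original program.

```c
#include <assert.h>

/* Signed remainder: can be -1, 0 or 1 in C99 and later. */
static int signed_parity(int sum)
{
    return sum % 2;
}

/* Masking the low bit always gives 0 or 1, even for negative sums. */
static int low_bit(int sum)
{
    return (int)((unsigned)sum & 1u);
}

/* Models "DEC reg / JZ was_one": returns nonzero when the value was 0.
   Only meaningful under the 0-or-1 precondition. */
static int was_zero_via_dec(int zero_or_one)
{
    assert(zero_or_one == 0 || zero_or_one == 1);
    return (zero_or_one - 1) != 0;
}
```

A value produced by low_bit always satisfies the precondition of the DEC comparison. A value produced by signed_parity does so only when the sum is known to be non-negative.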
x64 MSVC 19.24 (using flag /O1)
For 64 bit code, the following is sufficient:
static int staticValue1;
static int staticValue2;
void test( int argument ) {
// this is known to be either 0 or 1
int zero_or_one = ( argument + staticValue1 + staticValue2 ) % 2;
// this is the last time the value zero_or_one is needed, so this COULD
// be compared via dec to save one or more bytes, but isn't
if( zero_or_one == 0 ) {
staticValue1 += argument;
}
}
It yields:
1 int staticValue1 DD 01H DUP (?) ; staticValue1
2 int staticValue2 DD 01H DUP (?) ; staticValue2
3 argument$ = 8
4 void test(int) PROC ; test, COMDAT
5 mov edx, DWORD PTR int staticValue2
6 mov r8d, DWORD PTR int staticValue1
7 add edx, r8d
8 add edx, ecx
9 test dl, 1
10 jne SHORT $LN2@test
11 add r8d, ecx
12 mov DWORD PTR int staticValue1, r8d
13 $LN2@test:
14 ret 0
15 void test(int) ENDP ; test
Line 9 could save one byte by relying on DEC EDX instead.
x86 MSVC 19.10 (using flag /O1)
When given the same input C program as for GCC below, it relies on TEST reg32, reg32, which produces larger 32 bit code than DEC reg32.
GCC 10.1 (using flag -Os)
I used the source code below to see if I could find a way to make GCC emit code larger than it would be if DEC were used to compare.
static int staticValue1;
static int staticValue2;
void test( int argument ) {
// this is known to be either 0 or 1
int zero_or_one = ( argument + staticValue1 ) % 2;
// this is slightly contrived because it forces the compiler to
// involve further registers
staticValue1 += argument;
// this is slightly contrived because:
//
// 1. it forces compiler to not optimize checking value of zero_or_one by:
// and reg32, 1 ( where reg32 == argument + staticValue1 )
// by using the modulo 2 result before the branch decision
// 2. it forces compiler to not optimize via cmovXX by introducing more
// varied operands into the rvalue
staticValue2 = zero_or_one + staticValue2 + argument;
// this is the last time the value zero_or_one is needed, so this COULD
// be compared via dec to save one or more bytes, but isn't
if( zero_or_one == 0 ) {
staticValue1 += argument;
}
}
void reference() {
test( 3 );
test( 4 ); // since the argument ends up being passed via EDI, this second call
// makes it clear that the value of EDI inside
// test( 3 ) is throwaway after the if statement
// since EDI will be set to 4 soon after, in preparation
// for invocation of test( 4 )
}
This outputs the following 64 bit code:
1 test(int):
2 mov ecx, edi
3 mov edi, DWORD PTR staticValue1[rip]
4 mov esi, 2
5 add edi, ecx
6 mov eax, edi
7 mov DWORD PTR staticValue1[rip], edi
8 cdq
9 idiv esi
10 lea eax, [rcx+rdx]
11 add DWORD PTR _ZL12staticValue2[rip], eax
12 test dil, 1
13 jne .L1
14 add edi, ecx
15 mov DWORD PTR staticValue1[rip], edi
16 .L1:
17 ret
18 reference():
19 mov edi, 3
20 call test(int)
21 mov edi, 4
22 jmp test(int)
Line 9 sets RDX = ( argument + staticValue1 ) % 2.
Line 12 relies on TEST DIL, 1 to compare (4 bytes) instead of DEC RDX (3 bytes), despite RDX not being needed after that.
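The byte counts differ from the 32-bit listing because the one-byte 0x48 form of DEC is not available in 64-bit mode, where the bytes 0x40 to 0x4F serve as REX prefixes. The sketch below tabulates the 64-bit encodings involved. The byte values were worked out by hand from the opcode maps and are best treated as an assumption to confirm with an assembler. The bytes_saved helper is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* 64-bit mode encodings (hand-assembled, verify with an assembler). */
static const unsigned char test_dil_1[] = { 0x40, 0xF6, 0xC7, 0x01 }; /* REX, F6 /0 ib  */
static const unsigned char dec_rdx[]    = { 0x48, 0xFF, 0xCA };       /* REX.W, FF /1   */
static const unsigned char dec_edx[]    = { 0xFF, 0xCA };             /* FF /1, no REX  */

/* Bytes saved when an encoding of `before` bytes is replaced by one of `after` bytes. */
static size_t bytes_saved(size_t before, size_t after)
{
    assert(before >= after);
    return before - after;
}
```

Under these encodings DEC RDX saves one byte over TEST DIL, 1. Since the remainder fits in 32 bits, DEC EDX would do the same job without the REX.W prefix and save two.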
When the flag -m32 was specified (force 32 bit), I used the following:
static int staticValue1;
static int staticValue2;
void test( int argument ) {
// this is known to be either 0 or 1
int zero_or_one = ( argument + staticValue1 + staticValue2 ) % 2;
// this is slightly contrived because:
//
// 1. it forces compiler to not optimize checking value of zero_or_one by:
// and reg32, 1 ( where reg32 == argument + staticValue1 )
// by using the modulo 2 result before the branch decision
// 2. it forces compiler to not optimize via cmovXX by introducing more
// varied operands into the rvalue
staticValue2 = zero_or_one + staticValue2 + argument;
// this is the last time the value zero_or_one is needed, so this COULD
// be compared via dec to save one or more bytes, but isn't
if( zero_or_one == 0 ) {
staticValue1 += argument;
}
}
GCC emitted the following 32 bit code:
1 test(int):
2 push ebp
3 mov ebp, esp
4 push edi
5 mov edi, 2
6 push esi
7 mov esi, DWORD PTR staticValue2
8 push ebx
9 mov ebx, DWORD PTR [ebp+8]
10 add ebx, DWORD PTR staticValue1
11 lea ecx, [ebx+esi]
12 mov eax, ecx
13 cdq
14 idiv edi
15 add edx, esi
16 add edx, DWORD PTR [ebp+8]
17 and cl, 1
18 mov DWORD PTR staticValue2, edx
19 jne .L1
20 mov DWORD PTR staticValue1, ebx
21 .L1:
22 pop ebx
23 pop esi
24 pop edi
25 pop ebp
26 ret
Line 17 compares via AND CL, 1 (3 bytes) instead of DEC ECX (1 byte).
Conclusion
With optimizing compilers getting increasingly better, many assembly language-themed articles on the web never forget to mention that "it's probably best left to compilers."
I think this is unnecessarily discouraging. The kind of insight one gets from writing even small programs in assembler is unique. It's worth trying out, at least with reduced scope. You might find out that you like it so much that you adopt it for larger projects. It might also draw you into other interesting worlds, like reverse engineering, operating system development, hardware interfacing, etc.