Hand-optimizing x86/x64 assembly - optimize comparison to -1, 0, 1

Replacing a comparison with a decrement or increment can save you at least one byte.

I came up with this size optimization while trying to reduce Snowdrop OS's 512-byte bootloader by a few tens of bytes. You might find this interesting if you like to write assembly language routines which are very small in byte size.

It is specific to the following scenario:

variable (in register or memory) is known to have value either 0 or 1
variable (in register or memory) is discardable - its value will not be needed past this comparison

Essentially:

    ; (assume it is known that register EAX contains 0 or 1)

    DEC EAX                ; compare
    JZ was_one

    ; handle case when value was zero

was_one:
    ; handle case when value was one

In the example above, this optimization relies on DEC reg32, which sets the CPU's Zero flag based on its result, much as TEST, CMP, OR, etc. do.
If the value of EAX was 0, DEC will clear the Zero flag.
If the value of EAX was 1, DEC will set the Zero flag.
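The two cases above can be modeled in a few lines of C. This is only a sketch of the flag logic, not of the instruction itself; the function names `zf_after_dec` and `jumps_to_was_one` are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the Zero flag after `DEC reg32`: ZF is set exactly when the
   decremented result is zero. Unsigned arithmetic wraps around, like
   a 32-bit register does. */
bool zf_after_dec(uint32_t value) {
    uint32_t result = value - 1u;  /* what DEC leaves in the register */
    return result == 0u;           /* ZF = 1 only when the result is 0 */
}

/* Mirrors `DEC EAX` followed by `JZ was_one`: the jump is taken
   only when the input was 1. */
bool jumps_to_was_one(uint32_t zero_or_one) {
    return zf_after_dec(zero_or_one);
}
```

Calling `jumps_to_was_one(1)` yields true and `jumps_to_was_one(0)` yields false. Any input outside {0, 1} falls into the "was zero" path, which is why the precondition matters.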

Of course, there are other ways of testing for 0 that are more general, but also take up more bytes. The examples below are 32 bit and are routinely generated by compilers:

DEC EAX          ; encodes to 0x48
                 ; my optimization, works only if EAX is 0 or 1

TEST EAX, EAX    ; encodes to 0x85, 0xC0
                 ; commonly emitted for comparison to 0

OR EAX, EAX      ; encodes to 0x09, 0xC0
                 ; commonly emitted for comparison to 0

CMP EAX, 0       ; encodes to 0x83, 0xF8, 0x00
                 ; inefficient, but most human-readable
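To make the savings concrete, the register forms listed above can be put in a small C table and compared by length. This is a sketch; the struct and function names are hypothetical, and the bytes are copied from the listing (32-bit mode only).

```c
#include <stddef.h>
#include <stdint.h>

/* One candidate instruction and its 32-bit machine-code bytes. */
struct encoding {
    const char *mnemonic;
    uint8_t bytes[3];
    size_t length;
};

/* Entry 0 is the DEC-based comparison; the rest are the general forms. */
const struct encoding zero_tests[] = {
    { "DEC EAX",       { 0x48 },             1 },
    { "TEST EAX, EAX", { 0x85, 0xC0 },       2 },
    { "OR EAX, EAX",   { 0x09, 0xC0 },       2 },
    { "CMP EAX, 0",    { 0x83, 0xF8, 0x00 }, 3 },
};

/* Bytes saved by using DEC (entry 0) instead of entry `i`. */
size_t bytes_saved_vs(size_t i) {
    return zero_tests[i].length - zero_tests[0].length;
}
```

With this table, `bytes_saved_vs(1)` is 1 (against TEST) and `bytes_saved_vs(3)` is 2 (against CMP).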

This can be extended to memory references as well:

DEC BYTE PTR [ESI]         ; encodes to 0xFE, 0x0E

TEST BYTE PTR [ESI], 0xFF  ; encodes to 0xF6, 0x06, 0xFF

CMP BYTE PTR [ESI], 0      ; encodes to 0x80, 0x3E, 0x00

Symmetrically, this optimization applies to the 0, -1 case as well. Simply replace DEC with INC to achieve similar results.
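The symmetric case can be modeled the same way. The sketch below assumes the value is held in a 32-bit register, so -1 is the bit pattern 0xFFFFFFFF; the function name is made up for illustration. In 32-bit mode, INC EAX is also a one-byte instruction (0x40).

```c
#include <stdbool.h>
#include <stdint.h>

/* Models the Zero flag after `INC reg32`: ZF is set exactly when the
   incremented result is zero, i.e. when the input was -1 (0xFFFFFFFF). */
bool zf_after_inc(uint32_t value) {
    uint32_t result = value + 1u;  /* wraps from 0xFFFFFFFF to 0 */
    return result == 0u;
}
```

Here `zf_after_inc(0xFFFFFFFFu)` is true and `zf_after_inc(0u)` is false, so an INC followed by JZ branches only when the value was -1.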

I then checked several C compilers to see which performed this same optimization, when asked to optimize for smallest size (e.g.: -Os for GCC). GCC proved to be the hardest to "beat" - that is, find a C program which would decrease in output binary size if a "comparison via DEC" were used instead.

Here are my findings:

CLANG 10.0 (using flag -Oz)

When given the same input C program as for GCC below, it relies on TEST reg32, reg32, which in 32-bit code (when flag -m32 is used) encodes to more bytes than DEC reg32.

The following example produces 64-bit output code that is a candidate for my optimization:

static int staticValue1;

void test( int argument ) {
    // this is known to be either 0 or 1
    int zero_or_one = ( argument + staticValue1 ) % 2;

    // this is the last time the value zero_or_one is needed, so this COULD
    // be compared via dec to save one or more bytes, but isn't
    if( zero_or_one == 0 ) {
        staticValue1++;
    }
}

This produces:

1  test(int):                               # @test(int)
2          mov     eax, dword ptr [rip + staticValue1]
3          add     edi, eax
4          test    dil, 1
5          jne     .LBB0_2
6          inc     eax
7          mov     dword ptr [rip + staticValue1], eax
8  .LBB0_2:
9          ret

Replacing TEST DIL, 1 with DEC RDI would save a byte - especially since the compiler already chooses to not preserve RDI in line 3, when preparing for the modulo 2 operation.
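One caveat on the "either 0 or 1" precondition in the C source: with signed int operands, the % operator in C99 and later truncates toward zero, so a negative sum gives a remainder of -1 rather than 1. The sketch below shows the difference; both function names are hypothetical, and the masking variant is an alternative suggested here, not something the compilers above were given.

```c
/* Remainder as written in the examples: for a negative odd sum,
   C99 and later yield -1, so the result is in {-1, 0, 1}. */
int remainder_signed(int sum) {
    return sum % 2;
}

/* Hypothetical alternative: masking the low bit of the unsigned
   representation always yields 0 or 1. */
int low_bit(int sum) {
    return (int)((unsigned)sum & 1u);
}
```

For example, `remainder_signed(-3)` is -1 while `low_bit(-3)` is 1. A comparison via DEC is only safe when the range really is {0, 1}, for instance when the inputs are known to be non-negative.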

x64 MSVC 19.24 (using flag /O1)

For 64 bit code, the following is sufficient:

static int staticValue1;
static int staticValue2;

void test( int argument ) {
    // this is known to be either 0 or 1
    int zero_or_one = ( argument + staticValue1 + staticValue2 ) % 2;

    // this is the last time the value zero_or_one is needed, so this COULD
    // be compared via dec to save one or more bytes, but isn't
    if( zero_or_one == 0 ) {
        staticValue1 += argument;
    }
}

It yields:

1  int staticValue1 DD 01H DUP (?)                 ; staticValue1
2  int staticValue2 DD 01H DUP (?)                 ; staticValue2

3  argument$ = 8
4  void test(int) PROC                                  ; test, COMDAT
5          mov     edx, DWORD PTR int staticValue2
6          mov     r8d, DWORD PTR int staticValue1
7          add     edx, r8d
8          add     edx, ecx
9          test    dl, 1
10         jne     SHORT $LN2@test
11         add     r8d, ecx
12         mov     DWORD PTR int staticValue1, r8d
13 $LN2@test:
14         ret     0
15 void test(int) ENDP                                  ; test

Line 9 could save one byte by relying on DEC EDX instead.

x86 MSVC 19.10 (using flag /O1)

When given the same input C program as for GCC below, it relies on TEST reg32, reg32, which encodes to more bytes than DEC reg32 in 32-bit code.

GCC 10.1 (using flag -Os)

I used the source code below to see if I could find a way to make GCC emit code larger than it would be if DEC were used for the comparison.

static int staticValue1;
static int staticValue2;

void test( int argument ) {
    // this is known to be either 0 or 1
    int zero_or_one = ( argument + staticValue1 ) % 2;

    // this is slightly contrived because it forces the compiler to
    // involve further registers
    staticValue1 += argument;

    // this is slightly contrived because:
    //
    // 1. it forces compiler to not optimize checking value of zero_or_one by:
    //     and reg32, 1     ( where reg32 == argument + staticValue1 )
    //    by using the modulo 2 result before the branch decision
    // 2. it forces compiler to not optimize via cmovXX by introducing more
    //    varied operands into the rvalue
    staticValue2 = zero_or_one + staticValue2 + argument;

    // this is the last time the value zero_or_one is needed, so this COULD
    // be compared via dec to save one or more bytes, but isn't
    if( zero_or_one == 0 ) {
        staticValue1 += argument;
    }
}

void reference() {
    test( 3 );
    test( 4 ); // since the argument ends up being passed via EDI, this second
               // call makes it clear that the value of EDI inside
               // test( 3 ) is throwaway after the if statement,
               // since EDI will be set to 4 soon after, in preparation
               // for invocation of test( 4 )
}

This outputs the following 64 bit code:

1  test(int):
2         mov     ecx, edi
3         mov     edi, DWORD PTR staticValue1[rip]
4         mov     esi, 2
5         add     edi, ecx
6         mov     eax, edi
7         mov     DWORD PTR staticValue1[rip], edi
8         cdq
9         idiv    esi
10        lea     eax, [rcx+rdx]
11        add     DWORD PTR _ZL12staticValue2[rip], eax
12        test    dil, 1
13        jne     .L1
14        add     edi, ecx
15        mov     DWORD PTR staticValue1[rip], edi
16 .L1:
17        ret
18 reference():
19        mov     edi, 3
20        call    test(int)
21        mov     edi, 4
22        jmp     test(int)

Line 9 sets RDX = ( argument + staticValue1 ) % 2.
Line 12 relies on TEST DIL, 1 to compare (4 bytes) instead of DEC RDX (3 bytes), despite RDX not being needed after that.
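The byte counts quoted above can be checked by writing the encodings out. Note that in 64-bit mode the one-byte 0x48 form of DEC is not available, because the opcodes 0x40 to 0x4F are reused as REX prefixes; DEC uses the FF /1 form instead. The table below is a sketch in C with hypothetical names. The encodings were assembled by hand for this note, so treat them as an assumption to verify with your assembler. The DEC EDX row is an extra for comparison and does not come from the listing.

```c
#include <stddef.h>
#include <stdint.h>

/* One instruction and its machine-code bytes in 64-bit mode. */
struct x64_encoding {
    const char *mnemonic;
    uint8_t bytes[4];
    size_t length;
};

const struct x64_encoding x64_forms[] = {
    { "TEST DIL, 1", { 0x40, 0xF6, 0xC7, 0x01 }, 4 }, /* REX needed for DIL */
    { "DEC RDX",     { 0x48, 0xFF, 0xCA },       3 }, /* REX.W + FF /1      */
    { "DEC EDX",     { 0xFF, 0xCA },             2 }, /* FF /1, no prefix   */
};

/* Bytes saved by using entry `with` instead of entry `instead_of`. */
size_t x64_bytes_saved(size_t instead_of, size_t with) {
    return x64_forms[instead_of].length - x64_forms[with].length;
}
```

Under these assumptions `x64_bytes_saved(0, 1)` is 1, matching the 4-byte versus 3-byte figures above, and the 32-bit-operand DEC would save one byte more.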

When the flag -m32 was specified (force 32 bit), I used the following:

static int staticValue1;
static int staticValue2;

void test( int argument ) {
    // this is known to be either 0 or 1
    int zero_or_one = ( argument + staticValue1 + staticValue2 ) % 2;

    // this is slightly contrived because:
    //
    // 1. it forces compiler to not optimize checking value of zero_or_one by:
    //     and reg32, 1     ( where reg32 == argument + staticValue1 )
    //    by using the modulo 2 result before the branch decision
    // 2. it forces compiler to not optimize via cmovXX by introducing more
    //    varied operands into the rvalue
    staticValue2 = zero_or_one + staticValue2 + argument;

    // this is the last time the value zero_or_one is needed, so this COULD
    // be compared via dec to save one or more bytes, but isn't
    if( zero_or_one == 0 ) {
        staticValue1 += argument;
    }
}

GCC emitted the following 32 bit code:

1  test(int):
2         push    ebp
3         mov     ebp, esp
4         push    edi
5         mov     edi, 2
6         push    esi
7         mov     esi, DWORD PTR staticValue2
8         push    ebx
9         mov     ebx, DWORD PTR [ebp+8]
10        add     ebx, DWORD PTR staticValue1
11        lea     ecx, [ebx+esi]
12        mov     eax, ecx
13        cdq
14        idiv    edi
15        add     edx, esi
16        add     edx, DWORD PTR [ebp+8]
17        and     cl, 1
18        mov     DWORD PTR staticValue2, edx
19        jne     .L1
20        mov     DWORD PTR staticValue1, ebx
21 .L1:
22        pop     ebx
23        pop     esi
24        pop     edi
25        pop     ebp
26        ret

Line 17 compares via AND CL, 1 (3 bytes) instead of DEC ECX (1 byte).

Conclusion

With optimizing compilers getting increasingly better, many assembly language-themed articles on the web never forget to mention that "it's probably best left to compilers."

I think this is unnecessarily discouraging. The kind of insight one gets from writing even small programs in assembler is unique. It's worth trying out, at least with a reduced scope. You might find that you like it so much that you adopt it for larger projects. It might also draw you into other interesting worlds, like reverse engineering, operating system development, hardware interfacing, etc.