September 17, 2023

Raspberry Pi Pico - Inline assembly and integer division

I'll work my way to integer division later in this essay. For now I'll just note that the ARM instruction set for the RP2040 (and many other ARM microprocessors) does not include a divide instruction.

The GNU C compiler (gcc) supports inline assembly within C code. As they say, there are two reasons to use inline assembly. One is code optimization, the other is access to specific instructions that the compiler is unable to generate for you. My use here is a bit of both.

First I want to explain one thing that has just become clear to me and that has greatly aided my understanding. Consider a statement like the following:

asm volatile ( "movs   %[digit], #0x7\n\t" : [digit] "=r" (d) );

I am clearly diving in without explaining many things, but stick with me. This is equivalent to the C statement "d=7;" and of course we are using inline assembly here just for the sake of illustration. The clause "[digit] "=r" (d)" is what establishes the connection between the assembly language world and the C world. The symbol "d", as in (d), is the C variable we want to act on. In the assembly world we use "digit" to refer to it. The "=r" constraint marks it as an output ("=") that lives in a register ("r").

The important idea is that %[digit] refers to a register. It is the compiler's problem to assign that register, and to know that it needs to store the value from that register into the C variable afterwards.

What I am saying might be clearer if I were talking about an input register. In that case it would be the compiler's problem to assign that register and to get the value of the C variable into that register. That is all done for you and you don't have to worry about it.
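To make the input case concrete, here is a minimal sketch with one input and one output operand. The function name add_seven is made up for illustration. The ARM branch is an assumption that has not been run on hardware here: it uses the "l" constraint, which GCC documents as r0-r7 in Thumb state, because the 16-bit adds encoding can only name low registers. On any other target the sketch falls back to plain C so it can be tried on a desktop.

```c
/* Hypothetical example: returns x + 7.
   On an ARM build the addition is done with inline assembly,
   elsewhere with plain C so the sketch can be compiled anywhere. */
static int add_seven(int x)
{
    int result;
#if defined(__arm__)
    asm volatile (
        "adds   %[out], %[in], #7\n\t"
        : [out] "=l" (result)   /* output: compiler picks the register and stores it to result */
        : [in]  "l"  (x)        /* input: compiler loads x into a register before the asm runs */
        : "cc"                  /* adds updates the condition flags */
    );
#else
    result = x + 7;             /* fallback for non-ARM targets */
#endif
    return result;
}
```

Inside the assembly text nothing says which registers are involved. %[in] and %[out] are placeholders, and the compiler fills in whatever registers it chose.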

All of this means that your job is easier than you thought it might be. You just decide what needs to be done with ARM registers coming and going, and the compiler sets them up for you coming in and takes them from you going out. Having said all of that, here is a great tutorial on gcc inline assembly for the ARM:

Integer division and the RP2040

As previously mentioned, the ARM core in the RP2040 lacks a divide instruction. The RP2040 does have divide hardware, set up in a unique way as a peripheral. I need to use integer division to support the %d aspect of a printf routine I have. Without taking special steps the compiler will generate calls to division routines in libgcc.a. Even that does not solve the problem, because apparently those use some instruction (perhaps MUL or a variant thereof) that is an undefined instruction on this particular ARM core. The official SDK has solved this in some way that I will investigate someday.
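For reference, on a core with no divide instruction the work behind / and % has to be done in software (on ARM EABI targets the helper routines have names such as __aeabi_uidivmod). The following is a generic shift-and-subtract sketch of that idea. It is not the libgcc implementation, the names soft_udivmod and udivmod_result are made up here, and the divide-by-zero behavior is an arbitrary choice.

```c
#include <stdint.h>

/* Hypothetical result type: quotient and remainder together. */
typedef struct {
    uint32_t quot;
    uint32_t rem;
} udivmod_result;

/* Unsigned 32-bit long division, one bit per loop iteration.
   Uses only shifts, compares and subtracts, so it needs no divide
   instruction. */
static udivmod_result soft_udivmod(uint32_t num, uint32_t den)
{
    udivmod_result r = { 0u, 0u };

    if (den == 0u) {            /* arbitrary policy for this sketch */
        r.rem = num;
        return r;
    }

    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next bit of the dividend. */
        r.rem = (r.rem << 1) | ((num >> bit) & 1u);
        /* If the divisor fits, subtract it and set this quotient bit. */
        if (r.rem >= den) {
            r.rem -= den;
            r.quot |= (uint32_t)1u << bit;
        }
    }
    return r;
}
```

For example, soft_udivmod(1234, 10) gives a quotient of 123 and a remainder of 4. The loop always runs 32 times, which is why an 8 cycle hardware divider is attractive.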

The original code in my printf for %d looked like this:

        do {
            *cp++ = hex_table[n%10];
            n /= 10;
        } while (n);

Note the % to get the modulo and the divide by 10. I recoded this like so:

        do {
            d = digit ( &n );
            *cp++ = hex_table[d];
        } while (n);

Here the function "digit" does the divide by 10, returning the remainder and modifying the value of n (dividing it by 10). The assembly code I first wrote and put into start.S looks like this:

#define SIO_BASE                    0xD0000000

// Divide *p by 10, where r0 holds the pointer p.
// Stores the quotient back to *p and returns the remainder in r0.
.global digit
.thumb_func
digit:
        ldr     r1,=SIO_BASE
        ldr     r2, [r0]        // dividend = *p
        str     r2, [r1,#0x60]  // unsigned dividend
        movs    r2, #10         // divide by 10
        str     r2, [r1,#0x64]  // unsigned divisor

        // Delay for 8 cycles
        // each branch gives us 2 cycles
        b 1f
1:      b 1f
1:      b 1f
1:      b 1f
1:

        // Must read quotient last
        ldr     r3, [r1,#0x74]  // remainder
        ldr     r2, [r1,#0x70]  // quotient
        str     r2, [r0]        // *p = quotient
        movs    r0, r3          // return the remainder
        bx      lr

This worked just fine, but I wanted to see if I could use inline assembly code, and I ended up with this:

// #define SIO_BASE    0xD0000000

        do {
            asm volatile (
                "ldr    r1, =0xD0000000\n\t"        // SIO_BASE
                "str    %[value], [r1,#0x60]\n\t"   // dividend
                "movs   r2, #10\n\t"
                "str    r2, [r1,#0x64]\n\t"         // divisor

                // Delay for 8 cycles
                "b 1f\n\t"
                "1: b 1f\n\t"
                "1: b 1f\n\t"
                "1: b 1f\n\t"
                "1:\n\t"

                "ldr    %[digit], [r1,#0x74]\n\t"  // remainder
                "ldr    %[value], [r1,#0x70]\n\t"  // quotient, must be read last
                : [digit] "=r" (d)
                , [value] "+r" (n)
                :
                : "r1", "r2", "cc"                  // movs changes the flags
            );

            *cp++ = hex_table[d];
        } while (n);

Note that the preprocessor does not expand macros inside string literals, so SIO_BASE is not substituted into the inline assembly and I had to hand code the address. This actually works fine and yields code as good as anyone could want.
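One way to keep the SIO_BASE macro is sketched here, under the assumption that it is defined as a bare numeric token as in start.S. The standard two-level stringification idiom turns the macro's value into text, and the compiler then joins adjacent string literals. The helper names STR, XSTR and LDR_SIO_BASE are hypothetical.

```c
#define SIO_BASE        0xD0000000

/* Two-level stringification: XSTR expands its argument first,
   then STR turns the result into a string literal. */
#define STR(x)          #x
#define XSTR(x)         STR(x)

/* Adjacent string literals are concatenated, so this expands to
   "ldr r1, =0xD0000000\n\t". */
#define LDR_SIO_BASE    "ldr r1, =" XSTR(SIO_BASE) "\n\t"

/* Kept in an array only so the resulting text can be inspected. */
static const char ldr_sio_base[] = LDR_SIO_BASE;
```

Inside the asm statement, LDR_SIO_BASE would simply take the place of the hand-coded first line. This breaks if SIO_BASE is ever defined with a cast or parentheses, because those would be copied into the assembly text as well. Passing the address as an ordinary input operand avoids that.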

Generalizing this (perhaps with an inline function) for integer division would not be hard. I have yet to investigate what gcc optimization would do with this. The generated code from objdump is as follows:

100002ba:       49a2            ldr     r1, [pc, #648]
100002bc:       660b            str     r3, [r1, #96]   @ 0x60
100002be:       220a            movs    r2, #10
100002c0:       664a            str     r2, [r1, #100]  @ 0x64
100002c2:       e7ff            b.n     100002c4 
100002c4:       e7ff            b.n     100002c6 
100002c6:       e7ff            b.n     100002c8 
100002c8:       e7ff            b.n     100002ca 
100002ca:       6f48            ldr     r0, [r1, #116]  @ 0x74
100002cc:       6f0b            ldr     r3, [r1, #112]  @ 0x70

Here we see that the compiler selected r3 to hold the value of "n" both coming in and going out. The compiler selected r0 to hold the value of "d" going out.
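The generalization to an inline function mentioned above might look something like the following sketch. Everything here is an assumption rather than tested RP2040 code: the names hw_udivmod and next_digit are made up, the __ARM_ARCH_6M__ check is a guess at a suitable way to detect a Cortex-M0+ build, and the SIO base address is passed as an input operand so the compiler loads it. On any other target the function falls back to the C operators, which lets the surrounding logic be exercised on a desktop.

```c
#include <stdint.h>

#define SIO_BASE 0xD0000000u

/* Hypothetical helper: unsigned divide using the RP2040 divider.
   Returns the quotient and stores the remainder through *remainder.
   The divisor must not be zero in the fallback path. */
static inline uint32_t hw_udivmod(uint32_t dividend, uint32_t divisor,
                                  uint32_t *remainder)
{
#if defined(__ARM_ARCH_6M__)
    uint32_t quot, rem;
    asm volatile (
        "str    %[num], [%[sio], #0x60]\n\t"   /* unsigned dividend */
        "str    %[den], [%[sio], #0x64]\n\t"   /* unsigned divisor */
        /* Delay for 8 cycles, 2 per taken branch */
        "b 1f\n\t"
        "1: b 1f\n\t"
        "1: b 1f\n\t"
        "1: b 1f\n\t"
        "1:\n\t"
        "ldr    %[rem], [%[sio], #0x74]\n\t"   /* remainder */
        "ldr    %[quot], [%[sio], #0x70]\n\t"  /* quotient, read last */
        /* "&" keeps the outputs out of the registers used for inputs,
           since rem is written while sio is still needed. */
        : [quot] "=&l" (quot), [rem] "=&l" (rem)
        : [num] "l" (dividend), [den] "l" (divisor), [sio] "l" (SIO_BASE)
    );
    *remainder = rem;
    return quot;
#else
    *remainder = dividend % divisor;            /* desktop fallback */
    return dividend / divisor;
#endif
}

/* Hypothetical stand-in for digit(): divides *n by 10 in place
   and returns the remainder. */
static inline uint32_t next_digit(uint32_t *n)
{
    uint32_t rem;
    *n = hw_udivmod(*n, 10u, &rem);
    return rem;
}
```

With this, the printf loop body would become "d = next_digit ( &n );" once more. One thing to keep in mind: the divider holds state between the writes and the reads, so an interrupt handler that also divides could disturb a calculation in progress.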

