The Assembler
My friend Chris and I recently finished our USB project. I was trying to think of a good way to present this in a post, and decided to highlight one small part of the project: the assembly that drives the lowest layers of the software stack.
Old School
Last semester I had hacked up the assembly for our project without much foresight or care for elegance. We were more preoccupied with trying to grasp the 650-page spec that is USB, and beautiful code was the least of our worries.
Though the code didn’t look good, it still had to be good, and here’s what it had to accomplish:
- Output one bit from a buffer in memory every 10 cycles (no more, no less).
- Keep track of the number of bits sent and stop when that is equal to a given bit count.
- Every eight bits sent, load another byte from memory.
- Raise and lower the enable line so that the Mega32 can drive the bus when transmitting and read the bus when receiving.
This becomes a tall task when you only have 9 cycles to work with (one cycle is needed to output on the I/O pins). Let’s look at what the assembly I wrote last semester for transmitting a packet looks like:
#define SIE_TOKEN_BIT
mov r20, r3
andi r20,0x01
add r20, r10
out %1, r20
lsr r3
subi r16, 1
brne .+4
jmp .sie_send_token_eop
/* Token: Send bit, and nop until next one */
#define SIE_TOKEN_BIT_NOP
SIE_TOKEN_BIT
NOP2
/* Token: Send bit, and buffer another byte */
#define SIE_TOKEN_BIT_BUFFER
SIE_TOKEN_BIT
ld r3, X+
The first thing to note is that all loops were unrolled in this assembly. To send a byte, we had this macro:
#define SIE_TOKEN_BYTE
SIE_TOKEN_BIT_NOP
SIE_TOKEN_BIT_NOP
SIE_TOKEN_BIT_NOP
SIE_TOKEN_BIT_NOP
SIE_TOKEN_BIT_NOP
SIE_TOKEN_BIT_NOP
SIE_TOKEN_BIT_NOP
SIE_TOKEN_BIT_BUFFER
This would literally copy and paste in the assembly above about eight times. And that wasn’t even the worst of it: because the loops were unrolled, we had to copy in enough loop iterations for the worst case scenario. For the data packet transmit code — which at most could send 103 bits — we had the byte macro copied in 13 times. This equated to about 936 lines of code — for just one small part of the SIE code. The sheer size of this hurt us; our code weighed in at about 20 KB when compiled. On a device with only 32 KB of program flash memory this becomes a bit of a problem (considering that our code was intended as a companion library to an existing user program).
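The 936 figure can be reproduced with a little arithmetic. The sketch below is only an illustration, not part of the project; the function names are made up here, and the 9-lines-per-bit figure is an assumption taken from counting the macros shown above (8 instructions in SIE_TOKEN_BIT plus one trailing NOP2 or ld line).

```c
/* Assumption: each unrolled bit expands to 9 source lines, counted from the
   macros in the post (8 instructions in SIE_TOKEN_BIT + 1 trailing line). */
enum { BITS_PER_BYTE = 8, LINES_PER_BIT = 9 };

/* Byte macros needed to cover bit_count bits, rounding up to whole bytes. */
unsigned bytes_needed(unsigned bit_count) {
    return (bit_count + BITS_PER_BYTE - 1) / BITS_PER_BYTE;
}

/* Source lines produced by unrolling enough byte macros for bit_count bits. */
unsigned unrolled_lines(unsigned bit_count) {
    return bytes_needed(bit_count) * BITS_PER_BYTE * LINES_PER_BIT;
}
```

For the 103-bit worst case, bytes_needed(103) gives 13 and unrolled_lines(103) gives 13 × 8 × 9 = 936.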
The assembly above shouldn’t be too confusing, but a few notes are in order.
- r3 is used to store the byte buffered from memory.
- r20 is used as a temporary register.
- r10 holds the value 0x05.
- %1 is a compiler directive for PORTA.
- r16 holds the bit count for the given packet.
Each line is explained in order:
- mov r20, r3
  Copies the buffer register into the temporary register.
- andi r20,0x01
  Performs an AND operation on the temporary register with a bit mask to extract the lowest bit. This is the bit that will be sent on the bus.
- add r20, r10
  Adds 5 to the bit to be sent. This differentially encodes the signal and sets the enable pin high at the same time. The pin assignments were: enable on pin 2, D+ on pin 1, and D- on pin 0. Thus if the bit in the temporary register is 1, meaning that we should be sending a differential 1, adding 5 will yield 0b00000110. The enable line is set high, as is D+. D- is low. If the bit is 0, the result is 0b00000101: enable and D- are high, and D+ is low.
- out %1, r20
  Outputs the value of the temporary register on PORTA.
- lsr r3
  Shifts the buffer register down one, getting it ready for when the next bit will be sent.
- subi r16, 1
  Decrements the bit count by one.
- brne .+4
  While the bit count is nonzero, this branch is taken and skips over the jmp that follows. When the bit count hits zero, this branch will not be taken.
- jmp .sie_send_token_eop
  If the above branch is not taken, meaning that all bits have been sent, then the code will jump to the end-of-packet handler.
Repeat this seven times, and add a load instruction on the eighth, and you have the complete workings of last semester’s assembly. It worked, true, but it was gross; and with an entire semester to rework stuff I decided to sit down and hammer out some nice code.
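The per-bit steps above can be modeled in a few lines of C. This is a rough sketch for illustration, not code from the project: the function names are hypothetical, the cycle counts are the standard AVR timings rather than anything measured, and NOP2 is assumed to expand to two single-cycle nops.

```c
#include <stdint.h>

/* Pin layout from the post: enable = bit 2, D+ = bit 1, D- = bit 0.
   0x05 is the constant the post keeps in r10 (enable high, D- high). */
#define ENABLE_PLUS_DMINUS 0x05u

/* Hypothetical model of one SIE_TOKEN_BIT step. Returns the value that
   would be written to PORTA and shifts the buffered byte down one bit. */
uint8_t old_tx_step(uint8_t *buffer) {
    uint8_t bit = (uint8_t)(*buffer & 0x01u);            /* mov + andi   */
    uint8_t port = (uint8_t)(bit + ENABLE_PLUS_DMINUS);  /* add r20, r10 */
    *buffer = (uint8_t)(*buffer >> 1);                   /* lsr r3       */
    return port;                                         /* out %1, r20  */
}

/* Cycle cost of one unrolled bit while bits remain (brne taken).
   mov, andi, add, out, lsr, subi cost 1 cycle each; a taken brne costs 2.
   The trailing slot costs 2 either way: ld X+ takes 2 cycles, and NOP2 is
   assumed to be two 1-cycle nops. */
unsigned old_bit_cycles(void) {
    return 6 * 1 + 2 + 2;
}
```

A 1 bit comes out as 0b110 (enable and D+ high) and a 0 bit as 0b101 (enable and D- high), and the cycle count lands on the required 10 per bit.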
New School
We came into this semester knowing that USB with a Mega32 is indeed possible. We also knew what USB was. We figured that a full code overhaul would be in order, and there’s no better place to start than at the bottom.
The assembly above was completely tossed. Little by little we came up with our new assembly, replete with rolled-up loops and clever hacks. These changes required some small modifications to the hardware, but nothing major.
Here’s the code that made it into our final revision:
#define TX_PACKET(label,mem_pointer,bit_count_reg)
mov r5, __zero_reg__
ldi r20, 0x01
/* buffer */
.sie_#label_tx_buffer:
ld r10, #mem_pointer+
/* bit tx */
.sie_#label_tx_bit:
lsl r10
rol r20
out %4, r20
/* completion checks */
dec #bit_count_reg
breq .sie_#label_tx_done
add r5, r3
brcs .sie_#label_tx_buffer
ldi r20, 0x01
rjmp .sie_#label_tx_bit
/* done */
.sie_#label_tx_done:
I could try to explain all the assembly here, but I’d be repeating an entire chapter of our documentation on the project. Chapter 4 of the documentation covers each line of the assembly and how it works. Check it out, or try and figure out what the assembly is doing on your own. Should you choose to do that, know that:
- The pin assignments are: enable on pins 1 & 2 (these get ORed together into one enable line) and transmit on pin 0 (this gets differentially encoded by the hardware).
- %4 is a compiler directive for the USB port.
- #label, #mem_pointer, and #bit_count_reg are parameters passed to the macro and are pasted into the macro where needed before the code is compiled.
- r3 holds the value 0x20.
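For anyone who wants to poke at the loop without an AVR on hand, here is a rough C model of it. Treat it as one reading of the listing and not as the project's documentation: the function name and the out/when arrays are invented for this sketch, the cycle counts are the standard AVR timings (1 cycle for most instructions, 2 for ld, rjmp, and a taken branch), and the model assumes a bit count of at least 1.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of TX_PACKET. Stores every value written to the port
   in out[] and the cycle at which each write completed in when[].
   Returns the number of writes. bit_count must be at least 1. */
size_t tx_packet_model(const uint8_t *mem, uint8_t bit_count,
                       uint8_t *out, unsigned *when) {
    const uint8_t r3 = 0x20;   /* per the post                     */
    uint8_t r5 = 0;            /* mov r5, __zero_reg__   (1 cycle) */
    uint8_t r20 = 0x01;        /* ldi r20, 0x01          (1 cycle) */
    unsigned cycle = 2;
    size_t n = 0;

    for (;;) {
        uint8_t r10 = *mem++;  /* ld r10, pointer+      (2 cycles) */
        cycle += 2;
        for (;;) {
            uint8_t carry = (uint8_t)(r10 >> 7);   /* lsl r10: MSB -> carry */
            r10 = (uint8_t)(r10 << 1);
            r20 = (uint8_t)((r20 << 1) | carry);   /* rol r20 */
            cycle += 3;                            /* lsl + rol + out */
            out[n] = r20;
            when[n] = cycle;
            n++;

            bit_count--;                           /* dec */
            cycle += 1;
            if (bit_count == 0) {
                return n;                          /* breq taken: done */
            }
            cycle += 1;                            /* breq not taken */

            unsigned sum = (unsigned)r5 + r3;      /* add r5, r3 */
            r5 = (uint8_t)sum;
            cycle += 1;
            if (sum > 0xFF) {                      /* brcs taken: every 8th bit */
                cycle += 2;
                break;                             /* r20 is NOT reset here */
            }
            cycle += 1;                            /* brcs not taken */
            r20 = 0x01;                            /* ldi r20, 0x01 */
            cycle += 1;
            cycle += 2;                            /* rjmp */
        }
    }
}
```

Running the model on a couple of bytes shows the two tricks. Adding 0x20 to r5 overflows on every eighth bit, so the carry flag doubles as the byte counter. And on the bit right after a reload, r20 has been rotated twice without a reset, so the enable 1 sits on pin 2 instead of pin 1, which would explain why the two pins are ORed together. In the model, consecutive writes are always 10 cycles apart on both paths.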
Explanation aside, the punchline is that nearly 1000 lines of assembly in our previous project got replaced with just 12 lines of cleverness.