Bare-Metal Coding: Software Optimization on the 6502 CPU

Welcome to my first blog post on Software Portability and Optimization (SPO 600)! Through this series, I’ll share my learnings and experiences as I explore the world of low-level programming and optimization. We will start by working on a 6502 emulator, which will lay the foundation for understanding more complex architectures like x86_64 and AArch64. Let’s dive into it!

We start with the 6502 because of its small instruction set; it isn’t a good idea to read 7000 pages of documentation just to get started. Interestingly, it is also the CPU family that powered the Apple II, Commodore machines, and even the Nintendo Entertainment System.

The 6502 CPU can directly address up to 64 KB of memory using 16-bit addresses. The CPU interacts with memory through various addressing modes, including immediate, zero-page, absolute, and indexed addressing.
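To make these modes concrete, here is a small Python sketch (not 6502 code) of how each one arrives at the byte it operates on. The `memory` list and the helper function names are illustrative assumptions of mine, not part of any emulator's API.

```python
# Illustrative model of 6502 operand fetching; all helper names are hypothetical.
MEMORY_SIZE = 2 ** 16            # 16-bit addresses -> 65536 bytes (64 KB)
memory = [0] * MEMORY_SIZE


def immediate(operand):
    """lda #$07: the operand itself is the value."""
    return operand & 0xFF


def zero_page(addr):
    """lda $40: a one-byte address, always inside page zero ($0000-$00ff)."""
    return memory[addr & 0xFF]


def absolute(addr):
    """lda $0200: a full two-byte address."""
    return memory[addr & 0xFFFF]


def absolute_indexed(addr, index):
    """lda $0200,y: a two-byte base address plus an index register."""
    return memory[(addr + index) & 0xFFFF]


# Put a few recognisable values in memory and read them back.
memory[0x0040] = 0x11
memory[0x0200] = 0x22
memory[0x0205] = 0x33

print(hex(immediate(0x07)))                # 0x7
print(hex(zero_page(0x40)))                # 0x11
print(hex(absolute(0x0200)))               # 0x22
print(hex(absolute_indexed(0x0200, 5)))    # 0x33
```

The zero-page form matters for optimization because its one-byte address makes the instruction both shorter and faster than the absolute form.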

We can begin writing basic programs for the 6502 using an emulator. Here is how it actually looks:

Now, we will play with the following code in the 6502 emulator.

        lda #$00    ; set a pointer in memory location $40 to point to $0200
        sta $40     ; ... low byte ($00) goes in address $40
        lda #$02
        sta $41     ; ... high byte ($02) goes into address $41
        lda #$07    ; colour number
        ldy #$00    ; set index to 0
loop:   sta ($40),y ; set pixel colour at the address (pointer)+Y
        iny         ; increment index
        bne loop    ; continue until done the page (256 pixels)
        inc $41     ; increment the page
        ldx $41     ; get the current page number
        cpx #$06    ; compare with 6
        bne loop    ; continue until done all pages

This code fills the screen with yellow. But we have to dig a little deeper to work out how long it takes to execute, assuming a 1 MHz clock, and how much memory it uses. By my analysis, the code takes roughly 11,300 cycles, which is about 0.0113 seconds (11.3 ms) at 1 MHz, and occupies 25 bytes.
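One way to check figures like these is to tally the cycles and bytes per instruction. The Python sketch below does that with the standard 6502 timings. It assumes that no branch crosses a page boundary (which should hold if the code is assembled at $0600) and that a branch costs 3 cycles when taken and 2 when it falls through. Treat the result as an estimate, not a measurement.

```python
# Rough cycle and byte tally for the pointer-based fill loop.
CLOCK_HZ = 1_000_000          # 1 MHz, as assumed in the text
PAGES = 4                     # $0200-$05ff
PIXELS_PER_PAGE = 256

# (mnemonic, cycles, bytes); branch cycles are for the "taken" case
setup = [
    ("lda #$00", 2, 2), ("sta $40", 3, 2),
    ("lda #$02", 2, 2), ("sta $41", 3, 2),
    ("lda #$07", 2, 2), ("ldy #$00", 2, 2),
]
inner = [("sta ($40),y", 6, 2), ("iny", 2, 1), ("bne loop", 3, 2)]
outer = [("inc $41", 5, 2), ("ldx $41", 3, 2), ("cpx #$06", 2, 2), ("bne loop", 3, 2)]


def cycles(block):
    return sum(c for _, c, _ in block)


def size(block):
    return sum(b for _, _, b in block)


# The inner bne falls through once per page (1 cycle cheaper each time);
# the outer bne falls through once, at the very end.
inner_total = cycles(inner) * PIXELS_PER_PAGE * PAGES - PAGES
outer_total = cycles(outer) * PAGES - 1

total_cycles = cycles(setup) + inner_total + outer_total
total_bytes = size(setup) + size(inner) + size(outer)

print(total_cycles)                                   # 11325
print(total_bytes)                                    # 25
print(f"{total_cycles / CLOCK_HZ * 1000:.3f} ms")     # 11.325 ms
```

Almost all of the time goes to the three-instruction inner loop, so that is where any optimization has to happen.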

Optimization

How can we cut the execution time by nearly half? This is what the optimized code looks like:

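        lda #$07    ; colour number
        ldy #$00    ; set index to 0
loop:   sta $0200,y ; set pixel colour in page 2
        sta $0300,y ; ... page 3
        sta $0400,y ; ... page 4
        sta $0500,y ; ... page 5
        iny         ; increment index
        bne loop    ; continue until all 256 offsets are done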
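A quick tally suggests where the saving comes from. The Python sketch below counts cycles and bytes for the unrolled loop using the standard 6502 timings. It assumes no page-crossing branch, and 3 cycles for a taken branch versus 2 for one that falls through.

```python
# Rough cycle and byte tally for the unrolled fill loop.
CLOCK_HZ = 1_000_000          # 1 MHz, as assumed in the text
ITERATIONS = 256              # Y runs from 0 to 255, then wraps to 0

# (mnemonic, cycles, bytes); a store with absolute,Y addressing takes 5 cycles
setup = [("lda #$07", 2, 2), ("ldy #$00", 2, 2)]
loop = [
    ("sta $0200,y", 5, 3), ("sta $0300,y", 5, 3),
    ("sta $0400,y", 5, 3), ("sta $0500,y", 5, 3),
    ("iny", 2, 1), ("bne loop", 3, 2),
]

setup_cycles = sum(c for _, c, _ in setup)
loop_cycles = sum(c for _, c, _ in loop)

# The final bne falls through and costs 2 cycles instead of 3.
total_cycles = setup_cycles + loop_cycles * ITERATIONS - 1
total_bytes = sum(b for _, _, b in setup + loop)

print(total_cycles)                                   # 6403
print(total_bytes)                                    # 19
print(f"{total_cycles / CLOCK_HZ * 1000:.3f} ms")     # 6.403 ms
```

By the same counting method, the pointer-based version comes to about 11,300 cycles. The unrolled loop wins because the `iny`/`bne` overhead is paid 256 times instead of 1,024 times and each store is one cycle cheaper.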

New and different colour

To change the colour from yellow to pale orange, we need to change the value loaded into the accumulator.

lda #$07 ; old
lda #$08 ; new

To get a different colour in each quarter of the screen, we can load the colour we want before each store.

        ldy #$00
loop:   lda #$12
        sta $0200,y
        lda #$13
        sta $0300,y
        lda #$14
        sta $0400,y
        lda #$15
        sta $0500,y
        iny
        bne loop

To get a random colour for each pixel, we can read the one-byte pseudo-random number generator (PRNG) at $fe.

        ldy #$00
loop:   lda $fe
        sta $0200,y
        lda $fe
        sta $0300,y
        lda $fe
        sta $0400,y
        lda $fe
        sta $0500,y
        iny
        bne loop
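Values such as $12 or a random byte from $fe are larger than the 16 available colours, so it helps to see which colour they end up as. The Python sketch below assumes that the emulator uses only the low four bits of each display byte and that the palette follows the usual 16-colour list. Both are assumptions to check against your emulator's documentation. `colour_name` is a hypothetical helper.

```python
# Assumed 16-colour palette, indexed by the low nibble of the display byte.
PALETTE = [
    "black", "white", "red", "cyan", "purple", "green", "blue", "yellow",
    "orange", "brown", "light red", "dark grey", "grey", "light green",
    "light blue", "light grey",
]


def colour_name(value):
    """Map a display byte to a colour name, assuming only bits 0-3 matter."""
    return PALETTE[value & 0x0F]


# The four quarter colours used in the listing above.
for page, value in zip((0x0200, 0x0300, 0x0400, 0x0500), (0x12, 0x13, 0x14, 0x15)):
    print(f"${page:04x}: ${value:02x} -> {colour_name(value)}")
```

Under these assumptions the four quarters come out red, cyan, purple and green, and any byte read from $fe maps onto one of the same 16 colours.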

Challenges

We were challenged with writing a program which draws lines around the edge of the display:

  • A red line across the top

  • A green line across the bottom (which I made purple for no special reason)

  • A blue line across the right side.

  • A purple line across the left side.

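        lda #$07    ; yellow background
        ldy #$00    ; set index to 0
loop:   sta $0200,y ; fill all four pages of the display
        sta $0300,y
        sta $0400,y
        sta $0500,y
        iny
        bne loop

        lda #$02    ; red
loop2:  sta $0200,y ; top row is $0200-$021f (Y is 0 again after the fill)
        iny
        cpy #$20
        bne loop2

        lda #$05    ; green
        ldy #$e0
loop3:  sta $0500,y ; bottom row is $05e0-$05ff
        iny
        bne loop3

loop4:  lda #$04    ; purple
        sta $0200,y ; left edge: first pixel of each row (Y is 0 again here)
        sta $0300,y
        sta $0400,y
        sta $0500,y
        lda #$06    ; blue
        sta $021f,y ; right edge: last pixel of each row
        sta $031f,y
        sta $041f,y
        sta $051f,y
        tya
        clc
        adc #$20    ; move down one row (32 pixels)
        tay
        bne loop4   ; 8 rows per page, then Y wraps to 0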
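The addresses for each edge follow from the display layout. The Python sketch below assumes a 32x32 display starting at $0200 (four pages of 256 pixels, 32 pixels per row) and lists the address range and step for each edge. `pixel_address` is a hypothetical helper for illustration.

```python
BASE = 0x0200     # start of the bitmapped display
WIDTH = 32        # assumed: 32 pixels per row
HEIGHT = 32       # assumed: 32 rows (4 pages x 8 rows per page)


def pixel_address(row, col):
    """Address of the pixel at (row, col), counting from the top-left corner."""
    return BASE + row * WIDTH + col


top = [pixel_address(0, c) for c in range(WIDTH)]
bottom = [pixel_address(HEIGHT - 1, c) for c in range(WIDTH)]
left = [pixel_address(r, 0) for r in range(HEIGHT)]
right = [pixel_address(r, WIDTH - 1) for r in range(HEIGHT)]

for name, cells in (("top", top), ("bottom", bottom), ("left", left), ("right", right)):
    step = cells[1] - cells[0]
    print(f"{name:6} ${cells[0]:04x}-${cells[-1]:04x} step ${step:02x}")
```

The top and bottom rows are contiguous runs ($0200-$021f and $05e0-$05ff), so a simple `iny` loop covers them. The left and right edges advance by $20 per pixel, which is why they need an index that steps by 32 across all four pages.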

Reflection

It’s really interesting to see how colours change with a few simple assembly instructions. Personally, though, I didn’t find it simple; there was a lot of trial and error. It was worth it: it took me one step deeper into understanding how registers and memory interact and how performance depends on instruction execution time. And we learnt the 6502 assembly syntax too.