
Adventures In Sorting

Sorting is one of the most fundamental classes of bread-and-butter algorithms in Computer Science. Implementing a sorting algorithm on the Chip8 poses some interesting challenges: registers are plentiful, memory access is cumbersome, and recursion is impractical. The Big-O complexity of an algorithm tells us how well it should perform for some arbitrary size N, but once we tie ourselves to a specific ISA and an upper limit on practical Ns, the story can change. Let’s look at a few approaches.

Performance comparisons will be based on a common, arbitrarily chosen shuffled array- in practice different inputs will produce different results but one trial gives us a coarse idea of how algorithms stack up:

:const SIZE   16
:const SIZE-1 15
: data  14 5 15 6 1 3 10 7 0 9 11 4 2 13 8 12

The O(n^2) sorting algorithms are often very compact and don’t have much overhead. I went with a selection sort for starters, because the linear scans it involves lend themselves well to Chip8 memory operations:

:alias  here       v1
:alias  rest       v2
:alias  min-index  v3
:alias  min-value  v4
:alias  here-value v5

: selection-sort
	here := 0
	loop
		min-index := here
		i := data
		i += here
		load v0
		min-value  := v0
		here-value := v0

		rest := here
		rest += 1
		i := data
		i += rest
		loop
			load v0
			if v0 >= min-value then jump no-better
				min-index := rest
				min-value := v0
			: no-better

			rest += 1
			if rest != SIZE then
		again

		if min-index == here then jump no-swap
			v0 := here-value
			i := data
			i += min-index
			save v0

			v0 := min-value
			i := data
			i += here
			save v0
		: no-swap

		here += 1
		if here != SIZE-1 then
	again
;

I use a few tricks here. I read the value at the current position up-front for each pass and carry it in registers. The loop which finds the minimum value takes advantage of how load increments i, removing the need to re-initialize it and add an offset on each loop iteration. The memory swap at the end of each pass is still fairly bulky, even though I’ve carefully arranged things so that both values are already in registers.
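
For comparison, here’s what the inner scan would look like without leaning on that auto-increment- a hypothetical, slower variant of the same loop that re-points i on every iteration, costing two extra instructions per element examined:

	loop
		i := data
		i += rest      # re-derive the address each time through
		load v0
		if v0 >= min-value then jump no-better-slow
			min-index := rest
			min-value := v0
		: no-better-slow

		rest += 1
		if rest != SIZE then
	again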

The algorithm weighs in at 70 bytes and takes roughly 1260 cycles to sort our 16 element array. It’ll be hard to write a general-purpose sorting routine that consumes less memory, but we can improve the speed.

The next natural thought is to employ one of the O(n * lg(n)) algorithms. Quicksort seems to be right out because it is naturally implemented recursively and Chip8 doesn’t have an argument stack. We could build one (a sketch of what that might look like is below), but the overhead doesn’t sound good.
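
Just to get a feel for that overhead, here’s a minimal sketch of what such an argument stack might look like- the names are hypothetical, ve serves as the stack pointer, and every push or pop costs a subroutine call, an i reload, and an index adjustment:

: arg-stack 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

: push-v0          # push v0 onto the argument stack
	i := arg-stack
	i += ve
	save v0
	ve += 1
;

: pop-v0           # pop the top of the argument stack into v0
	ve += -1
	i := arg-stack
	i += ve
	load v0
;

A better choice is Heapsort: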

:const LIMIT   7 # (SIZE/2)-1

:alias left-val  v0
:alias right-val v1
:alias start     v2
:alias root      v3
:alias end       v4
:alias best      v5
:alias left      v6
:alias best-val  v7
:alias root-val  v8

: heap-sort
	start := LIMIT
	end   := SIZE
	loop
		root := start
		sift-down
		start += -1
		if start != -1 then
	again

	start := SIZE-1
	loop
		# swap data[0] with data[start]:
		i := data
		load v0
		vf := v0

		i := data
		i += start
		load v0
		i := data
		save v0

		i := data
		i += start
		v0 := vf
		save v0

		start += -1
		root := 0
		end := start
		sift-down

		if start != 0 then
	again
;

: assign-best
	best     := left
	best-val := left-val
	jump found-best

: sift-down
	i := data
	i += root
	load v0
	root-val := v0

	loop
		left <<= root

		if left > end then return

		i := data
		i += left
		load v1

		best := left
		best += 1
		best-val := right-val

		if left-val > right-val then jump assign-best
		if left == end then jump assign-best
		: found-best

		if root-val >= best-val then return

		i := data
		i += root
		v0 := best-val
		save v0

		i := data
		i += best
		v0 := root-val
		save v0

		root := best
	again

This one was a little tricky to get debugged. Unfortunately, it’s worse on both dimensions we’re considering: the code takes up 130 bytes and needs 1941 cycles to sort the same 16-element array.

If we scale the array up to 64 elements, the selection sort takes about 17613 cycles while the heapsort needs only about 11825. The algorithmic complexities work out as expected, but either approach is impractically slow.

There’s a different way to approach this problem. If we have a small, fixed N we could aggressively unroll the steps in a normal sorting algorithm, producing a sequence of code snippets which look something like this:

if v0 <= v1 then jump l-0
	vf := v0
	v0 := v1
	v1 := vf
: l-0

note: this was written prior to the addition of if…begin…else…end.
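
With that feature available, the same test-and-swap can be written without an explicit label- something along these lines:

if v0 > v1 begin
	vf := v0
	v0 := v1
	v1 := vf
end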

By using a load vf instruction it would be possible to pull the full 16 elements of an array into registers at once, and then this lattice of comparisons and swaps could take care of the rest. That won’t quite work, though- comparing the magnitude of two values requires a subtraction, which will destroy the contents of vf as well as one other temporary register. As a result we’d only be able to sort 14 elements in this manner. The next lowest power of two is 8.
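
To make that register accounting concrete, here’s roughly what a magnitude test costs when spelled out by hand- a sketch, with ve standing in for whichever register we sacrifice as scratch:

:alias scratch ve   # the sacrificed temporary register

: compare-v0-v1     # leaves vf set to 1 when v0 >= v1, and 0 otherwise
	scratch := v0   # copy, so v0 itself survives the test
	scratch -= v1   # the subtraction leaves the not-borrow flag in vf
;

Both vf and the scratch register are trashed, which is why two of the sixteen registers can’t hold elements of the array being sorted.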

What is the best sequence of comparisons and swaps? The obvious approach is to unroll the steps of a Bubble Sort- for size N=8 that would mean 28 test-and-swaps. This is a loose bound, though- this site describes optimal sorting networks for N <= 16. Here’s one for N=8 that requires only 19:

[Figure: a 19-swap sorting network for N=8]

This diagram is read left-to-right. Horizontal lines are values in a given position of an array, and vertical lines are a test-and-swap between two values. Vertical lines in the same column can be carried out simultaneously or in any order; otherwise the test-and-swaps must be carried out left to right. Translating this into Octo code as in the example above, we get a subroutine called sort-8 which is 268 bytes long and takes somewhere around 100 cycles to do its thing. This is fast enough that it could actually be employed in a Chip8 game if called sparingly!
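
Since sort-8 itself never appears here in full, here’s a sketch of how its opening column might begin, following the test-and-swap template above- the label names and the particular network ordering are illustrative:

: sort-8 # sketch only- the real routine spells out all 19 test-and-swaps
	if v0 <= v1 then jump s-0
		vf := v0
		v0 := v1
		v1 := vf
	: s-0
	if v2 <= v3 then jump s-1
		vf := v2
		v2 := v3
		v3 := vf
	: s-1
	# ... the remaining test-and-swaps proceed through
	# the later columns of the network ...
;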

For a fair comparison with the previous two algorithms, we can sort 16 elements by splitting the source array into two halves, using the sorting network on each and then performing a linear merge pass:

: heap1 0 0 0 0 0 0 0 0
: heap2 0 0 0 0 0 0 0 0

:alias val1   v8
:alias val2   v9
:alias dest   va
:alias index1 vb
:alias index2 vc

: fused-sort
	i := data
	load v7
	sort-8
	val1 := v0
	i := heap1
	save v7

	i := data
	load v7
	load v7 # cheaper than adding an offset of 8
	sort-8
	val2 := v0
	i := heap2
	save v7

	index1 := 1
	index2 := 1
	dest   := 0

: merge
	if val1 > val2 then jump merge-2
	append-1
	if index1 != 9 then jump merge
	loop
		append-2
		if dest == 16 then return
	again

: merge-2
	append-2
	if index2 != 9 then jump merge
	loop
		append-1
		if dest == 16 then return
	again

: append-1
	v0 := val1
	i := data
	i += dest
	save v0
	dest += 1

	i := heap1
	i += index1
	load v0
	val1 := v0
	index1 += 1
;

: append-2
	v0 := val2
	i := data
	i += dest
	save v0
	dest += 1

	i := heap2
	i += index2
	load v0
	val2 := v0
	index2 += 1
;

This could be done in-place, but I used separate buffers for the sake of conceptual simplicity. The results are impressive: 418 bytes of code and a finished sort in about 479 cycles, completely blowing our previous attempts out of the water at the cost of some precious RAM. If this is the best balance we can strike, I hope nobody needs to write a game that requires a routine like this, though- if we use what I believe is a historically accurate clock speed for Chip8, our fastest algorithm takes a little over a second to run.

Part 2

I’ve been tinkering with array languages recently, and it led me to realize that I entirely skipped over a useful class of sorting algorithms in my earlier investigation- Counting Sorts.

There are several variations on the idea, but the classic Counting Sort does a single scan of the input array, counting instances of each key in an array of “buckets” which correspond to each value the key could take on. From this histogram a second pass can read values back out in sequence and fill the source array. It’s possible to achieve linear-time sorting in this manner, but it is only practical when the range of values to consider is fairly small.
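
For instance, given the tiny input 3 1 3 0 0, the buckets for the values 0 through 3 would end up holding the counts 2 1 0 2, and reading those counts back out in order rebuilds the array as 0 0 1 3 3.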

For purposes of illustration I’ll restrict the problem space to sorting nybbles (0-15)- as a result the cycle counts will not be directly comparable to those of the previous examples. Our first task is to declare and initialize our bucket array:

: empty
	0 0 0 0 0 0 0 0

: buckets
	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

: bucket-sort
	# zero bucket array
	i := empty
	load v7
	save v7 # unrolled loop to zero bucket array
	save v7 # and initialize loop counters for later.

	...

I’m using several tricks here- I fill the lower 8 registers from an ‘empty’ buffer, take advantage of the fact that i will automatically be incremented to the head of the buckets array, and then perform two writes to “stamp” zeroes over the entire array. I could save an instruction if my empty array were large enough to zero all 16 registers, but it’s a code size tradeoff. For zeroing a larger bucket array I might do something like this:

: empty
	0 0 0 0 0 0 0 0 0

: buckets
	0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

: bucket-sort
	i := empty
	load v8 # also initialize v8 as 0
	loop
		save v7     # zero 8 bytes
		v8 += 1
		if v8 != 32 then # 8x32 = 256
	again
	...

Either way, fairly fast. We’re able to make those autoincrements work for us. Building the histogram is much clumsier, as we have to juggle i between the two arrays:

	...
	# bucket the elements
	# v1 is data index (already 0)
	loop
		# load data
		i := data
		i += v1
		load v0
		vf := v0 # stash the key, since the bucket load will reuse v0

		# increment bucket
		i := buckets
		i += vf
		load v0
		v0 += 1
		i := buckets
		i += vf
		save v0

		v1 += 1
		if v1 != SIZE then
	again
	...

And finally, unpacking runs. This is a bit clumsy, but we’re able to build a very tight writing loop using a technique similar to our earlier memory fill. The longer the runs, the more this pays off.

	...
	# unpack bucket counts
	# v1 is temporary count
	# v2 is data   index (already 0)
	# v3 is bucket index (already 0)
	loop
		i := buckets
		i += v3
		load v0
		if v0 == 0 then jump no-write
			v1 := v0
			v0 := v3
			i  := data
			i  += v2
			v2 += v1
			loop
				save v0
				v1 += -1
				if v1 != 0 then
			again
		: no-write
		v3 += 1
		if v3 != MAX_VAL then
	again
;

Notice how, throughout, I avoid initializing index registers explicitly by taking advantage of the fact that they are zeroed as part of the initial memory fill.

Performance is impressive. The counting sort handles our example 16-element loose-packed nybble array in only 471 cycles, while our best previous (and slightly more general) attempt took 481- and it does so in only 122 bytes instead of 422. Increasing the range of sortable values rapidly increases memory requirements, but there are many possible tradeoffs and tweaks. Overall it looks like this type of algorithm provides a second viable option for running sorts on real hardware. See the entire program together.