Veronica – GPU Recap

Posted: 4th March 2013 by Quinn Dunki in Veronica

By popular demand.

 

Housekeeping Note: I’ve added a new column on the left that groups Veronica posts together. If you want to follow this entire project, they’re all there, with the oldest at the bottom.

 

So, now that the graphics board seems to be working (a good engineer never trusts anything completely), I can finally clean up the schematics and code a bit. People have been asking to see both for quite a while now, but you would not have wanted to see the pile of elephant vomit that was my project folder for the past 4 months. It’s still not pristine, but I think it’s at least constructive to share at this point.

First things first, the updated schematic for the VGA board (and the Eagle file):

[Image: updated VGA board schematic]

Still cluttered, but you get what you pay for.

 

The primary difference from the earlier version of this board is the replacement of the 74HC573 latch with the 7200L FIFO. I do not have an updated PCB layout for this board, as I haven’t gone in and rerouted all the traces for this new version. I don’t plan to do that unless I need to re-etch this board for some reason. If you want the old BRD file as a starting point, it’s still available here.

 

Now, on to the firmware code for the GPU. It’s a single AVR assembly file, with a few parts broken out into .h files which are included for clarity. The makefile is included, which should work in your IDE of choice. It relies on GNU’s m4, so be aware of that if you plan to build this code. This code will not build as-is under AVR Studio, or other non-GNU AVR environments. If you’re on any flavor of un*x system, you’ll have m4 already, and assuming you have the GNU AVR toolchain installed, it should just build and go.

Here’s the complete code package. I’m not going to cover every line in gory detail- I refer you to the comments for that. I’ll just hit a few highlights here. Apologies for the funky indenting in the code samples below. I’ve yet to deduce the voodoo to get github-gists to indent the way I want.

The meat of the firmware is, of course, the VGA signal generator. That’s been covered in gory detail before, so I’m not going to go over that again. I will mention that I’ve changed my approach slightly. The VGA signals are still bit-banged out of PORTB, but instead of computing the various states, I now read them from a look-up table. The table looks like this:

// Frame, VSync, VBL, VRAMH
scanLines:

.byte 0,1,7,0x00    	// 00
.byte 0,1,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00		// 20
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,5,0x00

...

.byte 0,0,0,0xEF
.byte 0,0,0,0xEF
.byte 0,0,7,0x00    	// 517
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00
.byte 0,0,7,0x00		// 524
.byte 1,0,0,0x00		// End of frame marker. This does not count as a scanline

As the scanline interrupts fire, the Z register is used to walk through that table line-by-line. Each line provides all the state needed for that pass- the state of the vertical blanking, the VRAM address of the line, and so on. This saves a ton of registers and computation, and makes the code quite tidy. Rather than saving and restoring the Z-register every time the interrupt fires, I just globally reserve it. No other code may use it. I do the same for a number of other registers. The ATMega has enough registers to get away with this, and it saves a lot of cycles in the interrupt service routine (which it sorely needs). Note that I have to use the Z register here, because I’m indexing into program memory (where the scanline table lives). Only the Z register can be used for indirect addressing into program space. If this becomes a major problem, I can copy the scanline table into SRAM (using the .data segment). I currently do this with the font data, so that I can indirect into it with the X register for rendering.
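If you'd rather poke at the table walk on a desktop, here is a minimal Python sketch of the idea. It is a host-side model, not the firmware: the `ScanLine` and `TableWalker` names, the five-row stand-in table, and its VRAM values are all made up for illustration. The wrap-to-first-row behavior at the end-of-frame marker is an assumption about how the marker is meant to be used.

```python
from typing import List, NamedTuple


class ScanLine(NamedTuple):
    end_of_frame: int  # 1 only on the terminating marker row
    vsync: int
    vbl: int           # bit 2 set means the render window is open
    vram_high: int     # high byte of the VRAM address for this line


# Tiny stand-in table (hypothetical values); the real one is far longer.
TABLE: List[ScanLine] = [
    ScanLine(0, 1, 7, 0x00),
    ScanLine(0, 0, 7, 0x00),
    ScanLine(0, 0, 0, 0x10),
    ScanLine(0, 0, 0, 0x11),
    ScanLine(1, 0, 0, 0x00),  # end-of-frame marker, not a scanline
]


class TableWalker:
    """Steps through the table one row per call, like the reserved Z pointer."""

    def __init__(self, table: List[ScanLine]) -> None:
        self.table = table
        self.index = 0

    def next_line(self) -> ScanLine:
        row = self.table[self.index]
        if row.end_of_frame:
            # Assumed behavior: the marker sends us back to the first row.
            self.index = 0
            row = self.table[0]
        self.index += 1
        return row


def render_window_open(row: ScanLine) -> bool:
    """Mirror of the firmware's test on bit 2 of the VBL value."""
    return bool(row.vbl & 0b100)
```

Each call to `next_line()` stands in for one scanline interrupt. Everything the handler needs comes out of a single row, which is why nothing has to be computed per line.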

The second core component to the firmware is the command buffer handling. When the 6502 issues rendering commands (by writing to system memory address $EFFF), the VGA board intercepts those bytes and pushes them into the hardware FIFO. During vertical blanking, when the GPU isn’t doing much else, it pulls commands out of this FIFO and processes them. Here’s the code:

mainloop:

	// Wait for VBL to process commands
	sbrs	VBL,2		// Bit 2 of VBL register is our rendering window
	rjmp	mainloop

	// Poll the FIFO to see if there's a command pending
	in		accum,PINB
	sbrs	accum,fifoReady
	rjmp	mainloop

	// Set up Port D as an input
	ldi		accum,0x00
	out		DDRD,accum
	clr		accum			// Zero to PORTD leaves the internal pull-ups off
	out		PORTD,accum
	nop						// Make sure PORTD has changed direction

	// Read from the FIFO
	sbi		PORTB,vramOE	// Kick VRAM off the bus so we can read the FIFO
	cbi		PORTB,fifoRead
	nop						// Allow for FIFO data setup time
	sbi		PORTB,fifoRead
	in		regS,PIND
	cbi		PORTB,vramOE	// Give bus back to VRAM

	// See if this is a new command, or completing a previous two-byte packet
	lds		accum,CMDBUFFER
	cpi		accum,0
	brne	mainHaveParamByte
	sts		CMDBUFFER,regS

	rjmp	mainloop

mainHaveParamByte:
	sts		CMDPARAM,regS
	lds		regT,CMDBUFFER
	clr		accum
	sts		CMDBUFFER,accum

	cpi		regT,NUM_COMMANDS
	brsh	mainloop		// Illegal command, ignore it (unsigned compare, so 0x80-0xFF are rejected too)

	// Bounce off a jump table based on command value
	lsl		regT
	ldi		XL,lo8(pm(renderJumpTable))
	ldi		XH,hi8(pm(renderJumpTable))
	clr		accum
	add		XL,regT
	adc		XH,accum

	push	XL		// Using ret, trick AVR into jumping to an indirect address
	push	XH
	ret				// Note that we're not using ijmp, because we can't use Z

renderJumpTable:
	nop
	rjmp	mainloop
	rcall	renderFillScreen
	rjmp	mainloop
	rcall	renderPlotChar
	rjmp	mainloop
	rcall	renderPlotStr
	rjmp	mainloop
	rcall	renderFontColor
	rjmp	mainloop

In a nutshell, it waits for the VBL, polls the inverted Empty Flag on the FIFO (“not empty” means at least one byte is in there) and springs into action if needed. To read the command byte, it takes control of the VRAM bus (which is shared with the FIFO), reads the byte by twiddling the control lines of the FIFO, and stores the byte in internal SRAM. When two bytes have been read (forming a two-byte render command packet), the command is processed.
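The two-byte packet handling is really a tiny state machine, and it can be sketched in a few lines of Python. This is a model of the logic only. The `PacketAssembler` class and its method names are hypothetical, while `cmd_buffer` and `cmd_param` stand in for the `CMDBUFFER` and `CMDPARAM` locations in the listing above.

```python
from typing import Optional, Tuple


class PacketAssembler:
    """Collects FIFO bytes into (command, parameter) pairs."""

    def __init__(self) -> None:
        self.cmd_buffer = 0  # 0 means "no command byte pending"
        self.cmd_param = 0

    def feed(self, byte: int) -> Optional[Tuple[int, int]]:
        """Take one byte from the FIFO; return a packet once it is complete."""
        if self.cmd_buffer == 0:
            # First byte of a packet: remember it and wait for the parameter.
            self.cmd_buffer = byte
            return None
        # Second byte: this is the parameter, so the packet is complete.
        self.cmd_param = byte
        command = self.cmd_buffer
        self.cmd_buffer = 0
        return (command, self.cmd_param)
```

One consequence worth noticing: because zero doubles as the "nothing pending" marker, a command byte of zero is never treated as the start of a packet. The byte that follows it gets taken as the command instead.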

The first byte of the command packet is an index into a jump table containing pointers to the various render routines. Because the index is doubled in an 8-bit register (each table entry is two instructions), I’m ultimately limited to 127 different render commands. Surely that will be enough. As you can see, there are currently only four- filling the screen, plotting a character, changing the font color, and plotting a “string” (which just means plotting a character and advancing the cursor). One trick to note here- the ATMega has the opcode ‘ijmp’, intended for things like jump tables. However, it requires the Z register, which is reserved by the interrupt handler. Instead, I’m using a very old school trick to jump indirectly- I calculate the address I want to jump to, push it on the stack, then ‘ret’ (return from subroutine). The ‘ret’ code pulls a two-byte address off the stack and jumps to it. You can’t modify the program counter directly on AVRs, but you can easily fool it into modifying itself the way you want.
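The address arithmetic and the push-then-ret trick can be modeled on a desktop too. The Python sketch below makes some assumptions: `NUM_COMMANDS = 5` is a guess based on the five table entries shown (the real value lives in the code package), the function names are invented, and the command byte is treated as unsigned. The byte order in `ret_via_stack` simply follows the push order used in the listing (low byte first, then high).

```python
from typing import List, Optional

NUM_COMMANDS = 5  # assumption: the no-op slot plus the four commands shown


def dispatch_target(table_base: int, command: int) -> Optional[int]:
    """Word address of the table entry for a command, or None if illegal."""
    if not 0 <= command <= 0xFF:
        raise ValueError("command must fit in one byte")
    if command >= NUM_COMMANDS:
        return None  # illegal command, ignored
    # Each entry is two instruction words (a call, then a jump back),
    # so the index is doubled. The mask models the 8-bit shift.
    offset = (command << 1) & 0xFF
    return (table_base + offset) & 0xFFFF


def ret_via_stack(target: int) -> int:
    """Model pushing a 16-bit address and letting 'ret' pop it again."""
    stack: List[int] = []
    stack.append(target & 0xFF)         # push low byte
    stack.append((target >> 8) & 0xFF)  # push high byte
    high = stack.pop()                  # 'ret' takes the top byte as high
    low = stack.pop()
    return (high << 8) | low
```

For example, with a made-up table base of `0x0100`, command 2 lands on `dispatch_target(0x0100, 2)`, and feeding that result through `ret_via_stack` hands back the same address, which is the whole point of the trick.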

The rest of the code is pretty boilerplate stuff, I think. I’ll show one rendering command, just so you can see what those look like. Here’s plotting a character:

/////////////////
// Plot a single character
//
// Parameter: The ASCII value to plot
//
renderPlotChar:

	// Take control of VRAM
	EnableVRAMWrite

	lds		accum,CMDPARAM
	subi	accum,0x41			// Offset from start of ASCII
	brcc	plotCharASCII		// A character we can render?
	rjmp	plotCharDone

plotCharASCII:

	// Compute pointer to desired character
	ldi		XH,FONTADDR_H
	ldi		XL,FONTADDR_L

	ldi		regS,4				// Offset by 4 bytes per char
	mul		regS,accum
	add		XL,r0
	adc		XH,r1

	// Compute high VRAM address for Y cursor position
	lds		regT,CURSORY
	lsl		regT
	lsl		regT
	lsl		regT

	out		PORTC,regT
	ldi		regT,CHARBYTES

plotCharOuter:

	// Compute low VRAM address of X cursor position
	lds		regS,CURSORX
	lsl		regS
	lsl		regS

	// Load two rows of character from font
	ld		regU,X+

	// 0 - Render odd row, leftmost pixel
	out		PORTA,regS
	clr		accum
	sbrc	regU,7
	lds		accum,FONTCOLOR

	out		PORTD,accum
	PulseVRAMWrite

	inc		regS

	// 1
	out		PORTA,regS
	clr		accum
	sbrc	regU,6
	lds		accum,FONTCOLOR

	out		PORTD,accum
	PulseVRAMWrite

	inc		regS

	// 2
	out		PORTA,regS
	clr		accum
	sbrc	regU,5
	lds		accum,FONTCOLOR

	out		PORTD,accum
	PulseVRAMWrite

	inc		regS

	// 3
	out		PORTA,regS
	clr		accum
	sbrc	regU,4
	lds		accum,FONTCOLOR

	out		PORTD,accum
	PulseVRAMWrite

	inc		regS

	// Next line
	in		accum,PORTC
	inc		accum
	out		PORTC,accum
	subi	regS,0x4

	// 0 - Render even row, leftmost pixel
	out		PORTA,regS
	clr		accum
	sbrc	regU,3
	lds		accum,FONTCOLOR

	out		PORTD,accum
	PulseVRAMWrite

	inc		regS

	// 1
	out		PORTA,regS
	clr		accum
	sbrc	regU,2
	lds		accum,FONTCOLOR

	out		PORTD,accum
	PulseVRAMWrite

	inc		regS

	// 2
	out		PORTA,regS
	clr		accum
	sbrc	regU,1
	lds		accum,FONTCOLOR

	out		PORTD,accum
	PulseVRAMWrite

	inc		regS

	// 3
	out		PORTA,regS
	clr		accum
	sbrc	regU,0
	lds		accum,FONTCOLOR

	out		PORTD,accum
	PulseVRAMWrite

	// Next pair of lines
	in		accum,PORTC
	inc		accum
	out		PORTC,accum
	dec		regT
	breq	plotCharDone
	rjmp	plotCharOuter

plotCharDone:

	// Clean up and we're done
	DisableVRAMWrite
	ret

That’s just an unrolled loop to iterate through the bits in a character’s bitmap, and store the “font color” byte into VRAM for every set bit. This version forces a black background, but I also have a transparent version which just skips rendering for 0 bits in the character. The latter is slower, since it has a lot more branching in it. I may add the ability to set a text background color instead.
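For anyone who wants to see the bit-to-pixel mapping without reading the unrolled assembly, here is a rolled-up Python sketch of the same idea. It is only a model: `plot_char`, the list-of-lists `vram`, and the sample glyph are hypothetical. It assumes what the listing suggests, namely 4 pixels per character column, 8 rows per character row, and each font byte packing two 4-pixel rows (high nibble first).

```python
from typing import List, Sequence


def plot_char(vram: List[List[int]], glyph: Sequence[int],
              cursor_x: int, cursor_y: int, font_color: int) -> None:
    """Draw one glyph with a forced black (0) background."""
    x0 = cursor_x * 4  # two left shifts in the assembly
    y = cursor_y * 8   # three left shifts in the assembly
    for packed in glyph:
        for shift in (4, 0):  # high nibble is the first row of the pair
            nibble = (packed >> shift) & 0x0F
            for col in range(4):
                bit = (nibble >> (3 - col)) & 1  # leftmost pixel is the top bit
                vram[y][x0 + col] = font_color if bit else 0
            y += 1


# Hypothetical 4x8 glyph shaped like an 'A': four bytes, two rows per byte.
SAMPLE_GLYPH = bytes([0x69, 0x9F, 0x99, 0x00])


def make_vram(rows: int, cols: int, fill: int) -> List[List[int]]:
    """Small stand-in frame buffer filled with one value."""
    return [[fill] * cols for _ in range(rows)]
```

Plotting `SAMPLE_GLYPH` into a buffer pre-filled with some other value makes the forced background easy to see: every pixel in the character cell ends up as either the font color or zero, and pixels outside the cell are untouched. A transparent variant would skip the write when the bit is clear.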

That’s all there is to it! It’s really not a lot of code. It will be interesting to see if this system holds up to real world use as I start to build out the other areas of Veronica, and start to write real software.

  1. K. Scharf says:

    Strange to ask at this late date but have you seen the uzebox http://belogic.com/uzebox/index.asp ? This gizmo sorta kinda does what you are doing with your VGA graphics generator, but by (grossly) overclocking the AVR he is able to generate higher-rez graphics and drive an SVGA generator chip. While overclocking isn’t generally a great idea, maybe you could pump out a full 640×400 image with a faster clock rate. There is also the option of the 33mhz (and they overclock to at least 40mhz) xmegas.

    • Quinn Dunki says:

      Yep, I have indeed seen it. A number of people seemed to have had the same idea (bit-banging VGA from an AVR) around the same time, and uzebox is one of the results. LucidScience.com (which put me onto this idea) is or was also working on a game console or demo box of some sort based on the same premise.

      At some point, if I decide to step up to better graphics, I think an FPGA will be in order. This stunt with generating video signals in software is neat, but for a lot of reasons is not a great way to do things. It was (relatively) quick and easy to get going with the knowledge I had, which is the main reason it was the right choice for Veronica.

      • Ken Scharf says:

        Looking back at your Veronica project made me realize I have a similar itch to scratch, IE: to build a ‘retro’ computer from back in the days when it all began. I used to have both a KIM-1 and a homebrew 6502 machine with the ‘TIM’ monitor built on a set of OSI 400 series boards. I was one of those geeks standing in line at the MOS booth to buy a $25 6502 chip at the Atlantic City computer convention back in the mid ’70s. (Who knows, Steve Wozniak could have been in line with me.)

        Looking in my Junque box I have lots of Z80’s, a few 68000’s and some 80186’s that look tempting. The two ‘prizes’ though are an AMD AM9080A (8080 clone) and a D.E.C. T-11 chip. (The T-11 is a PDP-11 single chip processor that is a clone of the PDP-11/20).

        I could build an Altair/Imsai clone and use an ATmega to do the front panel. (Just think, the processor in the PANEL having 10X or more the power of the computer!). I’d also like to re-live my days working at DEC and build a PDP-11 around that T11 chip, again doing a front panel in software with an AVR talking to the T-11 via shared memory or a fifo chip.

  2. JCCyC says:

    Do you plan to implement some way of reading the framebuffer content?

    • Quinn Dunki says:

      I have no plans to do that at the moment. I don’t think it will be necessary for the kinds of things I want to do, and it would be pretty difficult to implement. This whole system is predicated on the notion that data transfer is one-way. That allows a lot of assumptions that simplify the firmware and hardware design.

      • JCCyC says:

        I see. So the fun resides in making code to implement “smart” commands at the AVR side. Draw lines, circles, bucket fill, patterns, scroll, bitblts, maintain sprites, do collision detection etc etc etc… all of that can be one-way.

        Ah. Apropos of absolutely nothing, I saw this and immediately thought of you: http://cdn.memegenerator.net/instances/250×250/36017046.jpg

        • Quinn Dunki says:

          Indeed. It’s basically the model of modern graphics accelerators. You push rendering commands and data (vertices, textures, etc), and the card does the rest. Modern GPUs do usually support reading pixels from the frame buffer in a limited way, but it’s always very expensive and stalls the entire rendering pipeline. That feature is usually only present because the API standards (such as OpenGL) require it, and it’s quite kludgy on the hardware side. It’s sometimes useful for debugging, but is generally too slow to be of much real use. In the early days of accelerated rendering, it was sometimes used for side-buffer picking algorithms in isometric games, and as a workaround for systems that didn’t support render-to-texture for things like reflection mapping. Nowadays, there are better ways to do both, so reading pixels from the frame buffer is pretty much a no-no.

          • JCCyC says:

            > That feature is usually only present because the API standards (such as OpenGL) require it

            …and because people like to be able to do a Print Screen.

            • Quinn Dunki says:

              Yes, although generally you want to do your rendering to an offscreen buffer in that case. Screenshots (for example, for press) are usually rendered at resolutions higher than the frame buffer can hold anyway. There will usually be a 4x or 8x scale applied, and the screen will be rendered in tiles and composited offline in a very large image suitable for magazine or poster printing. Aside from debugging the rendering pipeline itself, there’s not much reason to take screenshots directly from the frame buffer.

  3. kscharf says:

    While you’re working on your GPU command set and fonts here is an idea … variable width fonts. It seems that most characters can be rendered in a puny 3 pixel wide font, but some (N,M,W) look UGLY. OTOH an ‘i’ can get by in only ONE pixel wide. So if you make your font table a bit larger (to have pixel width per character info) you could implement a variable width font set which on the average might be something like 3.2 pixels per char wide over typical text. I used this technique when I developed the graphical text driver for Niles Audio’s Iremote which had a 160 pixel wide screen. Normally it would allow 20 characters per line (8×8 cell), but the variable width technique allowed something closer to 30.

  4. ryemac3 says:

    This project is truly amazing. It’s just such a shame that you sealed it up in such a box! I’d put it in an acrylic cube so that the world can see the amount of effort that was involved. Sometimes, it’s what’s on the inside that counts too!

    • Quinn Dunki says:

      Well, I wanted to go for a retro feel. However, the guts do pull out and the machine can be run with everything sitting outside (the cables to the control panel are all long). If I was going to display it somewhere at a conference or something, that might be what I’d do. Still, I like your acrylic cube idea. It’s not too late… hmmm….

    • JCCyC says:

      Are you kidding? That tube radio is the most awesome computer case in the history of ever. Though the transparent acrylic box idea is cool.

      And now for something completely different: Just as we were talking about the “write only” video, Ed S of the G+ Retro Computing group was talking about Tektronix graphics terminals. Maybe (a subset of) those escape sequences could be implemented in the AVR? https://plus.google.com/u/0/107049823915731374389/posts/TqATnjsXK7T

  5. Rhialto says:

    In an acrylic cube you’d get vORACnica or something like it 🙂

  6. tomballarino says:

    Hello,

    I really dig your site, especially the Veronica-stuff. I had so much fun reading your posts. In fact, I started my own site about a DIY computer project I began a while back, but never came to completion, because of the inspiration I got from you. It can be found at http://blog.ballarino.org
    The only thing that partly works is the GPU, which I dedicated my first post to.

    Cheers
    Tom