Alexander G. Dean, Benjamin Welch, Shobhit
Kanaujia
Center for Efficient, Scalable and Reliable Computing
Dept. of Electrical and Computer Engineering
Generating video signals for CRTs or LCDs in software on simple processors with minimal hardware support has been a popular project among microcontroller enthusiasts. Pioneers in the 1970s include Don Lancaster [18] and the Timex/Sinclair 1000 (Sinclair ZX81). More recently Ricard Gunee [17], Robert Lacoste [15] [16], Bruce Land [20], Alberto Riccibitti [22], Eric Smith [23] and David Thomas [21] have demonstrated systems. There are various useful websites with surveys and tutorials [17][19] [24]. The recent systems use PIC, Ubicom (formerly Scenix) and AVR processors due to their predictable performance and sufficiently fast clock rates. These developers performed manual timing analysis of code and scheduled video operations with nop (“no operation”) instructions to fill in idle time within a video line. These nops account for a large fraction of the CPU’s time (59% for STIGLitz), wasting processing capacity and leaving little time for other code to execute (0.5 MIPS for Land [20]).
Some people have manually recovered portions of this idle time for other work by manually moving instructions in an ad hoc (or “add hack”) manner. This reclamation effort is not limited to display applications, but shows up in other applications which need fine-grain concurrency, such as communications systems like modems [25] and network interfaces. These individuals usually describe their work with discouraging terms such as “huge effort” [16] and “difficult to write and debug” [25]. Our approach brings a method to this madness by using modern compiler technology to create code which executes useful system work in place of these nops.
In order to use a single processor effectively, it must be shared, for example with a scheduler, a real-time operating system, and/or interrupt service routines. Most processors are poor at simulating concurrency, as each context switch requires time- or event-based triggering as well as the actual switching. This overhead time rises as switches become more frequent, reducing processor MIPS available for application code.
We have developed methods to merge functions from multiple threads into one function, giving multithreaded concurrency on a simple uniprocessor. This software thread integration (STI) lets us use cheap, common microcontrollers instead of faster, more expensive ones. We have been building these methods into our optimizing compiler back-end (called Thrint) to help squeeze performance out of generic processors for applications with fine-grain concurrency such as controllers for video refresh and embedded networks. This article describes our STIGLitz project, in which an Atmel Atmega 128 8-bit microcontroller running at 20 MHz generates a monochrome NTSC video signal with STI and simple hardware. It provides a 256 by 240 pixel frame-buffer-based display with two bits per pixel and rendering of lines, circles, sprites and text. It reclaims the time between pixel-output operations for graphics primitive rendering and serial communication use, increasing the system’s graphics performance by a factor of 4x to 13x.

Figure 1 shows the processor utilization for our system under various conditions. The leftmost bar (No Integration) shows that before applying STI, the MCU spends most of its time refreshing the display or executing the nops between video output instructions. This leaves only 0.36 MIPS for foreground processing. The four bars to the right demonstrate processor utilization when rendering various types of lines. STI reclaims large amounts of idle time, providing 1.3 to 4.5 MIPS of line rendering and 2.1 MIPS of serial communication processing instead. Some time is wasted in the dispatcher or context switching, while some is lost because STI integration is not completely efficient when dealing with unpredictable loops. This article describes how to integrate code using STI, describes how the STIGLitz video generation platform works, and how to use it.
Monochrome NTSC (RS170) video generation requires the generation of periodic synchronization (sync) pulses. Generating these sync pulses in software is a real-time problem; a late sync pulse deteriorates the picture quality. In this section we briefly review the NTSC monochrome signal to provide the reader with a good understanding of the real-time requirements in video generation.
An NTSC frame contains 525 scan lines and is composed of interlaced raster scan lines. Two sets of raster scans (called fields) are drawn per frame; we refer to these fields as even and odd fields. The fields are made up of 262.5 lines and consist of alternate lines on the screen (hence the term interlaced). Each scan line lasts for 63.5 microseconds, of which only 52.6 microseconds is the visible section and the remaining time is occupied by synchronization.
There are two types of synchronization signals, vertical and horizontal. These reset the scan to the beginning of the next field. As shown in Figure 2, vertical synchronization sequence is composed of three pre-equalization pulses, three vertical synchronization pulses and three post-equalization pulses. The vertical synchronization pulses occur at the field rate of 60 Hz.

Figure 1.
Vertical
Synchronization Pulses in an NTSC monochrome signal.
The horizontal sync signals reset the scan to the next line in the field and occur at the line rate of 15.75 kHz. Figure 3 shows an entire scan line of video, including the three parts of the Horizontal synchronization (front and back porch and the horizontal sync) and their respective durations.

Figure 2.
Horizontal
Synchronization Pulses in an NTSC monochrome signal.

STI works by merging two functions into one implicitly multithreaded function, as shown in Figure 4 [10][4][5][6][7][8]. A software implementation of hardware typically has some idle time between time-critical groups of instructions. When used for real-time software, STI enables the placement of time-critical instructions from one function so they execute at a given time relative to the beginning of the integrated function, regardless of the control or data flow characteristics of either thread. The function with internal idle time is called the primary or guest function, while the function with which it is integrated is called the secondary or host function.
We place various restrictions on the functions to be integrated in order to simplify integration and minimize the additional processing needed. No subroutine calls are allowed in either the primary or secondary functions. The primary function can have no local variables or arguments, as these would require merging the stack frames of both functions and modifying secondary function call sites. Interrupts within the system are disabled for the duration of the integrated threads, leading to increased latency. We recently have developed methods [13] to support interrupts through polling and cut latency to acceptable levels. In general, loops must have a known number of iterations. However, in many integration cases this restriction is lifted; this is described in more detail below. The integrated line-drawing functions all contain loops with unknown iteration counts.
Several steps are required before thread integration can begin. The first involves structuring the program so that work can be accumulated for an integrated function to perform later. In STIGLitz we use various queues to accumulate graphics primitive rendering work for later processing by secondary functions. The functions to be integrated will dequeue an item of work and process it; one example in STIGLitz is STIGLitz_Service_DrawDiagonalLine.
The second step is to write the functions which will be integrated. We used C for the graphics rendering functions (as well as most of the other code), and assembly language for the video refresh function. The functions should be debugged, optimized and tested as much as practical at this point. Functions which will be integrated must share the register file, so we partition the register file to restrict each function to using particular registers (see Table 1). This simplifies later steps in integration. There are more efficient methods for allocating registers, but we choose partitioning for practicality. C code which must be integrated is compiled to assembly code using a convenient command-line switch for GCC (e.g. –ffixed-r9) which prevents the use of specific registers.
Table 1. Register Use for Integrated Functions
|
Function |
Registers Used |
||
|
|
Pointer |
Immediate |
Other |
|
PumpPixel |
r26-27 (X) |
r19 |
|
|
UART_Tx, UART_Rx |
r30-31 (Z) |
r20, r24-25 |
|
|
Non-sprite graphics rendering functions (partial context switch) |
r28-31 (Y,Z) |
r16-18, r20-25 |
r0-7 |
|
Sprite graphics rendering functions (full context switch) |
r28-31 (Y,Z) |
r16-18, r20-25 |
r0-15 |
The third step is to perform static timing analysis on the primary and secondary functions to be integrated. This identifies the best-case and worst-case start times for each instruction and region of code. This is a tedious and error-prone chore, which is why we use our optimizing compiler back-end Thrint to perform the analysis on the assembly code (e.g. thrint –Gc –hp –hC file.s). This generates a graph description file which can be processed by the programs aiSee (Windows) or XVCG (Unix) [1] to create a diagram which includes instructions and cycle counts, as shown in Figure 5 and Figure 6.

Figure 4.
Thrint
automatically forms a control dependence graph of
STIGLitz_Service_DrawDiagonalLine, performs timing analysis and then pads away
timing variations in conditionals with nop instructions (indicated with grey
nodes). Thrint creates a graph
description file which is rendered by other tools (aiSee or XVCG) to create
this image.

Thrint uses a tree-based form called a control dependence graph (CDG) to represent the control-flow of code, rather than the traditional control-flow graph. This hierarchical form makes the code much easier to comprehend, analyze and modify for STI. Control dependence regions such as conditionals and loops are represented as non-leaf nodes, and assembly language instructions are stored in leaf nodes. Conditional nesting is represented vertically while execution order is horizontal (left to right). The type of edge connecting a node to its parent determines the condition under which it will be executed (always, when the condition is true, or when it is false). Program regions such as loops and conditionals as well as single basic blocks are moved efficiently in a coarse-grain fashion, yet instructions can be scheduled on a fine-grain basis (within nodes) as needed.
The fourth preparatory step is to pad uneven duration conditionals in both the primary and secondary functions with nop instructions so the program takes the same amount of time regardless of control flow path. Blocks of nops or loops can be used for padding. Thrint can perform this automatically as well (e.g. thrint –Gc –hp –hC –i file.s file_pad.id) but now requires an integration directives file (e.g. file_pad.id, Listing 1). Thrint creates an output file file.int.s with the padded function in assembly language.
TOLERANCE 0 CY
PROCEDURE STIGLitz_Service_DrawDiagonalLine DISCRETE_PADDED
END
Listing 1. Integration directives file for padding a function.
The fifth step in preparation is to evaluate the duration of the primary and secondary functions after padding, consider the idle time available in the primary function, periods of tasks, allowable interrupt latencies, and then determine an approach for integration. In some cases a primary function may need to run so often that there is not enough time to execute all of the secondary function before the next instance of the primary. In this case we partition the secondary function into segments slightly shorter than the available time and integrate the primary function into each segment. Figure 13 shows various long functions broken into segments before integration together in STIGLitz. On the first call, the dispatcher (Figure 12) invokes the function. At the end of each segment but the last a context switch with a coroutine call saves the function’s registers. As time becomes available for executing the integrated code, the function resumes execution with another context switch and coroutine call. In this manner the integrated function will execute segment by segment. Short-latency tasks such as ISRs are handled with polling servers. These polling servers are integrated one or more times into the primary function. More details on the segmentation and polling are available [13].
CLOCK_FREQUENCY 20 MHz
PROCEDURE main INTO main ENDS
TOLERANCE 50 ns
BLOCK EnableCounters AT 0.8 us
TOLERANCE 0 CY
LOOP PixelLoop PERIOD 800 ns ITERATION_COUNT 64
BLOCK PixelOut INTO_LOOP PixelLoop FIRST_AT 1.2 us
END
Listing 2. Integration directives file with target times for primary function.
We next define target time ranges (earliest and latest) for each region of primary code to begin execution. In the case of the video generation code, the earliest and latest times for a given region are equal, as any timing variation will lead to the shifting of pixels on the display and degrade image quality. An example of an integration directives file with timing targets appears in Listing 2.
At this stage we finally perform actual integration. First we present the basic integration methods which handle primary regions not contained in loops (called single-event code). Then we introduce methods for handling primary regions within loops (called looping-event code). These techniques allow integration of arbitrary code.
STI consists of moving regions of code (which could be single instructions) from the primary thread into the secondary thread to execute at the correct times, as presented in Figure 7a. Using the time-annotated CDG of the secondary code as a map, we recursively identify the proper location to place each primary region based upon its target time range. If this range overlaps with a gap between secondary nodes, we can place the primary region between those nodes. Otherwise the target time range is contained completely within a region, and the secondary code must be modified to open up a gap which overlaps with the target time range. If the secondary region is a code node, it can simply be split at the appropriate time. If the region is a conditional, the search for the proper gap is repeated on both the true and false conditional cases.

a) General transformations for integrating code b) Transformations
for integrating loops together
If the region is a loop, integration is handled as follows, shown in Figure 7b. A loop with a known number of iterations can either be split and peeled or left intact. Consider the case of a primary region which needs to begin execution in the middle of iteration 5 of a secondary loop. Splitting and peeling the loop involve duplication; the first copy of the loop executes the first four iterations, the fifth iteration is executed by a peeled-off copy of the loop body, and the remaining iterations are executed by the second copy of the loop. The search for the proper gap is now repeated on the peeled iteration of the secondary loop. The alternative to splitting and peeling a loop is to insert the primary region under a guard region (a conditional branch) which only executes the primary region on a specific iteration. Inserting this guard region requires search for a gap slightly before the target time range. The timing analysis for this approach is more involved, as inserting the guard code disrupts the previous timing analysis. For STIGLitz we use loop splitting and peeling.
If the primary code contains events within loops, the
following approach is used. First, the execution schedules of the primary and
secondary functions are compared, accounting for the fact that the secondary
function will only execute in the idle time of the primary function. Portions
of primary loops containing events that do not overlap with secondary are
unrolled and integrated as single events. If a primary loop containing events
overlaps with a secondary loop, the overlapping portions of the loops will be
fused.
STI can be performed on code with no more than one loop per thread with an unknown iteration count. This restriction is eliminated in the case of long secondary functions and short primaries. Here the secondary is broken into segments which are shorter than the idle time and have at most one loop of unknown iteration count. Although this may seem to be a significant restriction, the line rendering functions integrated in STIGLitz have from one to four such loops yet benefited extensively from STI.
STI produces code which is more efficient than context-switching or busy-waiting. The processor spends fewer cycles performing overhead work. The price is expanded code memory. STI may duplicate code, unroll and split loops and add guard instructions. It may also duplicate both host and guest threads. The memory expansion can be reduced by trading off execution speed or timing accuracy. This flexibility allows the tailoring of STI transformations to a particular embedded system’s constraints. For more details please see [5] and related work.
We have developed our optimizing post-pass compiler Thrint in C++ over the past five years. It totals over 20,000 lines of code. In past work [5] we have used Thrint to automatically integrate video refresh code with rendering primitives for the 64-bit Alpha processor. We are in the process of retargeting the appropriate portions to the AVR architecture. Thrint parses AVR and Alpha assembly code, builds control flow and dependence graphs (for internal program representation and user visualization), predicts best- and worst-case code execution schedules, measures idle time and timing jitter, evaluates register data flow, attempts to predict loop iteration counts, plans integration, pads timing variations in conditionals, moves and replicates code regions, unrolls, splits and peels loops, verifies timing correctness of integrated code and finally regenerates a file with flat assembly code. Currently we use Thrint for timing analysis of AVR code, as the full sequence of integration steps is not yet supported for the AVR processor. Table 1 presents useful Thrint command-line switches.
Table 2. Common Thrint Command-Line Switches
|
-Gi |
Create program dependence graph for aiSee
including instructions |
|
-Gc |
Create program dependence graph for aiSee
including instructions and start cycle for each |
|
-i |
Follow directives in specified .id file |
|
-hC |
Form CDGs with improved methods |
|
-S |
Regenerate assembly code |
|
-Tg |
Form histogram of idle time |
|
-hn |
Do not use NOP loops when padding timing
variations |
|
-ha |
Skip data-flow analysis |
|
-hp |
Select options for execution in PC
environment |
|
-he |
Assume all SRAM accesses take one extra
cycle |
Thrint operates in a Unix environment; we use Cygwin on Windows XP. We encourage users to download the Thrint executable and experiment it for their timing analysis needs. Other software we use includes avr-gcc 3.2 for compiling, linking and assembling code, gnu binutils for handling binary files, AVR Studio 3.56 for downloading new code to the MCU, and aiSee for visualizing the program control dependence graphs.
STIGLitz generates an RS170 (NTSC-compatible monochrome) video signal, provides a library of graphics primitive rendering functions integrated with video refresh code and high-speed serial communications but uses very simple hardware. We use Atmega 128 [1] from the Atmel AVR architecture, which features 8 bit native word size, 32 general-purpose registers, and limited support for 16 bit operations. The processor includes 128 kilobytes of Flash program memory, 4 kilobytes of on-board data SRAM and numerous peripherals. The CPU core features a two-stage pipeline; most instructions take one cycle, but some take up to five. An Atmel STK500 evaluation board and STK501 processor expansion card are used to execute the integrated code. These are overclocked at 20 MHz. 64 kilobytes of external SRAM (IDT71124) are used as well, with a one cycle performance penalty, so loads and stores take three cycles. The C compiler used is GCC 3.2 [3]. No operating system is used, although STIGLitz does not preclude the use of one.
The video data portion of the NTSC signal is the most demanding part, as a pixel of video data must be generated every 200 ns (for 256 pixels per row). We use an external shift register to serialize a byte packed with data, reducing the processor loading. On a 20 MHz CPU this corresponds to 16 clock cycles per byte, which is too frequent for context switching or dynamic scheduling. With a 256 pixel wide, two-bit-per-pixel display, sending out a byte takes four cycles, so the idle time remaining is 9 cycles per byte. This comes to 756 cycles (37.8 us) for 64 bytes of video data, and 11.9 million cycles per second, or nearly 60% of the MCU’s time.
A digital-to-analog converter (DAC) converts the serialized
pixels from the data byte to an analog voltage for the

Figure 7.
The
video serializer board is used in conjunction with Atmel’s STK500 AVR Starter
Kit and STK501 Atmega processor daughtercard.

Every 16 cycles during the video portion of the NTSC signal,
the circuit shown in Figure
9 samples the data byte present on an MCU output port,
serializes it into four pixels (using two four-bit shift registers) and
converts it into an

A large structure called STIGLitz holds the data needed for video generation and graphics rendering. It includes the frame-buffer, pen and background colors, queues to hold deferred work, and flags to control work deferral. The graphics rendering and serial communication functions are presented in Table 3. Serial communication at up to 115.2 kbaud is handled by a USART, two circular queues, and routines for enqueueing and dequeuing data. Note that at this rate there is significant timing error due to mismatches between the microcontroller’s baud rate generator and the 20 MHz system clock. Because of this, 57.6 kbaud is the maximum practical baud rate for a 20 MHz clock.
Table 3. STIGlitz Functions
|
STIGLitz_Init |
STIGLitz_SetForeground |
STIGLitz_EraseSprite |
STIGLitz_GIFDecodeInit |
|
STIGLitz_Destroy |
STIGLitz_SetBackground |
STIGLitz_Sprite_MaskGen |
STIGLitz_UART_Init |
|
STIGLitz_DumpFrameBuffer |
STIGLitz_FillRectangle |
STIGLitz_Sprite_TextOut |
STIGLitz__UART_Dequeue_Rx |
|
STIGLitz_DrawLine |
STIGLitz_DrawSprite |
STIGLitz_Sprite_intOut |
STIGLitz_UART_Enqueue_Tx |
|
STIGLitz_DrawCircle |
STIGLitz_DrawSprite_OVR |
STIGLitz_GIFDecode |
STIGLitz_UART_Enqueue_String_Tx |
Figure 10 shows the software structure which allows the application to gather work (graphics primitives to render) to perform during video refresh. The application program specifies if rendering work is to be performed immediately or can be deferred by setting a flag (e.g. DrawLineDefFlag) in STIGLitz. For deferral, parameters for each deferred primitive are saved in the appropriate queue. DrawLine is split into five sub-functions based on line type in order to simplify integration.

Figure 10.
Timeline
shows which function is active when. ISR calls dispatcher, which then calls or
resumes an integrated function which refreshes the display and services the
UART.

A periodic timer-based interrupt triggers an ISR (T_OVF1 in ntsc20MHz.s) which generates the video signal. Its two responsibilities are to draw a full field (which takes 16.17 milliseconds and occurs 60 times per second) and to generate the equalization pulses of the vertical synchronization signal, as seen in Figure 2. To generate a scan line the ISR calls the subroutine Dispatch (in dispatcher.s). As shown in Figure 12, this function first determines if it is in the middle of executing an integrated thread. If so, it resumes that thread using a coroutine call. Two types of cocall are used; one swaps all 32 registers while the other swaps only 24 (for speed reasons). If no integrated thread is executing, the dispatcher examines the queues and selects one of the integrated functions (if data is present in the queue) or else a dedicated busy-wait refresh function. Note that these functions have also been integrated with polling server code which services USART1 using serial communication queues UART_tx_q and UART_rx_q. The chosen thread then reads video data from the frame buffer in memory and sends it out to the CRT through the DAC.

Figure 13 shows how PumpPixel, the USART polling servers, and several graphics rendering functions are integrated together to create the functions called by the dispatcher. These functions service the UART and generate video as needed. An integrated version of the function PumpPixel is called once per scan line (262 times per video refresh interrupt) to send out a row of video data from the frame buffer. At 620 cycles, the idle time within a single call to PumpPixel_UART is too short for most graphics rendering primitives. For example, rendering an 80-pixel long x-major line takes 435 us. As a result, we partition the graphics primitive functions during STI to allow partial progress.
STIGLitz generates various debugging signals on Port D to help determine processor activity. A digital oscilloscope is invaluable for monitoring these signals. Bit 4 is a one when the video ISR or a function it calls it is active. Bit 5 is a one when the dispa