STIGLitz: Efficiently Generating Video with an 8 bit MCU using Software Thread Integration


 

Alexander G. Dean, Benjamin Welch, Shobhit Kanaujia



Center for Efficient, Scalable and Reliable Computing

Dept. of Electrical and Computer Engineering

North Carolina State University

Raleigh, NC 27695-7256




 

1.     INTRODUCTION

1.1     Background

Generating video signals for CRTs or LCDs in software on simple processors with minimal hardware support has been a popular project among microcontroller enthusiasts.  Pioneers in the 1970s include Don Lancaster ‎[18] and the Timex/Sinclair 1000 (Sinclair ZX81). More recently Ricard Gunee ‎[17], Robert Lacoste ‎[15] ‎[16], Bruce Land ‎[20], Alberto Riccibitti ‎[22], Eric Smith ‎[23] and David Thomas ‎[21] have demonstrated systems. There are various useful websites with surveys and tutorials ‎[17]‎[19] ‎[24]. The recent systems use PIC, Ubicom (formerly Scenix) and AVR processors due to their predictable performance and sufficiently fast clock rates. These developers performed manual timing analysis of code and scheduled video operations with nop (“no operation”) instructions to fill in idle time within a video line.  These nops account for a large fraction of the CPU’s time (59% for STIGLitz), wasting processing capacity and leaving little time for other code to execute (0.5 MIPS for Land ‎[20]). 

Some people have manually recovered portions of this idle time for other work by manually moving instructions in an ad hoc (or “add hack”) manner. This reclamation effort is not limited to display applications, but shows up in other applications which need fine-grain concurrency, such as communications systems like modems ‎[25] and network interfaces. These individuals usually describe their work with discouraging terms such as “huge effort” ‎[16] and “difficult to write and debug” ‎[25]. Our approach brings a method to this madness by using modern compiler technology to create code which executes useful system work in place of these nops.

In order to use a single processor effectively, it must be shared, for example with a scheduler, a real-time operating system, and/or interrupt service routines. Most processors are poor at simulating concurrency, as each context switch requires time- or event-based triggering as well as the actual switching. This overhead time rises as switches become more frequent, reducing processor MIPS available for application code.

We have developed methods to merge functions from multiple threads into one function, giving multithreaded concurrency on a simple uniprocessor.  This software thread integration (STI) lets us use cheap, common microcontrollers instead of faster, more expensive ones. We have been building these methods into our optimizing compiler back-end (called Thrint) to help squeeze performance out of generic processors for applications with fine-grain concurrency such as controllers for video refresh and embedded networks. This article describes our STIGLitz project, in which an Atmel Atmega 128 8-bit microcontroller running at 20 MHz generates a monochrome NTSC video signal with STI and simple hardware. It provides a 256 by 240 pixel frame-buffer-based display with two bits per pixel and rendering of lines, circles, sprites and text. It reclaims the time between pixel-output operations for graphics primitive rendering and serial communication use, increasing the system’s graphics performance by a factor of 4x to 13x.

Figure 1.                       Processor utilization for video signal generation by STIGLitz.  Without integration,  12 MIPS of idle time is trapped in fragments only 9 cycles long and is unusable for other processing.  Software thread integration reclaims that idle time to improve rendering performance and service high-speed serial communication.

‎Figure 1 shows the processor utilization for our system under various conditions. The leftmost bar (No Integration) shows that before applying STI, the MCU spends most of its time refreshing the display or executing the nops between video output instructions. This leaves only 0.36 MIPS for foreground processing. The four bars to the right demonstrate processor utilization when rendering various types of lines. STI reclaims large amounts of idle time, providing 1.3 to 4.5 MIPS of line rendering and 2.1 MIPS of serial communication processing instead. Some time is wasted in the dispatcher or context switching, while some is lost because STI integration is not completely efficient when dealing with unpredictable loops. This article describes how to integrate code using STI, describes how the STIGLitz video generation platform works, and how to use it.

1.2     NTSC Video

Monochrome NTSC (RS170) video generation requires the generation of periodic synchronization (sync) pulses. Generating these sync pulses in software is a real-time problem; a late sync pulse deteriorates the picture quality. In this section we briefly review the NTSC monochrome signal to provide the reader with a good understanding of the real-time requirements in video generation.

An NTSC frame contains 525 scan lines and is composed of interlaced raster scan lines. Two sets of raster scans (called fields) are drawn per frame; we refer to these fields as even and odd fields. The fields are made up of 262.5 lines and consist of alternate lines on the screen (hence the term interlaced). Each scan line lasts for 63.5 microseconds, of which only 52.6 microseconds is the visible section and the remaining time is occupied by synchronization.

There are two types of synchronization signals, vertical and horizontal. These reset the scan to the beginning of the next field. As shown in ‎Figure 2, vertical synchronization sequence is composed of three pre-equalization pulses, three vertical synchronization pulses and three post-equalization pulses. The vertical synchronization pulses occur at the field rate of 60 Hz.

 

Figure 1.                       Vertical Synchronization Pulses in an NTSC monochrome signal.

The horizontal sync signals reset the scan to the next line in the field and occur at the line rate of 15.75 kHz. ‎Figure 3 shows an entire scan line of video, including the three parts of the Horizontal synchronization (front and back porch and the horizontal sync) and their respective durations.

Figure 2.                       Horizontal Synchronization Pulses in an NTSC monochrome signal.

 

1.      STI OVERVIEW

1.1     Software Thread Integration

 

Figure 3.                       Overview of hardware to software migration with STI. Idle time is statically filled at compile time with other useful work from the system.

STI works by merging two functions into one implicitly multithreaded function, as shown in ‎Figure 4 ‎[10]‎[4]‎[5]‎[6]‎[7]‎[8]. A software implementation of hardware typically has some idle time between time-critical groups of instructions. When used for real-time software, STI enables the placement of time-critical instructions from one function so they execute at a given time relative to the beginning of the integrated function, regardless of the control or data flow characteristics of either thread. The function with internal idle time is called the primary or guest function, while the function with which it is integrated is called the secondary or host function.

We place various restrictions on the functions to be integrated in order to simplify integration and minimize the additional processing needed. No subroutine calls are allowed in either the primary or secondary functions. The primary function can have no local variables or arguments, as these would require merging the stack frames of both functions and modifying secondary function call sites. Interrupts within the system are disabled for the duration of the integrated threads, leading to increased latency.  We recently have developed methods ‎[13] to support interrupts through polling and cut latency to acceptable levels. In general, loops must have a known number of iterations. However, in many integration cases this restriction is lifted; this is described in more detail below. The integrated line-drawing functions all contain loops with unknown iteration counts.

Several steps are required before thread integration can begin. The first involves structuring the program so that work can be accumulated for an integrated function to perform later. In STIGLitz we use various queues to accumulate graphics primitive rendering work for later processing by secondary functions. The functions to be integrated will dequeue an item of work and process it; one example in STIGLitz is STIGLitz_Service_DrawDiagonalLine.

The second step is to write the functions which will be integrated. We used C for the graphics rendering functions (as well as most of the other code), and assembly language for the video refresh function. The functions should be debugged, optimized and tested as much as practical at this point. Functions which will be integrated must share the register file, so we partition the register file to restrict each function to using particular registers (see Table 1). This simplifies later steps in integration. There are more efficient methods for allocating registers, but we choose partitioning for practicality. C code which must be integrated is compiled to assembly code using a convenient command-line switch for GCC (e.g. –ffixed-r9) which prevents the use of specific registers.

Table 1. Register Use for Integrated Functions

Function

Registers Used

 

Pointer

Immediate

Other

PumpPixel

r26-27 (X)

r19          

 

UART_Tx,  UART_Rx

r30-31 (Z)

r20, r24-25

 

Non-sprite graphics rendering functions (partial context switch)

r28-31 (Y,Z)

r16-18, r20-25

r0-7

Sprite graphics rendering functions (full context switch)

r28-31 (Y,Z)

r16-18, r20-25

r0-15

 

The third step is to perform static timing analysis on the primary and secondary functions to be integrated. This identifies the best-case and worst-case start times for each instruction and region of code. This is a tedious and error-prone chore, which is why we use our optimizing compiler back-end Thrint to perform the analysis on the assembly code (e.g. thrint –Gc –hp –hC file.s). This generates a graph description file which can be processed by the programs aiSee (Windows) or XVCG (Unix) ‎[1] to create a diagram which includes instructions and cycle counts, as shown in ‎Figure 5 and ‎Figure 6. 

 

Figure 4.                       Thrint automatically forms a control dependence graph of STIGLitz_Service_DrawDiagonalLine, performs timing analysis and then pads away timing variations in conditionals with nop instructions (indicated with grey nodes). Thrint creates a graph description file which is rendered by other tools (aiSee or XVCG) to create this image.

Figure 5.                       Detail of CDG shows best- and worst-case starting times for each instruction and region as well as code sizes.

Thrint uses a tree-based form called a control dependence graph (CDG) to represent the control-flow of code, rather than the traditional control-flow graph.  This hierarchical form makes the code much easier to comprehend, analyze and modify for STI. Control dependence regions such as conditionals and loops are represented as non-leaf nodes, and assembly language instructions are stored in leaf nodes. Conditional nesting is represented vertically while execution order is horizontal (left to right). The type of edge connecting a node to its parent determines the condition under which it will be executed (always, when the condition is true, or when it is false). Program regions such as loops and conditionals as well as single basic blocks are moved efficiently in a coarse-grain fashion, yet instructions can be scheduled on a fine-grain basis (within nodes) as needed.  

The fourth preparatory step is to pad uneven duration conditionals in both the primary and secondary functions with nop instructions so the program takes the same amount of time regardless of control flow path. Blocks of nops or loops can be used for padding. Thrint can perform this automatically as well (e.g. thrint –Gc –hp –hC –i file.s file_pad.id) but now requires an integration directives file (e.g. file_pad.id, ‎Listing 1). Thrint creates an output file file.int.s with the padded function in assembly language.

 

TOLERANCE 0 CY

PROCEDURE STIGLitz_Service_DrawDiagonalLine DISCRETE_PADDED

END

Listing 1.                      Integration directives file for padding a function.

The fifth step in preparation is to evaluate the duration of the primary and secondary functions after padding, consider the idle time available in the primary function, periods of tasks, allowable interrupt latencies, and then determine an approach for integration. In some cases a primary function may need to run so often that there is not enough time to execute all of the secondary function before the next instance of the primary.  In this case we partition the secondary function into segments slightly shorter than the available time and integrate the primary function into each segment. ‎Figure 13 shows various long functions broken into segments before integration together in STIGLitz. On the first call, the dispatcher (‎Figure 12) invokes the function.  At the end of each segment but the last a context switch with a coroutine call saves the function’s registers. As time becomes available for executing the integrated code, the function resumes execution with another context switch and coroutine call. In this manner the integrated function will execute segment by segment.  Short-latency tasks such as ISRs are handled with polling servers.  These polling servers are integrated one or more times into the primary function. More details on the segmentation and polling are available ‎[13].

CLOCK_FREQUENCY 20 MHz

PROCEDURE main INTO main ENDS

TOLERANCE 50 ns

   BLOCK EnableCounters AT 0.8 us

   TOLERANCE 0 CY

   LOOP PixelLoop PERIOD 800 ns ITERATION_COUNT 64

   BLOCK PixelOut INTO_LOOP PixelLoop FIRST_AT 1.2 us

END

Listing 2.                      Integration directives file with target times for primary function.

We next define target time ranges (earliest and latest) for each region of primary code to begin execution. In the case of the video generation code, the earliest and latest times for a given region are equal, as any timing variation will lead to the shifting of pixels on the display and degrade image quality. An example of an integration directives file with timing targets appears in ‎Listing 2.

At this stage we finally perform actual integration.  First we present the basic integration methods which handle primary regions not contained in loops (called single-event code). Then we introduce methods for handling primary regions within loops (called looping-event code). These techniques allow integration of arbitrary code.

STI consists of moving regions of code (which could be single instructions) from the primary thread into the secondary thread to execute at the correct times, as presented in ‎Figure 7a. Using the time-annotated CDG of the secondary code as a map, we recursively identify the proper location to place each primary region based upon its target time range.  If this range overlaps with a gap between secondary nodes, we can place the primary region between those nodes. Otherwise the target time range is contained completely within a region, and the secondary code must be modified to open up a gap which overlaps with the target time range.  If the secondary region is a code node, it can simply be split at the appropriate time.  If the region is a conditional, the search for the proper gap is repeated on both the true and false conditional cases.

      

a) General transformations for integrating code                           b) Transformations for integrating loops together

Figure 6.                       Code transformations for software thread integration allow functions with different control flows to be integrated.

If the region is a loop, integration is handled as follows, shown in ‎Figure 7b.  A loop with a known number of iterations can either be split and peeled or left intact. Consider the case of a primary region which needs to begin execution in the middle of iteration 5 of a secondary loop. Splitting and peeling the loop involve duplication; the first copy of the loop executes the first four iterations, the fifth iteration is executed by a peeled-off copy of the loop body, and the remaining iterations are executed by the second copy of the loop. The search for the proper gap is now repeated on the peeled iteration of the secondary loop. The alternative to splitting and peeling a loop is to insert the primary region under a guard region (a conditional branch) which only executes the primary region on a specific iteration. Inserting this guard region requires search for a gap slightly before the target time range. The timing analysis for this approach is more involved, as inserting the guard code disrupts the previous timing analysis.  For STIGLitz we use loop splitting and peeling.

If the primary code contains events within loops, the following approach is used. First, the execution schedules of the primary and secondary functions are compared, accounting for the fact that the secondary function will only execute in the idle time of the primary function. Portions of primary loops containing events that do not overlap with secondary are unrolled and integrated as single events. If a primary loop containing events overlaps with a secondary loop, the overlapping portions of the loops will be fused. Loop fusion consists of merging the two loop bodies and modifying the loop control test to repeat only when both loop-controlling conditions are true. Remaining primary or secondary loop iterations are processed by dedicated “clean-up” loops following the fused loop. An additional aspect to loop fusion involves matching the secondary thread loop iteration time to the available idle time in the primary thread loop body by unrolling the shorter loop. The resulting loop body is padded as needed to meet the timing requirements of the primary looping events.

STI can be performed on code with no more than one loop per thread with an unknown iteration count. This restriction is eliminated in the case of long secondary functions and short primaries. Here the secondary is broken into segments which are shorter than the idle time and have at most one loop of unknown iteration count. Although this may seem to be a significant restriction, the line rendering functions integrated in STIGLitz have from one to four such loops yet benefited extensively from STI.

1.2     Memory Expansion

STI produces code which is more efficient than context-switching or busy-waiting. The processor spends fewer cycles performing overhead work. The price is expanded code memory. STI may duplicate code, unroll and split loops and add guard instructions. It may also duplicate both host and guest threads. The memory expansion can be reduced by trading off execution speed or timing accuracy. This flexibility allows the tailoring of STI transformations to a particular embedded system’s constraints. For more details please see ‎[5] and related work.

1.3     Tools for Automating STI

We have developed our optimizing post-pass compiler Thrint in C++ over the past five years. It totals over 20,000 lines of code. In past work ‎[5] we have used Thrint to automatically integrate video refresh code with rendering primitives for the 64-bit Alpha processor. We are in the process of retargeting the appropriate portions to the AVR architecture. Thrint parses AVR and Alpha assembly code, builds control flow and dependence graphs (for internal program representation and user visualization), predicts best- and worst-case code execution schedules, measures idle time and timing jitter, evaluates register data flow, attempts to predict loop iteration counts, plans integration, pads timing variations in conditionals, moves and replicates code regions, unrolls, splits and peels loops, verifies timing correctness of integrated code and finally regenerates a file with flat assembly code. Currently we use Thrint for timing analysis of AVR code, as the full sequence of integration steps is not yet supported for the AVR processor. Table 1 presents useful Thrint command-line switches.

Table 2. Common Thrint Command-Line Switches

-Gi

Create program dependence graph for aiSee including instructions

-Gc

Create program dependence graph for aiSee including instructions and start cycle for each

-i

Follow directives in specified .id file

-hC

Form CDGs with improved methods

-S

Regenerate assembly code

-Tg

Form histogram of idle time

-hn

Do not use NOP loops when padding timing variations

-ha

Skip data-flow analysis

-hp

Select options for execution in PC environment

-he

Assume all SRAM accesses take one extra cycle

 

Thrint operates in a Unix environment; we use Cygwin on Windows XP. We encourage users to download the Thrint executable and experiment it for their timing analysis needs. Other software we use includes avr-gcc 3.2 for compiling, linking and assembling code, gnu binutils for handling binary files, AVR Studio 3.56 for downloading new code to the MCU, and aiSee for visualizing the program control dependence graphs.

2.      STIGLitz

2.1     System Overview

STIGLitz generates an RS170 (NTSC-compatible monochrome) video signal, provides a library of graphics primitive rendering functions integrated with video refresh code and high-speed serial communications but uses very simple hardware. We use Atmega 128 ‎[1] from the Atmel AVR architecture, which features 8 bit native word size, 32 general-purpose registers, and limited support for 16 bit operations. The processor includes 128 kilobytes of Flash program memory, 4 kilobytes of on-board data SRAM and numerous peripherals. The CPU core features a two-stage pipeline; most instructions take one cycle, but some take up to five. An Atmel STK500 evaluation board and STK501 processor expansion card are used to execute the integrated code. These are overclocked at 20 MHz. 64 kilobytes of external SRAM (IDT71124) are used as well, with a one cycle performance penalty, so loads and stores take three cycles. The C compiler used is GCC 3.2 ‎[3]. No operating system is used, although STIGLitz does not preclude the use of one.

The video data portion of the NTSC signal is the most demanding part, as a pixel of video data must be generated every 200 ns (for 256 pixels per row). We use an external shift register to serialize a byte packed with data, reducing the processor loading. On a 20 MHz CPU this corresponds to 16 clock cycles per byte, which is too frequent for context switching or dynamic scheduling. With a 256 pixel wide, two-bit-per-pixel display, sending out a byte takes four cycles, so the idle time remaining is 9 cycles per byte. This comes to 756 cycles (37.8 us) for 64 bytes of video data, and 11.9 million cycles per second, or nearly 60% of the MCU’s time.

A digital-to-analog converter (DAC) converts the serialized pixels from the data byte to an analog voltage for the NTSC output. There are additional features in a video signal (vertical sync and equalization pulses); these are generated by our software as well. Our system generates a monochrome 256 x 254 pixel image with two bits per pixel, although resolutions of up to 512 x 525 with 1 bit per pixel are possible with minor modifications.

2.2     Hardware

   

Figure 7.                       The video serializer board is used in conjunction with Atmel’s STK500 AVR Starter Kit and STK501 Atmega processor daughtercard.

Figure 8.                       Schematic diagram shows shift registers and clock dividers make up most of the video serializer board.

Every 16 cycles during the video portion of the NTSC signal, the circuit shown in ‎Figure 9 samples the data byte present on an MCU output port, serializes it into four pixels (using two four-bit shift registers) and converts it into an NTSC-compliant voltage. This hardware relieves the MCU of shifting out individual pixels, but still leaves it with the responsibility of ensuring that the video data is present on its output port at the appropriate time. The shift registers are clocked by the pixel clock divider and are loaded by the byte clock divider.  Both dividers are up-counters which can be reconfigured to adjust pixel characteristics.

2.3     Software

2.3.1     Overview and Software Architecture

Figure 9.                       Overview of software architecture. Deferrable graphics work is enqueued for later rendering during video refresh.

A large structure called STIGLitz holds the data needed for video generation and graphics rendering.  It includes the frame-buffer, pen and background colors, queues to hold deferred work, and flags to control work deferral. The graphics rendering and serial communication functions are presented in Table 3. Serial communication at up to 115.2 kbaud is handled by a USART, two circular queues, and routines for enqueueing and dequeuing data. Note that at this rate there is significant timing error due to mismatches between the microcontroller’s baud rate generator and the 20 MHz system clock. Because of this, 57.6 kbaud is the maximum practical baud rate for a 20 MHz clock.

Table 3. STIGlitz Functions

STIGLitz_Init

STIGLitz_SetForeground

STIGLitz_EraseSprite

STIGLitz_GIFDecodeInit

STIGLitz_Destroy

STIGLitz_SetBackground

STIGLitz_Sprite_MaskGen

STIGLitz_UART_Init

STIGLitz_DumpFrameBuffer

STIGLitz_FillRectangle

STIGLitz_Sprite_TextOut

STIGLitz__UART_Dequeue_Rx

STIGLitz_DrawLine

STIGLitz_DrawSprite

STIGLitz_Sprite_intOut

STIGLitz_UART_Enqueue_Tx

STIGLitz_DrawCircle

STIGLitz_DrawSprite_OVR

STIGLitz_GIFDecode

STIGLitz_UART_Enqueue_String_Tx

 

‎Figure 10 shows the software structure which allows the application to gather work (graphics primitives to render) to perform during video refresh. The application program specifies if rendering work is to be performed immediately or can be deferred by setting a flag (e.g. DrawLineDefFlag) in STIGLitz. For deferral, parameters for each deferred primitive are saved in the appropriate queue. DrawLine is split into five sub-functions based on line type in order to simplify integration.

Figure 10.                    Timeline shows which function is active when. ISR calls dispatcher, which then calls or resumes an integrated function which refreshes the display and services the UART.

 

 

Figure 11.                    The dispatcher function is called once per scan line to process work in the graphics queues or to resume execution of a previously started integrated function. If there is no work, a default function (PumpPixel_UART) is called.

A periodic timer-based interrupt triggers an ISR (T_OVF1 in ntsc20MHz.s) which generates the video signal. Its two responsibilities are to draw a full field (which takes 16.17 milliseconds and occurs 60 times per second) and to generate the equalization pulses of the vertical synchronization signal, as seen in ‎Figure 2.  To generate a scan line the ISR calls the subroutine Dispatch (in dispatcher.s).  As shown in ‎Figure 12, this function first determines if it is in the middle of executing an integrated thread.  If so, it resumes that thread using a coroutine call. Two types of cocall are used; one swaps all 32 registers while the other swaps only 24 (for speed reasons). If no integrated thread is executing, the dispatcher examines the queues and selects one of the integrated functions (if data is present in the queue) or else a dedicated busy-wait refresh function. Note that these functions have also been integrated with polling server code which services USART1 using serial communication queues UART_tx_q and UART_rx_q. The chosen thread then reads video data from the frame buffer in memory and sends it out to the CRT through the DAC.


2.3.2     Integrated Functions


Figure 12.                    Overview of which functions are integrated together to create segmented integrated functions which service the USART, refresh the display and perform graphics rendering.

‎Figure 13 shows how PumpPixel, the USART polling servers, and several graphics rendering functions are integrated together to create the functions called by the dispatcher.  These functions service the UART and generate video as needed. An integrated version of the function PumpPixel is called once per scan line (262 times per video refresh interrupt) to send out a row of video data from the frame buffer.  At 620 cycles, the idle time within a single call to PumpPixel_UART is too short for most graphics rendering primitives. For example, rendering an 80-pixel long x-major line takes 435 us. As a result, we partition the graphics primitive functions during STI to allow partial progress.

2.4     Debugging Support

STIGLitz generates various debugging signals on Port D to help determine processor activity. A digital oscilloscope is invaluable for monitoring these signals.  Bit 4 is a one when the video ISR or a function it calls it is active. Bit 5 is a one when the dispatcher is active. Bit 6 is one when PumpPixel_UART is active; this indicates there is no work in the rendering queues. Bit 7 is used for various general debugging purposes. It is defined in stiglitz_defs.h and can be configured to be active when specific segments of integrated functions execute.

3.      PERFORMANCE ANALYSIS

We evaluated the timing accuracy of the integrated code through oscilloscope-based timing measurements and empirical testing; the video signal successfully drives all the television sets tested.

Figure 13.                    Performance of graphics rendering code shows integration increases performance by 4x to 13x.

We measured the line-drawing performance of STIGLitz using the benchmark function SpeedTest in main.c.  This function responds to key presses on the STK500 board and draws a set of 128 lines of with pixel lengths evenly distributed from 1 to 240 (horizontal), 130 (vertical), 170 (diagonal) and 134 (x-major). ‎Figure 14 shows the rendering performance for two different design alternatives: the original discrete rendering and display refresh, and integrated rendering and refresh. In the first case, the graphics primitives are rendered with discrete (non-integrated) code, which can run only when the video refresh ISR is not active, or during the 1.8% of the total time available. The second case uses integrated code when possible to render graphics primitives. Integration speeds up rendering time by 3.99x to 13.54x over the discrete case. The variation in speed-up comes from the amount of rendering work performed per segment and the number of segments needed per secondary thread. Each loop with an unknown iteration count requires at least one segment; this is very inefficient if the loop’s execution time is much less than the idle time of the segment. The horizontal, vertical and diagonal functions all contain a single such loop with a single level of conditionals, allowing efficient integration and only three segments. The x-major function has a much more complex CDG and contains four such loops, one doubly nested. These loops require the formation of nine segments, wasting much of the available idle time. Also, horizontal rendering seems slow because it is drawing more long lines than the other functions.

 


Table 4. Sizes of Original and Integrated Functions

Function

Original Size (bytes)

Padded Size

Integrated Size

Code Expansion Ratio

DrawHorizontalLine

212

260

3120

14.72

DrawVerticalLine

182

246

2718

14.93

DrawDiagonalLine

214

262

3058

14.29

DrawXMajorLine

758

1010

9810

12.94



Table 4 shows how code sizes of functions increase by a factor of 13x to 15x after integration. Various factors contribute to the increase, including padding, loop unrolling and splitting, and code replication into conditionals. Although these code size increases are significant, they apply only to integrated functions, and are an acceptable price to pay given the dramatic performance improvement. Overall, STIGLitz uses 73 kilobytes of ROM and 40 kilobytes of SRAM.  About 16 kilobytes of SRAM are used for the GIF decoder, which could be deleted if not needed.

4.      DOWNLOADS

We are eager to see what readers will come up with based on the STIGLitz platform. To encourage activity, various resources are available at www.cesr.ncsu.edu/agdean/stiglitz for your use. These include C and asm source code for STIGLitz, ExpressPCB files for the video serializer, and an executable version of Thrint for static timing analysis and padding. Our STI research papers are available there as well. Please be sure to check this website for the latest updates, as STIGLitz is being constantly enhanced by my students and me.

5.      CONCLUSIONS AND FUTURE WORK

This article presents STI and demonstrates how it can be used to extract fine-grain concurrency from a common processor. We present a software video generation application with concurrent high-speed serial communication, video refresh and graphics rendering.  We have built and tested the system and performance improves by a factor of about 4x to over 13.5x.

There are various directions for improvements. On the STI theory side, better segment formation methods could offer more consistent performance improvement by using available idle time better when loops have unknown iteration counts. It may be possible to reduce the code size expansion significantly. Implementing more of these transformations automatically in Thrint will accelerate the integration process and simplify debugging. On the STIGLitz side, the NTSC ISR overhead could be reduced and blank scan lines recovered (as in Land’s work) to provide more time for foreground processing. LCDs could be supported with a simple modification to the hardware and changing the frame-buffer to match the display. More graphics rendering primitives could be integrated, and the work deferral queues could be enlarged. The ISR and dispatcher could be modified and hardware added to support video overlay, simple image processing, object tracking or frame capture. Double buffering, an interlaced display and anti-aliased rendering could also be added to improve image quality. Color images are also a possibility. A live system status display showing queue sizes and statistics would also be useful for tuning system parameters and determining which functions are most worth integrating, given the code memory overhead of integration.

Acknowledgements

The authors thank the various students who contributed to this project: Adarsh Seethaye, Deepaksrivats Thirumalai, Robert Morrison, Barret Krupnow, Jimmy Hill, Craig Nowell, and Paul Lee. In addition, thanks go to Atmel and BITS for the kind donations of development tools. This work was funded in part by NSF CAREER grant CCR-0133690.

6.     BIOGRAPHIES

Alex Dean is an assistant professor at North Carolina State University’s Department of Electrical and Computer Engineering. His research focuses on compilers and computer architecture for embedded systems. He has developed and taught introductory and advanced embedded systems courses. He received his MS and PhD in ECE from Carnegie Mellon University. Between the degrees, he worked for over two years at United Technologies Research Center, designing, analyzing and implementing embedded network architectures for jet engines, elevators, cars, and building climate control systems. He can be reached at alex_dean@ncsu.edu.

Ben Welch is currently employed by Sandia National Laboratories. He received his MS in ECE from NCSU in 2003.

Shobhit Kanaujia is a PhD. student at the ECE Department of NCSU. His areas of interests include compilers, embedded systems, and computer architecture. He can be reached via email at sokanauj@ncsu.edu

7.     REFERENCES

[1]     AbsInt. aiSee and XVCG Graph Visualization Software. http://www.absint.com/aisee/

[2]     Atmel Corp., “Atmega 128: 8-Bit AVR Microcontroller with 128K Bytes In-System Programmable Flash,” http://www.atmel.com/dyn/resources/prod_documents/doc2467.pdf

[3]     avr-gcc, http://www.avrfreaks.net/AVRGCC/index.php

[4]     Alexander G. Dean and Richard R. Grzybowski, “A High-Temperature Embedded Network Interface using Software Thread Integration," Second Workshop on Compiler and Architectural Support for Embedded Systems, Washington, DC, October 1-3 1999

[5]     Alexander G. Dean, “Compiling for Concurrency: Planning and Performing Software Thread Integration,” 23rd IEEE Real-Time Systems Symposium, December 3-5, 2002, Austin, TX.

[6]     Dean, A., Shen, J.P. "System-Level Issues for Software Thread Integration: Guest Triggering and Host Selection," 20th IEEE Real-Time Systems Symposium, Phoenix, Arizona, December 1-3, 1999

[7]     Dean, A., Shen, J. P. "Techniques for Software Thread Integration in Real-Time Embedded Systems," 19th IEEE Real-Time Systems Symposium, Madrid, Spain, December 2-4, 1998

[8]     Dean, A., Shen, J. P. "Hardware to Software Migration with Real-Time Thread Integration," 24th EuroMicro Conference, Västerås, Sweden, August 25-27, 1998

[9]     A. Dean. “STIGLitz Project Manual,” CESR Technical Report, www.cesr.ncsu.edu, June 2003.

[10]  Barret Krupnow, Jimmy Hill, Craig Nowell, Paul Lee. “Video Software Thread Integration,” CESR Technical Report, www.cesr.ncsu.edu, December 2002.

[11]  Nagendra J. Kumar, Siddhartha Shivshankar and Alexander G. Dean. “Asynchronous Software Thread Integration for Efficient Software Implementations of Embedded Communication Protocol Controllers,” Center for Efficient, Scalable and Reliable Computing Technical Report, NC State University ECE Department, May 2003.

[12]  Ed Nisley, “Rising Tides”, Dr. Dobb’s Journal, #346, March 2003

[13]  B. Welch, S. Kanaujia, A. Seetharam, D. Thirumalai, A. Dean, “ Extending STI for Demanding Hard-Real-Time Systems,” International Conferences on Compilers, Architecture and Synthesis for Embedded Systems (CASES 2003), November 2003

[14]  J.E. Bresenham, “Algorithm for Computer Control of a Digital Plotter," IBM Systems Journal, 4(1), 1965, pp. 25-30

[15]  Robert Lacoste, “PIC’Spectrum Audio Spectrum Analyzer," Circuit Cellar, September 1998, #98, pp. 24-31

[16]  Robert Lacoste, “The XY-Plotter: Drive High-Resolution LCDs for Less,” Circuit Cellar, September 2003, #133, pp. 42-51

[17]  Ricard Gunee, “Software Generated Video,” http://www.rickard.gunee.com/projects/

[18]  Don Lancaster, Cheap Video Cookbook, Howard W. Sams & Co. Inc., 1978

[19]  Don Lancaster, Don Lancaster’s Tech Musings, #134, 1999, www.tinaja.com

[20]  Bruce Land, “AVR Video Generator: Teaching Programming and Graphics,” Circuit Cellar, January 2003, #150, pp. 40-43

[21]  David Thomas,  http://dt.prohosting.com/pic/pong.html and http://dt.prohosting.com/pic/vidclock.html

[22]  Alberto Riccibitti, http://www.geocities.com/CapeCanaveral/Launchpad/3632/dvm.htm

[23]  Eric Smith, PIC-Tock and PIC-Pong, http://www.brouhaha.com/~eric/pic/

[24]  Ubicom Video Virtual Peripheral Design Challenge and Contest, http://www.sxlist.com/techref/ubicom/contests.htm

[25]  Tom Napier, “Use Frequency Modulation to Send ASCII Data,” Circuit Cellar, January 2003, #150, pp. 12-16