Faculty: Eric Rotenberg
Sponsors: Intel, NSF, SRC
Single-processor performance growth has slowed, as technology trends and architectural bottlenecks no longer support deeper and wider pipelining. While placing multiple processors on the same chip (multi-core or many-core) is an important trend, it is not a panacea. Much software is non-parallel, complex, and irregular, and cannot take advantage of multiple processors in traditional ways. Thus, there is an incentive to couple multi-core substrates with new sequential-program-centric execution models.
This project explores new execution models that try to capture the current influence of a running program's data objects on the whole program, potentially extending the reach of the processor to the whole program. In one proposed architecture, changes to objects are met in real-time by specializing the few candidate fragments of code that could operate next on the objects, simplifying and resolving the code based on the new data stored within the objects. This significantly reduces the length of the running program. Although the program itself is not parallelized (it is reduced), the many independent specialization tasks corresponding to objects provide abundant meta-parallelism. The responsibility for different objects is distributed among different processors in a multi-core or many-core substrate. One interpretation is that the specialization tasks among many cores constitute a virtual dataflow execution of the whole program. Another proposed facet is maintaining persistent meta-data about the program among specialization cores -- information a compiler has (and possibly more) but which is now embedded in the execution fabric, enabled by the potentially deep computation and memory capacity of future many-core substrates.
This project encompasses multiple dimensions, architectures, and applications, with complementary aspects funded by Intel, NSF, and SRC.
Faculty: Eric Rotenberg
Sponsors: NSF, Texas Instruments, CACC
Next-generation computing/communication devices, such as cell phones and wireless sensor network nodes, will require deeper storage capacity as functionality and feature sets increase. This need can be better met by solving the DRAM refresh problem, thereby reaping the capacity benefits of DRAM without impacting battery life.
The key lies with exploiting dramatic variations in retention times among different DRAM pages. We recently proposed Retention-Aware Placement in DRAM (RAPID), novel software approaches that can exploit off-the-shelf DRAMs to reduce refresh power to vanishingly small levels approaching non-volatile memory. The key idea is to favor longer-retention pages over shorter-retention pages when allocating DRAM pages. This allows selecting a single refresh period that depends on the shortest-retention page among populated pages, instead of the shortest-retention page overall. We explored three versions of RAPID and observed refresh energy savings of 83%, 93%, and 95%, relative to conventional temperature-compensated refresh. RAPID with off-the-shelf DRAM also approaches the energy levels of idealized techniques that require custom DRAM support. This ultimately yields a software implementation of quasi-non-volatile DRAM.
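The placement idea can be illustrated with a minimal Python sketch. It is not the RAPID implementation: the function names, the per-page retention table, and the example values are all hypothetical, and a real system would obtain retention times by profiling the DRAM.

```python
from typing import Iterable, List, Set


def allocate_pages(retention_ms: List[float], free: Set[int], n: int) -> List[int]:
    """Pick n free pages, preferring the longest retention times (hypothetical helper)."""
    if n > len(free):
        raise ValueError("not enough free pages")
    # Sort free pages from longest to shortest retention and take the first n.
    chosen = sorted(free, key=lambda p: retention_ms[p], reverse=True)[:n]
    free.difference_update(chosen)
    return chosen


def refresh_period_ms(retention_ms: List[float], populated: Iterable[int]) -> float:
    """A single refresh period is bounded by the shortest-retention populated page."""
    pages = list(populated)
    if not pages:
        return float("inf")  # nothing populated, so nothing needs refreshing
    return min(retention_ms[p] for p in pages)


# Illustrative (made-up) retention times, in milliseconds, for six pages.
retention = [64, 500, 2000, 128, 1000, 4000]
free_pages = set(range(len(retention)))

populated = allocate_pages(retention, free_pages, 3)
rapid_period = refresh_period_ms(retention, populated)
conventional_period = refresh_period_ms(retention, range(len(retention)))
print(populated, rapid_period, conventional_period)
```

With these made-up numbers, the three populated pages are the longest-retention ones, so the refresh period is set by the 1000 ms page rather than by the 64 ms page that bounds the whole device.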
In addition to providing real value for highly functional, energy-constrained, and cost-constrained computing/communication devices, we believe RAPID is inexpensively deployable because it is based solely on software and commodity off-the-shelf DRAM. The next step in this research is to integrate RAPID into one or more real system prototypes, including a cell phone and a wireless sensor network node. We are currently designing and building a wireless sensor network that will be used as a testbed for experimenting with RAPID.
This project is sponsored by the Center for Advanced Computing and Communication, the National Science Foundation, and Texas Instruments.
Faculty: Tom Conte
Sponsor: EEMBC
The Embedded Microprocessor Benchmark Consortium (EEMBC) is sponsoring a project to characterize its benchmark suites. This research aims to find a representative set of metrics that capture the underlying behavior of each benchmark. The final goal is a method by which EEMBC users can match their own workloads to an EEMBC characterization, in order to determine which benchmarks are most representative of their application.
A unique aspect of this work is its use of performance-targeted hardware metrics. For example, one measurement finds the minimum cache requirements to achieve a 1% or 0.1% miss ratio. Investigating benchmark performance only on a particular system, as is usually the case, may mask some of the benchmark’s important activity. However, the measurement of performance-targeted hardware requirements depends primarily on the inherent program behavior of the workload, and thus gives a clearer picture of the benchmark activity.
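As an illustration of such a metric, the following Python sketch searches for the smallest cache that meets a target miss ratio on a given trace. It is an assumption-laden toy and not the project's methodology: it models a fully associative LRU cache, tries only power-of-two capacities, and uses a synthetic trace of block addresses; all names are hypothetical.

```python
from collections import OrderedDict
from typing import Optional, Sequence


def lru_miss_ratio(trace: Sequence[int], capacity: int) -> float:
    """Miss ratio of a fully associative LRU cache holding `capacity` blocks."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used block
    return misses / len(trace)


def min_cache_blocks(trace: Sequence[int], target: float,
                     max_blocks: int = 1 << 20) -> Optional[int]:
    """Smallest power-of-two capacity whose miss ratio is at most `target`."""
    capacity = 1
    while capacity <= max_blocks:
        if lru_miss_ratio(trace, capacity) <= target:
            return capacity
        capacity *= 2
    return None  # target not reachable within max_blocks


# Synthetic trace: a loop that touches 8 distinct blocks, repeated 1000 times.
trace = list(range(8)) * 1000
needed = min_cache_blocks(trace, target=0.01)
print(needed)
```

For this synthetic loop, any LRU cache smaller than the 8-block working set misses on every access, so the requirement is determined by the program's own reuse pattern rather than by any particular machine.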
In the future, this work will develop a methodology for matching user workloads to the EEMBC characterizations, and will also explore the usefulness of the hardware metrics for designing benchmark-specific processors. This will prove useful to designers in the embedded world, who are faced with tough hardware choices when choosing the optimal system for their application.
Faculty: Tom Conte
Sponsor: Qualcomm
An optimizing compiler is a major asset to any embedded processor. Many powerful optimizing compilers, such as ARM-CC, are closed source, making third-party modification impossible. Open-source compilers such as GCC can generate high-performance code and can be easily ported to a wide variety of architectures. The main drawback of these compilers is that implementing target-specific optimizations in them is very complex. In GCC, describing the machine’s hazards and pipeline stages is prohibitively hard and highly prone to errors.
To facilitate the incorporation of the machine description into GCC (gcc-dfa), we created a translator from a high-level machine description to gcc-dfa. The language used to write this high-level machine description (MDES) is highly intuitive and simple to use. Our open-source translator (HMDES2MD) parses this description and converts it into the equivalent GCC machine description. The figure below shows the data path of a simple processor and its equivalent representation in both gcc-md format and MDES format. The MDES version is noticeably easier to read.

This tool is currently used at Qualcomm in Cary (Research Triangle Park), NC to create an open-source experimental compiler for their new ARM ISA-based Scorpion processor. The backend for this machine is written in MDES and translated to gcc-dfa using the HMDES2MD translator. In addition to implementing the backend, this open-source compiler is used to test additional scheduling techniques, such as comparing an aggressive treegion scheduler against a simple list scheduler. The GCC framework with the simple backend generator also facilitated studying and controlling instruction mixes and substitutions to increase performance.
Faculty: Alex Dean
Many real-time (RT) embedded systems could benefit from a memory hierarchy to bridge the processor/memory speed gap. These RT embedded systems usually use a cacheless architecture to avoid the time variability that complicates the timing analysis essential for RT systems. In the absence of a cache, the burden of allocating the data objects to the memory hierarchy falls on the programmer or compiler.
We have developed a synergistic, optimal approach to allocating data objects and scheduling real-time tasks for embedded systems. We allocate data using integer linear programming (ILP) to minimize each task's worst-case execution time (WCET), then perform preemption threshold scheduling (PTS) on the tasks to reduce stack memory requirements while still meeting hard RT deadlines. The memory reduction of PTS allows these steps to be repeated. The data objects now require less memory, so more can fit into faster memory, further reducing WCET. The increased slack time can be used by PTS to reduce preemptions further, until a fixed point is reached.
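The allocation step alone can be sketched in Python as follows. This is a toy under stated assumptions, not the approach described above: it replaces the ILP solver with exhaustive search over a handful of objects, covers a single task, and models the memory freed by PTS simply as extra fast-memory capacity. The object names, sizes, and WCET savings are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def best_allocation(objects: Dict[str, Tuple[int, int]],
                    capacity: int) -> Tuple[Set[str], int]:
    """Choose the objects to place in fast memory that maximize WCET savings.

    objects maps name -> (size_bytes, wcet_saving_cycles). Exhaustive search
    stands in for an ILP solver and is only practical for a few objects.
    """
    names = sorted(objects)
    best_set: Tuple[str, ...] = ()
    best_saving = 0
    for r in range(len(names) + 1):
        for subset in combinations(names, r):
            size = sum(objects[n][0] for n in subset)
            saving = sum(objects[n][1] for n in subset)
            if size <= capacity and saving > best_saving:
                best_set, best_saving = subset, saving
    return set(best_set), best_saving


# Hypothetical data objects: (size in bytes, WCET cycles saved if in fast memory).
objects = {"buf": (512, 900), "lut": (256, 700), "state": (128, 300)}

before = best_allocation(objects, capacity=512)  # fast memory left before PTS
after = best_allocation(objects, capacity=768)   # assume PTS freed 256 bytes of stack
print(before, after)
```

In this made-up example, the extra 256 bytes let the two most valuable objects fit together, which raises the WCET saving; in the iterative scheme, that added slack would then feed back into the next scheduling pass.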