An implementation of the Advanced RISC Machine microprocessor architecture using the micropipeline design style. In April 1994 the Amulet group in the Computer Science department of Manchester University took delivery of the AMULET1 microprocessor. This was their first large-scale asynchronous circuit and the world's first implementation of a commercial microprocessor architecture (ARM) in asynchronous logic.
Work began at the end of 1990 and the design was despatched for fabrication in February 1993. The primary intent was to demonstrate that an asynchronous microprocessor can consume less power than a synchronous design.
The design incorporates a number of concurrent units which cooperate to give instruction-level compatibility with the existing synchronous part. These include an Address unit, which autonomously generates instruction fetch requests and interleaves (nondeterministically) data requests from the Execution unit; a Register file, which supplies operands, queues write destinations and handles data dependencies; an Execution unit, which includes a multiplier, a shifter and an ALU with data-dependent delay; a Data interface, which performs byte extraction and alignment and includes an instruction prefetch buffer; and a control path, which performs instruction decode. These units synchronise only to exchange data.
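As a loose software analogy (not the AMULET1 circuitry itself), the idea of units that run at their own pace and synchronise only to exchange data can be sketched with threads joined by one-place queues. The names `stage`, `feed`, `decode`, `execute` and `run_pipeline` are hypothetical and chosen only for illustration; a blocking `get` stands in for waiting on a request and a blocking `put` for waiting on an acknowledge.

```python
import queue
import threading
import time

STOP = object()  # sentinel marking the end of the item stream


def feed(items, outbox):
    """Source unit: issue items one at a time, then signal completion."""
    for item in items:
        outbox.put(item)  # blocks while the next unit is still busy
    outbox.put(STOP)


def stage(work, inbox, outbox):
    """One self-timed unit: wait for data, process it, hand it on."""
    while True:
        item = inbox.get()      # wait until the upstream unit offers data
        if item is STOP:
            outbox.put(STOP)    # pass the end marker downstream and finish
            return
        outbox.put(work(item))  # wait until the downstream unit has room


def decode(word):
    # Hypothetical stand-in for a decode step.
    return word * 2


def execute(operand):
    # Data-dependent delay: the time taken varies with the operand value.
    time.sleep(0.001 * (operand % 3))
    return operand + 1


def run_pipeline(items):
    """Push items through feed -> decode -> execute and collect the results."""
    q0, q1, q2 = (queue.Queue(maxsize=1) for _ in range(3))
    threads = [
        threading.Thread(target=feed, args=(items, q0)),
        threading.Thread(target=stage, args=(decode, q0, q1)),
        threading.Thread(target=stage, args=(execute, q1, q2)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        item = q2.get()
        if item is STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3]))
```

There is no shared clock in this sketch: each stage proceeds as soon as its neighbours allow, and the order of results is fixed even though the timing of each step is not.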
The design demonstrates that all the usual problems of processor design can be solved in this asynchronous framework: backward instruction set compatibility, interrupts and exact exceptions for memory faults are all covered. It also demonstrates some unusual behaviour, for instance nondeterministic prefetch depth beyond a branch instruction (though the instructions which actually get executed are, of course, deterministic). There are some unusual problems for compiler optimisation, as the metric which must be used to compare alternative code sequences is continuous rather than discrete, and the nondeterminism in external behaviour must also be taken into account.
The chip was designed using a mixture of custom datapath and compiled control logic elements, as was the synchronous ARM. The fabrication technology is the same as that used for one version of the synchronous part, reducing the number of variables when comparing the two parts.
Two silicon implementations have been received and preliminary measurements have been taken from these. The first, fabricated on a 0.7 micron process, has achieved about 28 kDhrystones running the standard benchmark program. The other, a 1 micron implementation, achieves about 20 kDhrystones. For the faster of the parts this is equivalent to a synchronous ARM6 clocked at around 20 MHz; in the case of AMULET1 it is likely that this speed is limited by the memory system cycle time (just over 50 ns) rather than by the processor chip itself.
A fair comparison of devices at the same geometries gives the AMULET1 performance as about 70% of that of an ARM6 running at 20 MHz. Its power consumption is very similar to that of the ARM6; the AMULET1 therefore delivers about 80 MIPS/W (compared with around 120 from a 20 MHz ARM6). Multiplication is several times faster on the AMULET1 owing to the inclusion of a specialised asynchronous multiplier. This performance is reasonable considering that the AMULET1 is a first generation part, whereas the synchronous ARM has undergone several design iterations. AMULET2 (under development in 1994) was expected to be three times faster than AMULET1 and use less power.
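The step from "about 70% of the performance at very similar power" to "about 80 MIPS/W" can be checked with a line of arithmetic. The sketch below uses only the figures quoted above and treats "very similar" power as exactly equal, which is an assumption; the function name `scaled_efficiency` is hypothetical.

```python
ARM6_MIPS_PER_WATT = 120.0   # quoted figure for a 20 MHz ARM6
RELATIVE_PERFORMANCE = 0.70  # AMULET1 performance relative to that ARM6
RELATIVE_POWER = 1.0         # assumption: "very similar" read as equal power


def scaled_efficiency(reference_mips_per_watt, relative_performance, relative_power):
    """Scale a reference MIPS/W figure by relative performance and power."""
    return reference_mips_per_watt * relative_performance / relative_power


amulet1_mips_per_watt = scaled_efficiency(
    ARM6_MIPS_PER_WATT, RELATIVE_PERFORMANCE, RELATIVE_POWER
)

if __name__ == "__main__":
    print(round(amulet1_mips_per_watt, 1))
```

Under these assumptions the result is roughly 84 MIPS/W, consistent with the quoted "about 80" given the rounding in the inputs.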
The macrocell size (without pad ring) is 5.5 mm by 4.5 mm on a 1 micron CMOS process, which is about twice the area of the synchronous part. Some of the increase can be attributed to the more sophisticated organisation of the new part: it has a deeper pipeline than the clocked version and it supports multiple outstanding memory requests; there is also specialised circuitry to increase the multiplication speed. Although there is undoubtedly some overhead attributable to the asynchronous control logic, this is estimated to be closer to 20% than to the 100% suggested by the direct comparison.
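The area figures can be worked through in the same way. This is a rough sketch only: it reads "about twice" as a ratio of exactly 2 and takes the 20% control overhead as a fraction of the synchronous part's area, both of which are assumptions rather than statements from the designers. The function name `area_breakdown` is hypothetical.

```python
WIDTH_MM = 5.5           # quoted macrocell width
HEIGHT_MM = 4.5          # quoted macrocell height
AREA_RATIO = 2.0         # assumption: "about twice" read as exactly 2
CONTROL_OVERHEAD = 0.20  # assumption: overhead relative to the synchronous area


def area_breakdown(width_mm, height_mm, area_ratio, control_overhead):
    """Split the macrocell area into rough contributions, in square millimetres."""
    total = width_mm * height_mm
    synchronous = total / area_ratio            # implied area of the clocked part
    control = synchronous * control_overhead    # asynchronous control overhead
    organisation = total - synchronous - control  # deeper pipeline, multiplier, etc.
    return {
        "total": total,
        "synchronous": synchronous,
        "control": control,
        "organisation": organisation,
    }


breakdown = area_breakdown(WIDTH_MM, HEIGHT_MM, AREA_RATIO, CONTROL_OVERHEAD)

if __name__ == "__main__":
    for name, value in breakdown.items():
        print(f"{name}: {value:.2f} mm^2")
```

On this reading the macrocell is about 24.75 square millimetres, of which roughly 2.5 would be asynchronous control overhead and the larger part of the increase would come from the more sophisticated organisation.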
The work was part of a broad ESPRIT-funded investigation into low-power technologies within the European Open Microprocessor systems Initiative (OMI) programme, where there is interest in low-power techniques both for portable equipment and (in the longer term) to alleviate the problems of the increasingly high dissipation of high-performance chips. This initial investigation into the role asynchronous logic might play has now demonstrated that asynchronous techniques can be applied to problems of the scale of a complete microprocessor.