1000-processor chip churns out 1.78 trillion instructions per second

June 22, 2016 // By Graham Prophet
A team from the University of California, Davis, Department of Electrical and Computer Engineering, has designed a 1000-core processor with a peak throughput of 1.78 trillion instructions per second, containing 621 million transistors. The "KiloCore" was presented at the 2016 Symposium on VLSI Technology and Circuits in Honolulu on June 16, 2016.

“To the best of our knowledge, it is the world’s first 1,000-processor chip and it is the highest clock-rate processor ever designed in a university,” said Bevan Baas, professor of electrical and computer engineering, who led the team that designed the chip architecture. While other multiple-processor chips have been created, none exceed about 300 processors, according to an analysis by Baas’ team.

Most of those other chips were created for research purposes and few are sold commercially. The KiloCore chip has been fabricated and run; it was built by IBM using its 32-nm PD-SOI CMOS technology.

The basic architecture is MIMD (multiple instruction/multiple data). Each core has a seven-stage pipeline and is a general-purpose unit with a 72-instruction set, issuing a single instruction per cycle. The team says that none of the instructions is ‘algorithm-specific’, which distinguishes it from a GPU-class device. The 1.78 trillion instructions/sec figure comes at a clock speed of 1.78 GHz and 1.1 V; running at 0.84 V and 1 GHz, it consumes 13.1 W, while peak power efficiency of 5.8 pJ/Op is quoted at 0.56 V and 115 MHz.
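The headline figure follows directly from the numbers quoted above. As a back-of-envelope sketch (assuming every one of the 1000 cores retires one instruction on every cycle, and that the quoted power figures apply to the whole chip under that load; the function names are illustrative only):

```python
# Figures quoted in the article.
CORES = 1000
INSTRUCTIONS_PER_CYCLE = 1  # "single instruction/cycle" per core


def peak_instructions_per_second(clock_hz, cores=CORES):
    """Upper bound on throughput: every core busy on every cycle."""
    return cores * INSTRUCTIONS_PER_CYCLE * clock_hz


def energy_per_instruction_pj(power_w, clock_hz, cores=CORES):
    """Energy per instruction in picojoules, assuming all cores are busy."""
    joules = power_w / peak_instructions_per_second(clock_hz, cores)
    return joules * 1e12


def implied_power_w(pj_per_op, clock_hz, cores=CORES):
    """Chip power implied by a pJ/Op figure, assuming all cores are busy."""
    return pj_per_op * 1e-12 * peak_instructions_per_second(clock_hz, cores)


if __name__ == "__main__":
    # 1.78 GHz x 1000 cores x 1 instruction/cycle
    print(f"{peak_instructions_per_second(1.78e9):.3e} instructions/s")
    # 13.1 W at 1 GHz
    print(f"{energy_per_instruction_pj(13.1, 1e9):.1f} pJ per instruction")
    # 5.8 pJ/Op at 115 MHz
    print(f"{implied_power_w(5.8, 115e6):.2f} W")
```

Under those assumptions, the 1 GHz operating point works out to about 13.1 pJ per instruction, and the 5.8 pJ/Op point implies a total of roughly 0.67 W at 115 MHz. These are derived estimates, not figures reported by the team.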

Each core is independently powered and can shut down to leakage-only power if it has no task to perform. There is no cache architecture; instead, every processor can store instructions and data in a hierarchy of locations: local memory, one or more nearby processors, on-chip independent memory modules, or off-chip memory.

Each processor communicates via a high-throughput circuit-switched network plus a packet-switched network (both on-chip). The team says there is little energy overhead to source operands from companion processors some way across the chip, as ‘wormhole’ routing is employed. That is, messages from an adjacent or nearby core will be routed via the ‘circuit’ network; those from further away in the processor matrix will travel via the packet network.
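The split between the two networks can be pictured with a small sketch. This is not the KiloCore routing logic: the grid coordinates, the hop-count measure and the `nearby_hops` threshold are all assumptions made for illustration, since the article only says that nearby traffic uses the circuit network and more distant traffic uses the packet network.

```python
def manhattan_distance(src, dst):
    """Hops between two cores on a 2-D mesh, given (row, column) positions."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def choose_network(src, dst, nearby_hops=2):
    """Pick a network for a message from src to dst.

    nearby_hops is a hypothetical cut-off; the real threshold, if one
    exists in this form, is not given in the article.
    """
    if manhattan_distance(src, dst) <= nearby_hops:
        return "circuit"
    return "packet"


if __name__ == "__main__":
    print(choose_network((0, 0), (0, 1)))    # adjacent core
    print(choose_network((0, 0), (12, 20)))  # far across the array
```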

Each core has north-south-east-west comms buffers plus a fifth channel for host-processor traffic; maximum throughput is 45.5 Gbps per router and 9.1 Gbps per port at 1.1 V.
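The per-router figure is consistent with five ports each running at the per-port rate. A quick check, assuming the router total is simply the sum of its five ports (the port names are labels chosen here):

```python
# Four compass ports plus the fifth channel described in the article.
PORTS = ("north", "south", "east", "west", "host")
PORT_GBPS = 9.1  # quoted per-port maximum at 1.1 V


def router_throughput_gbps(ports=PORTS, port_gbps=PORT_GBPS):
    """Aggregate router throughput if every port runs at its maximum rate."""
    return round(len(ports) * port_gbps, 1)


if __name__ == "__main__":
    print(router_throughput_gbps())  # 45.5
```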