
Communication among processors sustains fast, massively parallel computer

Designers at MasPar Computer Corp. of Sunnyvale, Calif., have developed two chips intended to break the communications bottleneck that often occurs in highly parallel computers. One chip simplifies and accelerates global interprocessor communications; the second carries 32 highly interconnected processing elements (PEs). A fully configured MasPar system can yield performance rates of 10,000 MIPS and 1,000 MFLOPS. Two independent communication schemes are built into the PE chips: a neighborhood mesh connects each PE to its eight nearest neighbors, while a multistage crossbar hierarchy allows each PE to connect to any other PE in the array. The 32 processors on each PE chip are formed from 500,000 transistors, and a RISC-style load-store architecture with local cache memory keeps each PE as small as possible.

Highly parallel computers employ hundreds, even thousands, of small processors to reach execution rates of billions of operations per second. At the same time, however, the interprocessor communication needed for sending commands and transferring data can become a bottleneck when so many processors run simultaneously.

To break that bottleneck, designers at MasPar Computer Corp., Sunnyvale, Calif., developed two custom chips: One simplifies and accelerates global interprocessor communications, and the other supplies 32 highly interconnected processing elements (PEs).

The MasPar system can harness from 1,024 to 16,384 processing elements and, when fully configured, deliver 10,000 MIPS (million instructions per second) and 1,000 MFLOPS (million floating-point operations per second). The system employs a single-instruction-stream, multiple-data-stream (SIMD) architecture directed by an array control unit (ACU).
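Dividing the stated peak rates by the maximum configuration gives a rough sense of scale: 10,000 MIPS spread over 16,384 PEs works out to roughly 0.6 MIPS per processing element, and 1,000 MFLOPS to about 0.06 MFLOPS per PE. This is only a back-of-the-envelope division of the figures quoted above, not a published per-PE specification.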

MasPar at NASA/GSFC

The ACU fetches and decodes instructions and issues control signals to all PEs. All PEs execute the same instruction stream, but each can also have local autonomy over execution, allowing for localized calculations.
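To make that SIMD execution model concrete, the short C sketch below simulates an array of PEs in which a control unit broadcasts a single operation and a per-PE enable flag stands in for the local autonomy mentioned above. The structure and names are illustrative assumptions, not MasPar code.

    #include <stdio.h>

    #define NUM_PES 16            /* tiny array for illustration only */

    /* Each simulated PE has its own registers and an enable flag that
       stands in for the "local autonomy" described in the article. */
    struct pe {
        int reg_a;
        int reg_b;
        int enabled;
    };

    /* The "ACU" broadcasts one operation; every enabled PE applies it
       to its own local data in the same step. */
    static void broadcast_add(struct pe pes[], int n)
    {
        for (int i = 0; i < n; i++)          /* conceptually simultaneous */
            if (pes[i].enabled)
                pes[i].reg_a += pes[i].reg_b;
    }

    int main(void)
    {
        struct pe pes[NUM_PES];

        for (int i = 0; i < NUM_PES; i++) {
            pes[i].reg_a = i;
            pes[i].reg_b = 10;
            pes[i].enabled = (i % 2 == 0);   /* only even PEs participate */
        }

        broadcast_add(pes, NUM_PES);

        for (int i = 0; i < NUM_PES; i++)
            printf("PE %2d: %d\n", i, pes[i].reg_a);
        return 0;
    }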

To achieve the high bandwidth needed for thousands of PEs to communicate among themselves, MasPar designers built two independent communication schemes into the system. One is a neighborhood mesh, or X-net, a local interconnection that ties each PE to its eight nearest neighbors. The other is a multistage crossbar hierarchy that lets each PE connect to any other PE in the array. The X-net forms a 2D grid that wraps around east to west and north to south, forming a torus-like pattern (see the figure below).
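As a rough illustration of that wraparound, the C fragment below numbers the PEs on a rectangular grid and computes a PE's eight X-net neighbors by taking row and column offsets modulo the grid dimensions, so the east-west and north-south edges join as on a torus. The 128-by-128 grid and the linear PE numbering are assumptions for the example, not MasPar's actual layout.

    #include <stdio.h>

    #define ROWS 128     /* illustrative grid: 128 x 128 = 16,384 PEs */
    #define COLS 128

    /* Map a (row, col) pair to a PE number, wrapping at the edges so
       the mesh behaves like a torus (east-west and north-south). */
    static int pe_index(int row, int col)
    {
        int r = ((row % ROWS) + ROWS) % ROWS;
        int c = ((col % COLS) + COLS) % COLS;
        return r * COLS + c;
    }

    int main(void)
    {
        int row = 0, col = 127;   /* a PE on the north-east edge */

        /* The eight X-net neighbors: N, NE, E, SE, S, SW, W, NW. */
        int dr[8] = { -1, -1,  0,  1, 1,  1,  0, -1 };
        int dc[8] = {  0,  1,  1,  1, 0, -1, -1, -1 };

        for (int d = 0; d < 8; d++)
            printf("neighbor %d: PE %d\n", d,
                   pe_index(row + dr[d], col + dc[d]));
        return 0;
    }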

Within each PE chip are packed 500,000 transistors that form 32 processors interconnected in a 2D grid. Multiple PE chips are interconnected in the same way that processors are within a chip.

According to Jeff Kalb, MasPar Computer’s founder and president, each PE is kept as small as possible by using a RISC-style, load-store architecture with local cache memory.

What’s more, only four paths per PE are needed to communicate in eight directions. That’s because the X-shaped crossing points of the communication paths are three-state nodes that switch the data to one of two paths. All of the interprocessor X-net connections use bit-serial communications, so just 24 pins per PE chip are required for the X-net interface.
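One way to picture that sharing is as a fixed pairing of the eight logical directions onto the four physical paths, with each corner node's switch selecting which of its two legs a transfer uses. The C sketch below shows one plausible pairing purely for illustration; the pairing and node-setting encoding are assumptions, not the published MasPar wiring.

    #include <stdio.h>

    /* The eight logical X-net directions. */
    enum dir { N, NE, E, SE, S, SW, W, NW };

    /* The four physical paths leaving each PE, one per corner node. */
    enum path { P_NE, P_SE, P_SW, P_NW };

    /* Fold eight directions onto four paths: each corner node carries
       its diagonal direction plus one adjacent orthogonal direction,
       selected by the node's switch setting. Illustrative only. */
    static void route(enum dir d, enum path *p, int *node_setting)
    {
        static const enum path path_for_dir[8] =
            { P_NE, P_NE, P_SE, P_SE, P_SW, P_SW, P_NW, P_NW };
        *p = path_for_dir[d];
        *node_setting = (d % 2);   /* 0 = orthogonal leg, 1 = diagonal leg */
    }

    int main(void)
    {
        static const char *dn[8] = {"N","NE","E","SE","S","SW","W","NW"};
        static const char *pn[4] = {"NE path","SE path","SW path","NW path"};

        for (int d = 0; d < 8; d++) {
            enum path p;
            int setting;
            route((enum dir)d, &p, &setting);
            printf("direction %-2s -> %s, node setting %d\n",
                   dn[d], pn[p], setting);
        }
        return 0;
    }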

Some computational problems don’t map well onto an X-net topology, however, and require arbitrary interconnections among PEs. To tackle that problem, a team headed by Tom Blank, director of architecture and application development, designed multiple custom router chips that together form a multistage interconnection network, somewhat like a hierarchy of crossbar switches. Each router chip has 64 datapath inputs and 64 datapath outputs, and can switch any input to any output.
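Routing through such a network can be sketched in the style of a generic delta network: each 64-way stage peels off one base-64 digit of the destination address and switches the connection to the corresponding output. The C fragment below illustrates the idea with an assumed two-stage network sized for 1,024 router paths; the stage count and addressing scheme are assumptions, not MasPar's routing logic.

    #include <stdio.h>

    #define RADIX   64   /* each router stage switches 64 inputs to 64 outputs */
    #define STAGES   2   /* 64^2 = 4,096 endpoints, enough for 1,024 router paths */

    /* Set up a path by consuming one base-64 digit of the destination
       address per stage; the digit selects that stage's output port. */
    static void show_route(int dest)
    {
        printf("destination %4d:", dest);
        for (int s = STAGES - 1; s >= 0; s--) {
            int digit = dest;
            for (int i = 0; i < s; i++)
                digit /= RADIX;
            printf("  stage %d -> output %2d", STAGES - s, digit % RADIX);
        }
        printf("\n");
    }

    int main(void)
    {
        show_route(0);
        show_route(63);
        show_route(1023);   /* highest of 1,024 router paths */
        return 0;
    }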

When a PE chip sends data into the router network, each group of 16 PEs is first multiplexed onto one outgoing router path and one incoming router path. Because the router paths are bit serial, the two groups on a 32-PE chip need only four pins for the router interface.

Once a connection is established, the router’s bidirectional data paths send or fetch data. In a 16,384-PE system, up to 1024 simultaneous connections can be established, thus giving an aggregate data rate of over 1 Gbyte/s.
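Those figures imply a per-connection rate on the order of 1 Gbyte/s divided by 1,024 connections, or roughly 1 Mbyte/s (about 8 Mbits/s) over each bit-serial router path. That is a back-of-the-envelope reading of the numbers above, not a published specification.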

The multistage interconnection network is well matched to SIMD architectures because all connections occur with the same timing and sequence in each PE. The common timing and sequencing greatly simplify the logic compared with a hypercube-type architecture.

In a hypercube, different path lengths can cause messages to arrive at each PE at different times, raising hard-to-solve timing considerations. In contrast, the router network in MasPar’s system keeps the message delays identical, eliminating timing problems. In addition, the router network tracks parity for all data and addresses to ensure highly reliable transfers, a critical task when thousands of processors are involved. Furthermore, each router chip includes diagnostic logic that detects open wires in the network.