For decades, the trend was for more and more of a computer’s systems to be integrated onto a single chip. Today’s system-on-chips, which power smartphones and servers alike, are the result. But complexity and cost are starting to erode the idea that everything should be on a single slice of silicon.
Already, some of the most advanced processors, such as AMD’s Zen 2 processor family, are actually collections of chiplets bound together by high-bandwidth connections within a single package. This week at the IEEE International Solid-State Circuits Conference (ISSCC) in San Francisco, French research organization CEA-Leti showed how far this scheme can go, creating a 96-core processor out of six chiplets.
Using an unrelated technology it had in development, Eta Compute pivoted toward more traditional neural networks, such as those used in deep learning, and is reaping the rewards. The Westlake Village, Calif.-based company revealed on Wednesday that its first production chips using that technology are now shipping.
Artificial intelligence today is much less than it could be, according to Andrew Feldman, CEO and cofounder of AI computer startup Cerebras Systems.
The problem, as he and his fellow Cerebras founders see it, is that today’s artificial neural networks are too time-consuming and compute-intensive to train. For, say, a self-driving car to recognize all the important objects it will encounter on the road, the car’s neural network has to be shown many, many images of all those things. That process happens in a data center where computers consuming tens or sometimes hundreds of kilowatts are dedicated to what is too often a weeks-long task. Assuming the resulting network can carry out the task with the needed accuracy, the many coefficients that define the strength of connections in the network are then downloaded to the car’s computer, which performs the other half of deep learning, called inference.
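That training-then-inference split can be sketched in a few lines of plain Python. Everything below (the single-neuron "network," the data, the hyperparameters) is a toy illustration of the two halves of deep learning, not anything from Cerebras:

```python
import math
import random

random.seed(0)

def train(samples, labels, lr=0.5, epochs=200):
    """The data-center half: fit a single logistic 'neuron' by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b  # the coefficients that get "downloaded to the car"

def infer(w, b, x):
    """The edge half: apply the frozen coefficients to new input."""
    return 1.0 / (1.0 + math.exp(-(w * x + b))) > 0.5

# Toy task: classify numbers as positive or not, from noisy-looking samples.
xs = [random.uniform(-2, 2) for _ in range(100)]
ys = [1 if x > 0 else 0 for x in xs]
w, b = train(xs, ys)       # slow, compute-hungry step
print(infer(w, b, 1.5))    # fast, cheap step
print(infer(w, b, -1.5))
```

The expensive part is the training loop; once the coefficients `w` and `b` exist, inference is a single arithmetic pass, which is why it can run on a car's modest onboard computer.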
Cerebras’s customers—and it already has some, despite emerging from stealth mode only this past summer—complain that training runs for big neural networks on today’s computers can take as long as six weeks. At that rate, they are able to train only maybe six neural networks in a year. “The idea is to test more ideas,” says Feldman. “If you can [train a network] instead in 2 or 3 hours, you can run thousands of ideas.”
When IEEE Spectrum visited Cerebras’s headquarters in Los Altos, Calif., those customers and some potential new ones were already pouring their training data into four CS-1 computers through orange-jacketed fiber-optic cables. These 64-centimeter-tall machines churned away, while the heat exhaust of the 20 kilowatts being consumed by each blew out into the Silicon Valley streets through a hole cut into the wall.
The CS-1 computers themselves weren’t much to look at from the outside. Indeed, about three-quarters of each chassis is taken up with the cooling system. What’s inside that last quarter is the real revolution: a hugely powerful computer made up almost entirely of a single chip. But that one chip extends over 46,255 square millimeters—more than 50 times the size of any other processor chip you can buy. With 1.2 trillion transistors, 400,000 processor cores, 18 gigabytes of SRAM, and interconnects capable of moving 100 million billion bits per second, Cerebras’s Wafer Scale Engine (WSE) defies easy comparison with other systems.
The statistics Cerebras quotes are pretty astounding. According to the company, a 10-rack TPU2 cluster—the second of what are now three generations of Google AI computers—consumes five times as much power and takes up 30 times as much space to deliver just one-third of the performance of a single computer with the WSE. Whether a single massive chip is really the answer the AI community has been waiting for should start to become clear this year. “The [neural-network] models are becoming more complex,” says Mike Demler, a senior analyst with the Linley Group, in Mountain View, Calif. “Being able to quickly train or retrain is really important.”
Customers such as supercomputing giant Argonne National Laboratory, near Chicago, already have the machines on their premises, and if Cerebras’s conjecture is true, the number of neural networks doing amazing things will explode.
When the founders of Cerebras—veterans of SeaMicro, a server business acquired by AMD—began meeting in 2015, they wanted to build a computer that perfectly fit the nature of modern AI workloads, explains Feldman. Those workloads are defined by a few things: They need to move a lot of data quickly, they need memory that is close to the processing core, and those cores don’t need to work on data that other cores are crunching.
This suggested a few things immediately to the company’s veteran computer architects, including Gary Lauterbach, its chief technical officer. First, they could use thousands and thousands of small cores designed to do the relevant neural-network computations, as opposed to fewer more general-purpose cores. Second, those cores should be linked together with an interconnect scheme that moves data quickly and at low energy. And finally, all the needed data should be on the processor chip, not in separate memory chips.
The need to move data to and from these cores was, in large part, what led to the WSE’s uniqueness. The fastest, lowest-energy way to move data between two cores is to have them on the same silicon substrate. The moment data has to travel from one chip to another, there’s a huge cost in speed and power because distances are longer and the “wires” that carry the signals must be wider and less densely packed.
The drive to keep all communications on silicon, coupled with the desire for small cores and local memory, all pointed to making as big a chip as possible, maybe one as big as a whole silicon wafer. “It wasn’t obvious we could do that, that’s for sure,” says Feldman. But “it was fairly obvious that there were big benefits.”
For decades, engineers had assumed that a wafer-scale chip was a dead end. After all, no less a luminary than the late Gene Amdahl, chief architect of the IBM System/360 mainframe, had tried and failed spectacularly at it with a company called Trilogy Systems. But Lauterbach and Feldman say that any comparison with Amdahl’s attempt is laughably out-of-date. The wafers Amdahl was working with were one-tenth the size of today’s, and features that made up devices on those wafers were 30 times the size of today’s.
More important, Trilogy had no way of handling the inevitable errors that arise in chip manufacturing. Everything else being equal, the likelihood of there being a defect increases as the chip gets larger. If your chip is nearly the size of a sheet of letter-size paper, then you’re pretty much asking for it to have defects.
But Lauterbach saw an architectural solution: Because the workload they were targeting favors having thousands of small, identical cores, it was possible to fit in enough redundant cores to account for the defect-induced failure of even 1 percent of them and still have a very powerful, very large chip.
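A back-of-the-envelope yield calculation shows why that redundancy matters. The defect density below is an assumed, illustrative figure (foundries don't publish theirs); the wafer area and core count are from the article, and the simple Poisson model treats each defect as killing at most one core:

```python
import math

D = 0.001              # ASSUMED defects per square millimeter
small_die = 100.0      # mm^2, a conventional processor die
wafer_scale = 46255.0  # mm^2, the WSE's area (from the article)
cores = 400_000        # WSE core count (from the article)

# A monolithic die works only if it catches zero defects: P = exp(-D*A).
print(f"small die yield: {math.exp(-D * small_die):.1%}")
print(f"wafer-scale yield without redundancy: {math.exp(-D * wafer_scale):.3%}")

# With hundreds of thousands of identical cores, a defect disables only
# the core it lands on, so what matters is the expected number of casualties.
expected_bad = D * wafer_scale
print(f"expected defective cores: {expected_bad:.0f} "
      f"({expected_bad / cores:.3%} of all cores)")
```

Under these assumptions a conventional die yields around 90 percent, a monolithic wafer-scale die essentially never, yet the expected damage is only a few dozen cores out of 400,000, comfortably inside a 1 percent redundancy budget.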
Of course, Cerebras still had to solve a host of manufacturing issues to build its defect-tolerant giganto chip. For example, photolithography tools are designed to cast their feature-defining patterns onto relatively small rectangles, and to do that over and over. That limitation alone would keep a lot of systems from being built on a single wafer, because of the cost and difficulty of casting different patterns in different places on the wafer.
But the WSE doesn’t require that. It resembles a typical wafer full of the exact same chips, just as you’d ordinarily manufacture. The big challenge was finding a way to link those pseudochips together. Chipmakers leave narrow edges of blank silicon called scribe lines around each chip. The wafer is typically diced up along those lines. Cerebras worked with Taiwan Semiconductor Manufacturing Co. (TSMC) to develop a way to build interconnects across the scribe lines so that the cores in each pseudochip could communicate.
With all communications and memory now on a single slice of silicon, data could zip around unimpeded, producing a core-to-core bandwidth of 1,000 petabits per second and an SRAM-to-core bandwidth of 9 petabytes per second. “It’s not just a little more,” says Feldman. “It’s four orders of magnitude greater bandwidth, because we stay on silicon.”
Scribe-line-crossing interconnects weren’t the only invention needed. Chip-manufacturing hardware had to be modified. Even the software for electronic design automation had to be customized for working on such a big chip. “Every rule and every tool and every manufacturing device was designed to pick up a normal-sized chocolate chip cookie, and [we] delivered something the size of the whole cookie sheet,” says Feldman. “Every single step of the way, we have to invent.”
Wafer-scale integration “has been dismissed for the last 40 years, but of course, it was going to happen sometime,” he says. Now that Cerebras has done it, the door may be open to others. “We think others will seek to partner with us to solve problems outside of AI.”
Indeed, engineers at the University of Illinois and the University of California, Los Angeles, see Cerebras’s chip as a boost to their own wafer-scale computing efforts using a technology called silicon-interconnect fabric [see “Goodbye, Motherboard. Hello, Silicon-Interconnect Fabric,” IEEE Spectrum, October 2019]. “This is a huge validation of the research we’ve been doing,” says the University of Illinois’s Rakesh Kumar. “We like the fact that there is commercial interest in something like this.”
The CS-1 is more than just the WSE chip, of course, but it’s not much more. That’s both by design and necessity. What passes for the motherboard is a power-delivery system that sits above the chip and a water-cooled cold plate below it. Surprisingly enough, it was the power-delivery system that was the biggest challenge in the computer’s development.
The WSE’s 1.2 trillion transistors are designed to operate at about 0.8 volts, pretty standard for a processor. There are so many of them, though, that in all they need 20,000 amperes of current. “Getting 20,000 amps into the wafer without significant voltage drop is quite an engineering challenge—much harder than cooling it or addressing the yield problems,” says Lauterbach.
Power can’t be delivered from the edge of the WSE, because the resistance in the interconnects would drop the voltage to zero long before it reached the middle of the chip. The answer was to deliver it vertically from above. Cerebras designed a fiberglass circuit board holding hundreds of special-purpose chips for power control. One million copper posts bridge the millimeter or so from the fiberglass board to points on the WSE.
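The arithmetic behind the problem is simple and stark. The supply voltage and current come from the article; the resistance figure is just the back-of-envelope consequence:

```python
# Power delivery to the WSE, roughly.
V = 0.8      # volts at the transistors (from the article)
I = 20_000   # amperes into the wafer (from the article)

power = V * I
print(f"total power: {power / 1000:.0f} kW")

# Edge delivery fails because of IR drop: at 20,000 A, the lateral
# resistance that would eat the entire 0.8-V supply is only
R_fatal = V / I
print(f"{R_fatal * 1e6:.0f} microohms")
```

Forty microohms is far less than the resistance of the on-chip power network across a 21-centimeter span, which is why the current has to come straight down through a million posts rather than in from the sides.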
Delivering power in this way might seem straightforward, but it isn’t. In operation, the chip, the circuit board, and the cold plate all warm up to the same temperature, but they expand when doing so by different amounts. Copper expands the most, silicon the least, and the fiberglass somewhere in between. Mismatches like this are a headache in normal-size chips because the change can be enough to shear away their connection to a printed circuit board or produce enough stress to break the chip. For a chip the size of the WSE, even a small percentage change in size translates to millimeters.
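A rough calculation with textbook expansion coefficients shows the scale of the mismatch. The temperature swing here is an assumption; the edge length follows from the chip's quoted area:

```python
# Thermal expansion: delta_L = alpha * L * delta_T
alpha_cu = 17e-6   # copper, per degree C (textbook value)
alpha_si = 2.6e-6  # silicon, per degree C (textbook value)
length_mm = 215.0  # roughly the WSE's edge, sqrt(46,255 mm^2)
delta_T = 50.0     # ASSUMED warm-up from power-on, degrees C

expand_cu = alpha_cu * length_mm * delta_T
expand_si = alpha_si * length_mm * delta_T
print(f"copper grows {expand_cu:.3f} mm, silicon {expand_si:.3f} mm")
print(f"mismatch: {expand_cu - expand_si:.3f} mm")
```

Even under this modest assumed temperature swing, copper and silicon slide past each other by well over a tenth of a millimeter across the wafer, far more than a tiny power post can flex.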
“The challenge of [coefficient of thermal expansion] mismatch with the motherboard was a brutal problem,” says Lauterbach. Cerebras searched for a material with the right intermediate coefficient of thermal expansion, something between those of silicon and fiberglass. Only that would keep the million power-delivery posts connected. But in the end, the engineers had to invent one themselves, an endeavor that took a year and a half to accomplish.
In 2018, Google, Baidu, and some top academic groups began working on benchmarks that would allow apples-to-apples comparisons among systems. The result, a benchmark suite called MLPerf, released its first training benchmarks in May 2018.
According to those benchmarks, the technology for training neural networks has made some huge strides in the last few years. On the ResNet-50 image-classification problem, the Nvidia DGX SuperPOD—essentially a 1,500-GPU supercomputer—finished in 80 seconds. It took 8 hours on Nvidia’s DGX-1 machine (circa 2017) and 25 days using the company’s K80 from 2015.
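The speedups implied by those three data points are easy to check:

```python
# ResNet-50 training times quoted above, converted to seconds.
k80_2015 = 25 * 24 * 3600   # 25 days on the K80
dgx1_2017 = 8 * 3600        # 8 hours on the DGX-1
superpod_2019 = 80          # 80 seconds on the DGX SuperPOD

print(f"DGX-1 vs. K80:      {k80_2015 / dgx1_2017:,.0f}x")
print(f"SuperPOD vs. DGX-1: {dgx1_2017 / superpod_2019:,.0f}x")
print(f"SuperPOD vs. K80:   {k80_2015 / superpod_2019:,.0f}x")
```

That's a 27,000-fold improvement in four years, albeit much of it bought by throwing 1,500 GPUs at the problem rather than by per-chip gains.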
Cerebras hasn’t released MLPerf results or any other independently verifiable apples-to-apples comparisons. Instead the company prefers to let customers try out the CS-1 using their own neural networks and data.
This approach is not unusual, according to analysts. “Everybody runs their own models that they developed for their own business,” says Karl Freund, an AI analyst at Moor Insights. “That’s the only thing that matters to buyers.”
Early customer Argonne National Laboratory, for one, has some pretty intense needs. In training a neural network to recognize, in real time, different types of gravitational-wave events, scientists recently used one-quarter of the resources of Argonne’s megawatt-consuming Theta supercomputer, the 28th most powerful system in the world.
Cutting power consumption down to mere kilowatts seems like a key benefit in supercomputing. Unfortunately, Lauterbach doubts that this feature will be much of a selling point in data centers. “While a lot of data centers talk about [conserving] power, when it comes down to it…they don’t care,” he says. “They want performance.” And that’s something a processor nearly the size of a dinner plate can certainly provide.
This article appears in the January 2020 print issue as “Huge Chip Smashes Deep Learning’s Speed Barrier.”
Three research directions should bind chiplets more tightly together
Packaging has arguably never been a hotter subject. With Moore’s Law no longer providing the oomph it once did, one path to better computing is to connect chips more tightly together within the same package.
At Semicon West earlier this month, Intel showed off three new research efforts in packaging. One combines two of its existing technologies to more tightly integrate chiplets—smaller chips linked together in a package to form the kind of system that would, until recently, be made as a single large chip. Another adds better power delivery to dies at the top of a 3D stack of chips. And the final one is an improvement on Intel’s chiplet-to-chiplet interface called Advanced Interface Bus (AIB).
Michigan team builds memristors atop standard CMOS logic to demo a system that can do a variety of edge computing AI tasks
Hoping to speed AI and neuromorphic computing and cut down on power consumption, startups, scientists, and established chip companies have all been looking to do more computing in memory rather than in a processor’s computing core. Memristors and other nonvolatile memories seem to lend themselves to the task particularly well. However, most demonstrations of in-memory computing have been in standalone accelerator chips that either are built for a particular type of AI problem or that need the off-chip resources of a separate processor in order to operate. University of Michigan engineers are claiming the first memristor-based programmable computer for AI that can work all on its own.
Individual accelerator chips can be ganged together in a single module to tackle both the small jobs and the big ones without sacrificing efficiency
There’s no doubt that GPU-powerhouse Nvidia would like to have a solution for all size scales of AI—from massive data center jobs down to the always-on, low-power neural networks that listen for wakeup words in voice assistants.
Right now, that would take several different technologies, because none of them scale up or down particularly well. It’s clearly preferable to be able to deploy one technology rather than several. So, according to Nvidia chief scientist Bill Dally, the company has been seeking to answer the question: “Can you build something scalable… while still maintaining competitive performance-per-watt across the entire spectrum?”
It looks like the answer is yes. Last month at the VLSI Symposia in Kyoto, Nvidia detailed a tiny test chip that can work on its own to do the low-end jobs or be linked tightly together with up to 36 of its kin in a single module to do deep learning’s heavy lifting. And it does it all while achieving roughly the same top-class performance.
The individual accelerator chip is designed to perform the execution side of deep learning rather than the training part. Engineers generally measure the performance of such “inferencing” chips in terms of how many operations they can do per joule of energy or per square millimeter of area. A single one of Nvidia’s prototype chips peaks at 4.01 tera-operations per second (4.01 trillion operations per second) and 1.29 TOPS per square millimeter. Compared with prior prototypes from other groups using the same precision, the single chip was at least 16 times as area efficient and 1.7 times as energy efficient. But linked together into a 36-chip system, it reached 127.8 TOPS. That’s a 32-fold performance boost. (Admittedly, some of the efficiency comes from not having to handle higher-precision math, certain DRAM issues, and other forms of AI besides convolutional neural nets.)
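Those figures imply nearly linear scaling across the module, which is easy to verify:

```python
# Scaling figures quoted above for Nvidia's inference test chip.
single_chip_tops = 4.01
module_tops = 127.8
chips = 36

speedup = module_tops / single_chip_tops
print(f"{speedup:.1f}x speedup from {chips} chips "
      f"({speedup / chips:.0%} scaling efficiency)")
```

Roughly 32x from 36 chips works out to about 89 percent scaling efficiency, which is what makes the "one technology at every scale" pitch credible.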
With this research, Nvidia is trying to demonstrate that one technology can operate well in all those situations. Or at least it can if the chips are linked together with Nvidia’s mesh network in a multichip module. These modules are essentially small printed circuit boards or slivers of silicon that hold multiple chips in a way that lets them be treated as one large IC. They are becoming increasingly popular, because they allow systems to be composed of several smaller chips, often called chiplets, instead of a single larger and more expensive chip.
“The multichip module option has a lot of advantages not just for future scalable [deep learning] accelerators but for building versions of our products that have accelerators for different functions,” explains Dally.
Key to the Nvidia multichip module’s ability to bind together the new deep learning chips is an interchip network that uses a technology called ground-referenced signaling. As its name implies, GRS uses the difference between a voltage signal on a wire and a common ground to transfer data, while avoiding many of the known pitfalls of that approach. It can transmit 25 gigabits/s using a single wire, whereas most technologies would need a pair of wires to reach that speed. Using single wires boosts how much data you can stream off of each millimeter of the edge of the chip to a whopping terabit per second. What’s more, GRS’s power consumption is a mere picojoule per bit.
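Taken together, those figures imply the following per-millimeter budget. This is a rough reading of the quoted numbers, not Nvidia's own breakdown:

```python
# GRS figures quoted above.
gbps_per_wire = 25   # single-ended signaling: one wire per link
edge_tbps = 1.0      # ~1 terabit/s per millimeter of chip edge
pj_per_bit = 1.0     # ~1 picojoule per bit transferred

wires_per_mm = edge_tbps * 1000 / gbps_per_wire
watts_per_mm = edge_tbps * 1e12 * pj_per_bit * 1e-12  # bits/s * J/bit
print(f"{wires_per_mm:.0f} wires per mm of edge, "
      f"about {watts_per_mm:.1f} W per mm at full rate")
```

Packing 40 working links into each millimeter of edge, for about a watt, is what would be hard to match with conventional differential pairs, which need two wires per link.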
“It’s a technology that we developed to basically give the option of building multichip modules on an organic substrate, as opposed to on a silicon interposer, which is much more expensive technology,” says Dally.
The accelerator chip presented at VLSI is hardly the last word on AI from Nvidia. Dally says they’ve already completed a version that essentially doubles this chip’s TOPS/W. “We believe we can do better than that,” he says. His team aspires to find inference-accelerating techniques that blow past the VLSI prototype’s 9.09 TOPS/W and reach 200 TOPS/W while still being scalable.
Semiconductor industry mavens in the United States anticipate damage from U.S.-China trade policy and call for a national strategy for semiconductor manufacturing
“There is going to be a lot of pain for the semiconductor industry before it normalizes,” says Dan Hutcheson.
“It’s a mess, and it’s going to get a lot worse before it gets better,” says David French.
“If we aren’t going to sell them chips, it is not going to take them long [to catch up to us]; it is going to hurt us,” says Mar Hershenson.
French, Hutcheson, and Hershenson, along with Ann Kim and Pete Rodriguez, were discussing the U.S.-China trade war that escalated last month when the United States placed communications behemoth Huawei on a trade blacklist. All five are semiconductor industry veterans and investors: French is currently chairman of Silicon Power Technology; Hutcheson is CEO of VLSI Research; Hershenson is managing partner of Pear Ventures; Kim is managing director of Silicon Valley Bank’s Frontier Technology Group; and Rodriguez is CEO of startup incubator Silicon Catalyst. The five took the stage at Silicon Catalyst’s second industry forum, held in Santa Clara, Calif., last week to discuss several aspects of the trade war:
A 49-core chip by Georgia Tech uses a 1980s-era algorithm to solve some of today’s toughest optimization problems faster than a GPU
Engineers at Georgia Tech say they’ve come up with a programmable prototype chip that efficiently solves a huge class of optimization problems, including those needed for neural network training, 5G network routing, and MRI image reconstruction. The chip’s architecture embodies a particular algorithm that breaks up one huge problem into many small problems, works on the subproblems, and shares the results. It does this over and over until it comes up with the best answer. Compared to a GPU running the algorithm, the prototype chip—called OPTIMO—is 4.77 times as power efficient and 4.18 times as fast.
The training of machine learning systems and a wide variety of other data-intensive work can be cast as a type of mathematical problem called constrained optimization. In it, you’re trying to minimize the value of a function under some constraints, explains Georgia Tech professor Arijit Raychowdhury. For example, training a neural net could involve seeking the lowest error rate under the constraint of the size of the neural network.
“If you can accelerate [constrained optimization] using smart architecture and energy-efficient design, you will be able to accelerate a large class of signal processing and machine learning problems,” says Raychowdhury. A 1980s-era algorithm called alternating direction method of multipliers, or ADMM, turned out to be the solution. The algorithm solves enormous optimization problems by breaking them up and then reaching a solution over several iterations.
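ADMM's split-solve-share loop can be illustrated with a tiny consensus problem in plain Python. This is a generic textbook instance (minimizing the total squared distance to a set of data points, whose answer is simply their mean), not OPTIMO's formulation:

```python
# Consensus ADMM: one big problem is split into per-point subproblems
# whose results are shared each iteration, exactly the split/solve/share
# pattern described above. Data and step size are illustrative.
data = [3.0, 7.0, 8.0, 10.0]   # each value "owned" by one subproblem
rho = 1.0                       # ADMM penalty parameter

x = [0.0] * len(data)  # local solutions
u = [0.0] * len(data)  # scaled dual variables ("prices")
z = 0.0                # shared consensus value

for _ in range(50):
    # Local step: each subproblem minimizes 0.5*(x_i - a_i)^2 plus a
    # quadratic penalty pulling it toward the shared value.
    x = [(a + rho * (z - ui)) / (1 + rho) for a, ui in zip(data, u)]
    # Sharing step: the consensus value aggregates the local results.
    z = sum(xi + ui for xi, ui in zip(x, u)) / len(data)
    # Dual step: nudge the prices so local and shared values agree.
    u = [ui + xi - z for ui, xi in zip(u, x)]

print(round(z, 4))  # converges to the mean of the data, 7.0
```

The appeal for hardware like OPTIMO is that the local step is embarrassingly parallel, one small core per subproblem, and only the cheap sharing step needs communication.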