Engineers at Purdue University and at Georgia Tech have constructed the first devices from a new kind of two-dimensional material that combines memory-retaining properties and semiconductor properties. The engineers used a newly discovered ferroelectric semiconductor, alpha indium selenide, in two applications: as the basis of a type of transistor that stores memory as the amount of amplification it produces; and in a two-terminal device that could act as a component in future brain-inspired computers. The latter device was unveiled last month at the IEEE International Electron Devices Meeting in San Francisco.
Ferroelectric materials become polarized in an electric field and retain that polarization even after the field has been removed. Ferroelectric RAM cells in commercial memory chips use the former ability to store data in a capacitor-like structure. Recently, researchers have been trying to coax more tricks from these ferroelectric materials by bringing them into the transistor structure itself or by building other types of devices from them.
In particular, they’ve been embedding ferroelectric materials into a transistor’s gate dielectric, the thin layer that separates the electrode responsible for turning the transistor on and off from the channel through which current flows. Researchers have also been seeking a ferroelectric equivalent of the memristors, or resistive RAM, two-terminal devices that store data as resistance. Such devices, called ferroelectric tunnel junctions, are particularly attractive because they could be made into a very dense memory configuration called a cross-bar array. Many researchers working on neuromorphic- and low-power AI chips use memristors to act as the neural synapses in their networks. But so far, ferroelectric tunnel junction memories have been a problem.
“It’s very difficult to do,” says IEEE Fellow Peide Ye, who led the research at Purdue University. Because traditional ferroelectric materials are insulators, when the device is scaled down, there’s too little current passing through, explains Ye. When researchers try to solve that problem by making the ferroelectric layer very thin, the layer loses its ferroelectric properties.
Instead, Ye’s group sought to solve the conductance problem by using a new ferroelectric material—alpha indium selenide— that acts as a semiconductor instead of an insulator. Under the influence of an electric field, the molecule undergoes a structural change that holds the polarization. Even better, the material is ferroelectric even as a single-molecule layer that is only about a nanometer thick. “This material is very unique,” says Ye.
Ye’s group made both transistors and memristor-like devices using the semiconductor. The memristor-like device, which they called a ferroelectric-semiconductor junction (FSJ), is just the semiconductor sandwiched between two conductors. This simple configuration could be formed into a dense cross-bar array and potentially shrunk down so that each device is only about 10 nanometers across, says Ye.
Proving the ability to scale the device down is the next goal for the research, along with characterizing how quickly the devices can switch, explains Ye. Further on, his team will look at applications for the FSJ in neuromorphic chips, where researchers have been trying a variety of new devices in the search for the perfect artificial neural synapse.
Artificial intelligence today is much less than it could be, according to Andrew Feldman, CEO and cofounder of AI computer startup Cerebras Systems.
The problem, as he and his fellow Cerebras founders see it, is that today’s artificial neural networks are too time-consuming and compute-intensive to train. For, say, a self-driving car to recognize all the important objects it will encounter on the road, the car’s neural network has to be shown many, many images of all those things. That process happens in a data center where computers consuming tens or sometimes hundreds of kilowatts are dedicated to what is too often a weeks-long task. Assuming the resulting network can carry out the task with the needed accuracy, the many coefficients that define the strength of connections in the network are then downloaded to the car’s computer, which performs the other half of deep learning, called inference.
Cerebras’s customers—and it already has some, despite emerging from stealth mode only this past summer—complain that training runs for big neural networks on today’s computers can take as long as six weeks. At that rate, they are able to train only maybe six neural networks in a year. “The idea is to test more ideas,” says Feldman. “If you can [train a network] instead in 2 or 3 hours, you can run thousands of ideas.”
When IEEE Spectrum visited Cerebras’s headquarters in Los Altos, Calif., those customers and some potential new ones were already pouring their training data into four CS-1 computers through orange-jacketed fiber-optic cables. These 64-centimeter-tall machines churned away, while the heat exhaust of the 20 kilowatts being consumed by each blew out into the Silicon Valley streets through a hole cut into the wall.
The CS-1 computers themselves weren’t much to look at from the outside. Indeed, about three-quarters of each chassis is taken up with the cooling system. What’s inside that last quarter is the real revolution: a hugely powerful computer made up almost entirely of a single chip. But that one chip extends over 46,255 square millimeters—more than 50 times the size of any other processor chip you can buy. With 1.2 trillion transistors, 400,000 processor cores, 18 gigabytes of SRAM, and interconnects capable of moving 100 million billion bits per second, Cerebras’s Wafer Scale Engine (WSE) defies easy comparison with other systems.
The statistics Cerebras quotes are pretty astounding. According to the company, a 10-rack TPU2 cluster—the second of what are now three generations of Google AI computers—consumes five times as much power and takes up 30 times as much space to deliver just one-third of the performance of a single computer with the WSE. Whether a single massive chip is really the answer the AI community has been waiting for should start to become clear this year. “The [neural-network] models are becoming more complex,” says Mike Demler, a senior analyst with the Linley Group, in Mountain View, Calif. “Being able to quickly train or retrain is really important.”
Customers such as supercomputing giant Argonne National Laboratory, near Chicago, already have the machines on their premises, and if Cerebras’s conjecture is true, the number of neural networks doing amazing things will explode.
When the founders of Cerebras—veterans of Sea Micro, a server business acquired by AMD—began meeting in 2015, they wanted to build a computer that perfectly fit the nature of modern AI workloads, explains Feldman. Those workloads are defined by a few things: They need to move a lot of data quickly, they need memory that is close to the processing core, and those cores don’t need to work on data that other cores are crunching.
This suggested a few things immediately to the company’s veteran computer architects, including Gary Lauterbach, its chief technical officer. First, they could use thousands and thousands of small cores designed to do the relevant neural-network computations, as opposed to fewer more general-purpose cores. Second, those cores should be linked together with an interconnect scheme that moves data quickly and at low energy. And finally, all the needed data should be on the processor chip, not in separate memory chips.
The need to move data to and from these cores was, in large part, what led to the WSE’s uniqueness. The fastest, lowest-energy way to move data between two cores is to have them on the same silicon substrate. The moment data has to travel from one chip to another, there’s a huge cost in speed and power because distances are longer and the “wires” that carry the signals must be wider and less densely packed.
The drive to keep all communications on silicon, coupled with the desire for small cores and local memory, all pointed to making as big a chip as possible, maybe one as big as a whole silicon wafer. “It wasn’t obvious we could do that, that’s for sure,” says Feldman. But “it was fairly obvious that there were big benefits.”
For decades, engineers had assumed that a wafer-scale chip was a dead end. After all, no less a luminary than the late Gene Amdahl, chief architect of the IBM System/360 mainframe, had tried and failed spectacularly at it with a company called Trilogy Systems. But Lauterbach and Feldman say that any comparison with Amdahl’s attempt is laughably out-of-date. The wafers Amdahl was working with were one-tenth the size of today’s, and features that made up devices on those wafers were 30 times the size of today’s.
More important, Trilogy had no way of handling the inevitable errors that arise in chip manufacturing. Everything else being equal, the likelihood of there being a defect increases as the chip gets larger. If your chip is nearly the size of a sheet of letter-size paper, then you’re pretty much asking for it to have defects.
But Lauterbach saw an architectural solution: Because the workload they were targeting favors having thousands of small, identical cores, it was possible to fit in enough redundant cores to account for the defect-induced failure of even 1 percent of them and still have a very powerful, very large chip.
Of course, Cerebras still had to solve a host of manufacturing issues to build its defect-tolerant giganto chip. For example, photolithography tools are designed to cast their feature-defining patterns onto relatively small rectangles, and to do that over and over. That limitation alone would keep a lot of systems from being built on a single wafer, because of the cost and difficulty of casting different patterns in different places on the wafer.
But the WSE doesn’t require that. It resembles a typical wafer full of the exact same chips, just as you’d ordinarily manufacture. The big challenge was finding a way to link those pseudochips together. Chipmakers leave narrow edges of blank silicon called scribe lines around each chip. The wafer is typically diced up along those lines. Cerebras worked with Taiwan Semiconductor Manufacturing Co. (TSMC) to develop a way to build interconnects across the scribe lines so that the cores in each pseudochip could communicate.
With all communications and memory now on a single slice of silicon, data could zip around unimpeded, producing a core-to-core bandwidth of 1,000 petabits per second and an SRAM-to-core bandwidth of 9 petabytes per second. “It’s not just a little more,” says Feldman. “It’s four orders of magnitude greater bandwidth, because we stay on silicon.”
Scribe-line-crossing interconnects weren’t the only invention needed. Chip-manufacturing hardware had to be modified. Even the software for electronic design automation had to be customized for working on such a big chip. “Every rule and every tool and every manufacturing device was designed to pick up a normal-sized chocolate chip cookie, and [we] delivered something the size of the whole cookie sheet,” says Feldman. “Every single step of the way, we have to invent.”
Wafer-scale integration “has been dismissed for the last 40 years, but of course, it was going to happen sometime,” he says. Now that Cerebras has done it, the door may be open to others. “We think others will seek to partner with us to solve problems outside of AI.”
Indeed, engineers at the University of Illinois and the University of California, Los Angeles, see Cerebras’s chip as a boost to their own wafer-scale computing efforts using a technology called silicon-interconnect fabric [see “Goodbye, Motherboard. Hello, Silicon-Interconnect Fabric,” IEEE Spectrum, October 2019]. “This is a huge validation of the research we’ve been doing,” says the University of Illinois’s Rakesh Kumar. “We like the fact that there is commercial interest in something like this.”
The CS-1 is more than just the WSE chip, of course, but it’s not much more. That’s both by design and necessity. What passes for the motherboard is a power-delivery system that sits above the chip and a water-cooled cold plate below it. Surprisingly enough, it was the power-delivery system that was the biggest challenge in the computer’s development.
The WSE’s 1.2 trillion transistors are designed to operate at about 0.8 volts, pretty standard for a processor. There are so many of them, though, that in all they need 20,000 amperes of current. “Getting 20,000 amps into the wafer without significant voltage drop is quite an engineering challenge—much harder than cooling it or addressing the yield problems,” says Lauterbach.
Power can’t be delivered from the edge of the WSE, because the resistance in the interconnects would drop the voltage to zero long before it reached the middle of the chip. The answer was to deliver it vertically from above. Cerebras designed a fiberglass circuit board holding hundreds of special-purpose chips for power control. One million copper posts bridge the millimeter or so from the fiberglass board to points on the WSE.
Delivering power in this way might seem straightforward, but it isn’t. In operation, the chip, the circuit board, and the cold plate all warm up to the same temperature, but they expand when doing so by different amounts. Copper expands the most, silicon the least, and the fiberglass somewhere in between. Mismatches like this are a headache in normal-size chips because the change can be enough to shear away their connection to a printed circuit board or produce enough stress to break the chip. For a chip the size of the WSE, even a small percentage change in size translates to millimeters.
“The challenge of [coefficient of thermal expansion] mismatch with the motherboard was a brutal problem,” says Lauterbach. Cerebras searched for a material with the right intermediate coefficient of thermal expansion, something between those of silicon and fiberglass. Only that would keep the million power-delivery posts connected. But in the end, the engineers had to invent one themselves, an endeavor that took a year and a half to accomplish.
In 2018, Google, Baidu, and some top academic groups began working on benchmarks that would allow apples-to-apples comparisons among systems. The result, MLPerf, released training benchmarks in May 2018.
According to those benchmarks, the technology for training neural networks has made some huge strides in the last few years. On the ResNet-50 image-classification problem, the Nvidia DGX SuperPOD—essentially a 1,500-GPU supercomputer—finished in 80 seconds. It took 8 hours on Nvidia’s DGX-1 machine (circa 2017) and 25 days using the company’s K80 from 2015.
Cerebras hasn’t released MLPerf results or any other independently verifiable apples-to-apples comparisons. Instead the company prefers to let customers try out the CS-1 using their own neural networks and data.
This approach is not unusual, according to analysts. “Everybody runs their own models that they developed for their own business,” says Karl Freund, an AI analyst at Moor Insights. “That’s the only thing that matters to buyers.”
Early customer Argonne National Labs, for one, has some pretty intense needs. In training a neural network to recognize, in real time, different types of gravitational-wave events, scientists recently used one-quarter of the resources of Argonne’s megawatt-consuming Theta supercomputer, the 28th most powerful system in the world.
Cutting power consumption down to mere kilowatts seems like a key benefit in supercomputing. Unfortunately, Lauterbach doubts that this feature will be much of a selling point in data centers. “While a lot of data centers talk about [conserving] power, when it comes down to it…they don’t care,” he says. “They want performance.” And that’s something a processor nearly the size of a dinner plate can certainly provide.
This article appears in the January 2020 print issue as “Huge Chip Smashes Deep Learning’s Speed Barrier.”
People often think that Moore’s Law is all about making smaller and smaller transistors. But these days, a lot of the difficulty is squeezing in the tangle of interconnects needed to get signals and power to them. Those smaller, more dense interconnects are more resistive, leading to a potential waste of power. At the IEEE International Electron Devices Meeting in December, Arm engineers presented a processor design that demonstrates a way to reduce the density of interconnects and deliver power to chips with less waste.
Compared with the company’s 7-nanometer process, used to make iPhone X processors among other high-end systems, N5 leads to devices that are 15 percent faster and 30 percent more power efficient. It produces logic that is 1.84 times as small as the previous process and produces SRAM cells that are only 0.021 square micrometers, the most compact ever reported, Yeap said.
The process is currently in what’s called risk production—initial customers are taking a risk that it will work for their designs. Yeap reported that initial average SRAM yield was about 80 percent and that yield improvement has been faster for N5 than any other recent process introduction.
Some of that yield improvement is likely due to the use of extreme ultraviolet lithography (EUV). N5 is the first TSMC process designed around EUV. The previous generation was developed first using the established 193-nanometer immersion lithography first, and then when EUV was introduced, some of the most difficult to produce chip features were made with the new technology. Because it uses a 13.5-nanometer light instead of 193-nanometers, EUV can define chip features in one step—compared with three or more steps using 193-nanometer light. With more than 10 EUV layers, N5 is the first new process “in quite a long time” that uses fewer photolithography masks than its predecessor, Yeap said.
Part of the performance enhancement comes from the inclusion, for the first time in TSMC’s process, of a “high-mobility channel”. Charge carrier mobility is the speed with which current moves through the transistor, and therefore limits how quickly the device can switch. Asked (several times) about the makeup of the high-mobility channel, Yeap declined to offer details. “Those who know, know,” he said, prompting laughter from the audience. TSMC and others have explored germanium-based channels in the past. And earlier in the day, Intel showed a 3D process with silicon NMOS on the bottom and a layer of germanium PMOS above it.
Yeap would not even be tied down on which type of transistor, NMOS or PMOS or both, had the enhanced channel. However, the latter is probably not very mysterious. Holes generally travel more slowly through silicon devices than electrons and therefore the PMOS devices would benefit from enhanced mobility. When pressed Yeap confirmed that only one variety of device had the high-mobility channel.
At the IEEE International Electron Devices Meeting in San Francisco this week, Intel is unveiling a cryogenic chip designed to accelerate the development of the quantum computers they are building with Delft University’s QuTech research group. The chip, called Horse Ridge for one of the coldest spots in Oregon, uses specially-designed transistors to provide microwave control signals to Intel’s quantum computing chips.
The quantum computer chips in development at IBM, Google, Intel, and other firms today operate at fractions of a degree above absolute zero and must be kept inside a dilution refrigerator. However, as companies have managed to increase the number of quantum bits (qubits) in the chips, and therefore the chips’ capacity to compute, they’ve begun to run into a problem. Each qubit needs its own set of wires leading to control and readout systems outside of the cryogenic container. It’s already getting crowded and as quantum computers continue to scale—Intel’s is up to 49 qubits now—there soon won’t be enough room for the wires.
At Supercomputing 2019 in Denver, Colo., Cerebras Systems unveiled the computer powered by the world’s biggest chip. Cerebras says the computer, the CS-1, has the equivalent machine learning capabilities of hundreds of racks worth of GPU-based computers consuming hundreds of kilowatts, but it takes up only one-third of a standard rack and consumes about 17 kW. Argonne National Labs, future home of what’s expected to be the United States’ first exascale supercomputer, says it has already deployed a CS-1. Argonne is one of two announced U.S. National Laboratories customers for Cerebras, the other being Lawrence Livermore National Laboratory.
The U.S. is investing in upgrades to the fabrication facility that makes the radiation-hardened chips for its nuclear arsenal. It is also spending up to US $170-million to enhance the capabilities of SkyWater Technology Foundry, in Bloomington, Minn., in part to improve the company’s radiation-hardened-chip line for other Defense Department needs.
Scientists and engineers in Switzerland and California have come up with a technique that can reveal the 3D design of a modern microprocessor without destroying it.
Typically today, such reverse engineering is a time-consuming process that involves painstakingly removing each of a chip’s many nanometers-thick interconnect layers and mapping them using a hierarchy of different imaging techniques, from optical microscopy for the larger features to electron microscopy for the tiniest features.
The inventors of the new technique, called ptychographic X-ray laminography, say it could be used by integrated circuit designers to verify that manufactured chips match their designs, or by government agencies concerned about “kill switches” or hardware trojans that could have secretly been added to ICs they depend on.
“It’s the only approach to non-destructive reverse engineering of electronic chips—[and] not just reverse engineering but assurance that chips are manufactured according to design,” says Anthony F. J. Levi, professor of electrical and computer engineering at University of Southern California, who led the California side of the team. “You can identify the foundry, aspects of the design, who did the design. It’s like a fingerprint.”
Silicon Valley-based D2S revealed last week that it had solved the last problem in a nascent technique called inverse lithography technology, or ILT. The breakthrough could speed the process of making chips and allow semiconductor fabs to produce more advanced chips without upgrading equipment. The solution, a custom-built computer system, reduces the amount of time needed for a critical step from several weeks to a single day.
In most of the photolithography used to make today’s microchips, light with a wavelength of 193-nanometers is shown through lenses and a patterned photomask, so that the pattern is shrunk down and projected onto the silicon wafer where it defines device and circuit features. (The most modern chip making technology, extreme ultraviolet lithography, works a bit differently. But, only a few chipmakers have these tools.)
Ayer, Mass.-based American Superconductor Corporation (AMSC) will build a generator based on high-temperature superconductors (HTS), which lose their resistance at around 77 Kelvins. General Electric Research, in Niskayuna, N.Y. will develop a generator using low-temperature superconductors, leveraging its experience using liquid helium to cool the materials below 10 K in MRI machines.
There are a lot of different types of quantum computers. Arguably, none of them are ready to make a difference in the real world. But some startups are betting that they’re getting so close that it’s time to make it easy for regular software developers to take advantage of these machines. Boston-based Aliro Technologies is one such startup.
Aliro emerged from stealth mode today, revealing that it had attracted US $2.7 million from investors that include Crosslink Ventures, Flybridge Capital Partners, and Samsung NEXT’s Q Fund. The company was founded by Harvard assistant professor of computational materials science, Prineha Narang, along with two of her students, and a post-doctoral researcher.
Aliro plans a software stack that will allow ordinary developers to first determine whether available cloud-based quantum hardware can speed any of their processes better than other accelerators, such as GPUs. And then it will allow them to write code that takes advantage of that speedup.
Multifunctional metal-oxide semiconductor used to build flexible RRAM, transistors, and sensors
Engineers at the University of Houston are trying to make the melding of humans and machines a little easier on the humans. They’ve developed an easy-to-manufacture flexible electronics patch that, when attached to a human, translates the person’s motion and other commands to a robot and receives temperature feedback from the robot.
Led by University of Houston assistant professor Cunjiang Yu, the team developed transistors, RRAM memory cells, strain sensors, UV-light detectors, temperature sensors, and heaters all using the same set of materials in a low-temperature manufacturing process. They integrated the different devices into a 4-micrometer-thick adhesive plastic patch.
A paper describing the Houston researchers’ work appears this week in Science Advances.
With the patch on the back of a volunteer’s hand, the researchers were able to control a robot hand—causing it to close or open according to what the human’s hand motion did to the patch’s strain sensors. What’s more, they were able to close the human-robot control loop by providing temperature feedback from the robotic hand to the human one using the patch’s integrated heater circuits.
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.