This Economist article asserts that huge “foundation models” are turbo-charging AI progress. But they can have abilities their creators did not foresee.
The “Good Computer” which Graphcore, a British chip designer, intends to build over the next few years might seem to be suffering from a ludicrous case of nominal understatement. Its design calls for it to carry out 10^19 calculations per second.
If your laptop can do 100 billion calculations a second, which is fair for an average laptop, then the “Good Computer” will be 100 million times faster. That makes it ten times faster than Frontier, a behemoth at America’s Oak Ridge National Laboratory which came out on top of the most recent “Top500” list of powerful supercomputers and cost $600m. Its four-petabyte memory will hold the equivalent of 2 trillion pages of printed text, or a pile of A4 paper high enough to reach the Moon. “Good” hardly seems to cut it.
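These comparisons are easy to sanity-check. The short Python sketch below uses only the figures quoted above; the names are my own, and it assumes a decimal petabyte (10^15 bytes), which the article does not specify.

```python
# Back-of-envelope check of the figures quoted above.
# Assumption: 1 petabyte = 10**15 bytes (decimal). All names are illustrative.

GOOD_OPS_PER_SEC = 10**19           # Good Computer design target
LAPTOP_OPS_PER_SEC = 100 * 10**9    # "100 billion calculations a second"
MEMORY_BYTES = 4 * 10**15           # four petabytes
PAGES = 2 * 10**12                  # "2 trillion pages of printed text"


def speedup_over_laptop() -> int:
    """How many times faster the Good Computer is than the laptop."""
    return GOOD_OPS_PER_SEC // LAPTOP_OPS_PER_SEC


def implied_frontier_ops_per_sec() -> int:
    """Frontier's speed implied by 'ten times faster than Frontier'."""
    return GOOD_OPS_PER_SEC // 10


def bytes_per_page() -> int:
    """Bytes of memory available per printed page in the comparison."""
    return MEMORY_BYTES // PAGES


if __name__ == "__main__":
    print(f"Speedup over laptop: {speedup_over_laptop():,}x")
    print(f"Implied Frontier speed: {implied_frontier_ops_per_sec():,} ops/sec")
    print(f"Bytes per page: {bytes_per_page():,}")
```

The numbers hang together: 100 million times a laptop, roughly one exaflop for Frontier, and about 2,000 bytes per page, which is plausible for a page of plain text.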
But why the name “Good Computer?” According to the founders, they named it after an influential Englishman (in the same way John Snow Labs chose theirs). Jack Good was a codebreaker in WWII and worked with Alan Turing. After the war, he continued somewhat in obscurity in computer science until 1965, when he wrote “Speculations Concerning the First Ultraintelligent Machine,” which imagined merging machine and human intelligence into devices that are smarter than we are. He also predicted, as many have, that this would be an extinction event for human beings. Graphcore intends for its Good Computer to be the ultraintelligent machine Good envisioned, but hopefully not the end of humanity. That would be a remarkably poor business strategy.
To do this, one has to imagine artificial intelligence (AI) models with an inconceivable number of parameters, the coefficients applied to different calculations within the program. In 2018 Google developed BERT (Bidirectional Encoder Representations from Transformers), a transformer-based machine learning technique for natural language processing, with 100 million parameters. Today's largest language models are already four orders of magnitude larger than BERT, reportedly with over a trillion parameters. The Good Computer’s incredibly ambitious specifications are driven by the desire to run programs with something like 500 trillion parameters.
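To put those parameter counts side by side, here is a minimal Python sketch. It uses only the round figures quoted above (100 million, one trillion, 500 trillion); the constant and function names are my own.

```python
import math

# Round figures quoted in the text; names are illustrative.
BERT_PARAMS = 100 * 10**6           # BERT, 2018
LARGEST_TODAY_PARAMS = 10**12       # "over a trillion", taken as 1 trillion
GOOD_TARGET_PARAMS = 500 * 10**12   # Good Computer target


def orders_of_magnitude(larger: int, smaller: int) -> float:
    """Base-10 logarithm of the ratio between two parameter counts."""
    return math.log10(larger / smaller)


if __name__ == "__main__":
    today_vs_bert = orders_of_magnitude(LARGEST_TODAY_PARAMS, BERT_PARAMS)
    good_vs_bert = orders_of_magnitude(GOOD_TARGET_PARAMS, BERT_PARAMS)
    print(f"Largest models today vs BERT: {today_vs_bert:.1f} orders of magnitude")
    print(f"Good Computer target vs BERT: {good_vs_bert:.1f} orders of magnitude")
    print(f"Good Computer target vs today: {GOOD_TARGET_PARAMS // LARGEST_TODAY_PARAMS}x")
```

On these figures the Good Computer's target sits 500 times above today's largest models, and nearly seven orders of magnitude above BERT.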
Is it worth it? There was heated debate about whether ever-larger models were reaching diminishing returns. BERT put that argument to rest. The larger the model and the more data it is given, the better it gets.
A funny thing about computers is that they can do almost impossible things easily, yet fall flat on the simplest ones. Consider pocket calculators, introduced fifty years ago. They could do arithmetic with ease because they were designed for that task. You might wonder, can a large language model figure out the sum of two integers without any instruction in arithmetic?
How does Graphcore expect to do this? 3D wafer stacking
There is a ton of material on the Graphcore site about how all of this works, especially the “3D stacking” of chips. Designers see this as a way to jam more capacity into a computer by getting components closer together in 3D instead of 2D.
An interesting lesson here is that packaging is emerging as perhaps the most important area of innovation, maybe even more important than transistor innovation. Graphcore has been working with Taiwan Semiconductor Manufacturing Co to get a more even power supply to the IPU and to significantly drop the voltage on its circuits. That pushes up the clock frequency while, somewhat miraculously, burning less power.
Simon Knowles, chief technology officer at Graphcore, says that the SoIC-WoW approach to wafer stacking is a little bit different from the chip-on-wafer stacking that AMD is using with its impending “Milan-X” Epyc 7003 generation CPUs.
Caution: really techy stuff here. “Wafer on wafer is a more sophisticated technology,” he explains.
What it delivers is a much higher interconnect density between dies that are stacked on top of each other. As its name implies, it involves bonding wafers together before they are sawn. In the Bow IPU, we attach a second wafer to our processor wafer, and this second wafer carries a very large number of deep trench capacitor cells. This allows us to smooth the power delivery to the device, which in turn allows us to run the device faster and at lower voltage. So we get both higher performance and a better power efficiency. There are two enabling technologies behind wafer-on-wafer stacking. One is a type of bonding between two pieces of silicon from back end of line to back end of line. And you can think of this as a sort of cold weld between pads on the silicon devices.
There are no interstitial bumps or anything like that. The second technology is a type of through silicon via called the backside through silicon via, or BTSV, and this allows power or signals to be connected through one of the wafers to the other. The effect of these two together is to deliver a very good power supply to the logic transistors on the logic die. This is just the first step for us. We have been working with TSMC to master this technology, and we use it initially to build a better power supply for our processor, but it will go much further than that in the near future.
The human brain has roughly 100 billion neurons that create over 100 trillion synapses. The largest models today, most of which run on GPU clusters, top out at around 1 trillion parameters.
Who knows if such machines will ultimately be used for good or evil (probably both)? For the Good machine to accommodate five times (or more) as many parameters as the human brain has synapses, it will require more than eight thousand of Graphcore’s still-in-development IPUs and ten to twenty exaflops of performance. Imagine over 4 PB of on-chip memory and over 10 PB/sec of memory bandwidth across those IPUs.
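A rough way to see how these numbers relate is sketched below in Python. It takes the quoted figures at face value (500 trillion parameters, 100 trillion synapses, 4 PB of memory, 10 PB/sec of bandwidth) and assumes decimal petabytes; the names are my own, and the derived values are back-of-envelope estimates rather than anything Graphcore has published.

```python
# Back-of-envelope relations between the quoted figures.
# Assumption: 1 PB = 10**15 bytes. All names are illustrative.

TARGET_PARAMS = 500 * 10**12              # "something like 500 trillion parameters"
BRAIN_SYNAPSES = 100 * 10**12             # "over 100 trillion synapses"
MEMORY_BYTES = 4 * 10**15                 # "over 4 PB" of memory
BANDWIDTH_BYTES_PER_SEC = 10 * 10**15     # "over 10 PB/sec" of memory bandwidth


def params_per_synapse() -> float:
    """Model parameters per human synapse."""
    return TARGET_PARAMS / BRAIN_SYNAPSES


def bytes_per_param() -> float:
    """Memory available per parameter if the whole 4 PB held the model."""
    return MEMORY_BYTES / TARGET_PARAMS


def seconds_to_sweep_memory() -> float:
    """Time to read all 4 PB once at the quoted aggregate bandwidth."""
    return MEMORY_BYTES / BANDWIDTH_BYTES_PER_SEC


if __name__ == "__main__":
    print(f"Parameters per synapse: {params_per_synapse():.0f}")
    print(f"Bytes per parameter: {bytes_per_param():.0f}")
    print(f"Seconds to sweep memory once: {seconds_to_sweep_memory():.1f}")
```

Under those assumptions the machine would hold five parameters per synapse, with about eight bytes of memory per parameter, and could read its entire memory in under half a second.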
Cost? Uncertain, but it's anticipated to cost $120 million, which is a lot more expensive than a human brain but a bargain compared to the current class of supercomputers, such as the exascale floating-point machines used to run HPC simulations and models, whose costs can exceed $1 billion all-in.
I’ve been skeptical about this idea of super-intelligent machines and AGI (Artificial General Intelligence). Now I’m just getting worried.