We may well be very close to the rise of the first proto-AGIs.
DeepMind has finally started catching up to OpenAI's work, and may have even surpassed them in a few areas. OpenAI isn't the most advanced group, nor do they have the best minds and researchers. DeepMind snagged all the real quality. It was a quirk of fate that OpenAI chased after a far more fruitful methodology while DeepMind was left running after what seem to be dead ends of the 2010s.
I've long guessed that, as soon as DeepMind accepted they had lost the lead and followed the exciting path of large language modeling and world-knowledge modeling, they'd easily surpass OpenAI and perhaps reach general AI.
[2106.13884] Multimodal Few-Shot Learning with Frozen Language Models
Starspawn0's comments: DeepMind. I think I might have posted the tweet thread to this before. It's *amazing*. It's like the kind of thing you would expect from GPT-4 -- super-fast / few-shot learning of new visual-and-text combined skills.
So what do we expect from GPT-4?? We might expect it to have few-shot capability, whereby you can show it an image, and then teach it a new task on-the-fly. For example, maybe it's an analogy task: {image} is to X as Y is to ....? [fill in the blank], and it quickly learns to output Z (where Z is the correct answer). Or, you can maybe teach it to play chess -- you show it a board, and say, "white to move," and it gives a decent move. Maybe you need to give it a few examples, first, so that it gets the idea of what you want it to do -- just like the few-shot learning in GPT-3; except here it's with text and images combined.
What's missing is the image synthesis. That's what OpenAI's DALL-E is all about. If you combine what DALL-E can deliver with the model in this DeepMind paper, and then scale it way, way up, you'll have something mind-blowing. So, take that chess example: instead of you always supplying the board for it to decide the next move, it could also generate the board! A sufficiently powerful version of this would literally allow you to create a chess game on-the-fly, just by giving it a few examples. You could even make up a whole new board game, and teach it how to play with some examples, and then it would maybe do a passable, amateur-level job as your opponent -- and would even generate subsequent game boards for you.
Just think of the business applications. You could show it some graphs and ask if there is anything that "stands out", and it might generate a paragraph or two -- and it would use its world-knowledge about other companies, industries, supply chains, and so on, to give a plausible answer.
Or maybe you're a student in a chemistry class. You took some hand-written notes about some of the molecules the teacher drew on the board. You could show it one of your drawings, and ask it some questions about it. Maybe you made a mistake, and you ask it to correct the drawing -- and it will do that, similar to doing "grammar correction".
Addendum: Take a look at the example in Figure 1. It's amazing that it knew to map Macaulay Culkin's scream pose to a scream emoji. Look also at Figure 4 -- learns on the fly.
I haven't read it through that deeply yet, but it doesn't seem they are revealing what language model they used -- I could be totally wrong, though. They say, on page 13 in A.2:
The pretrained transformer language model we used has a GPT-like architecture [29]. It consists of a series of identical residual layers, each comprised of a self-attention operation followed by a positionwise MLP. The only deviation from the architecture described as GPT-2 is the use of relative position encodings [36]. Our seven billion parameter configuration used 32 layers, with each hidden layer having a channel dimensionality of 4096 hidden units. The attention operations use 32 heads each with key/value size dimensionality of 128, and the hidden layer of each MLP had 16384 hidden units. The 400 million parameter configuration used 12 layers, 12 heads, hidden dimensionality of 1536, and 6144 units in the MLP hidden layers.
They trained their own GPT-2??
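As a quick sanity check on the sizes quoted from A.2, here is a rough back-of-the-envelope parameter count. This is only a sketch under stated assumptions, not anything from the paper: it counts just the attention projections (Q, K, V, output) and the two MLP matrices per layer, ignores biases, layer norms, and the relative position encoding parameters, and leaves the token embedding out by default because the vocabulary size isn't given in the quoted passage. The function names are made up for illustration, and the key/value size of 128 for the smaller configuration is my assumption (1536 / 12), since the quote only states it for the larger one.

```python
def block_params(d_model, n_heads, d_head, d_mlp):
    """Approximate weight count for one residual layer (biases/norms ignored)."""
    attn = 4 * d_model * n_heads * d_head  # Q, K, V and output projections
    mlp = 2 * d_model * d_mlp              # up-projection and down-projection
    return attn + mlp


def estimate_params(n_layers, d_model, n_heads, d_head, d_mlp, vocab_size=0):
    """Rough total; vocab_size is a hypothetical input, not from the quoted text."""
    embedding = vocab_size * d_model
    return n_layers * block_params(d_model, n_heads, d_head, d_mlp) + embedding


# "Seven billion" configuration: 32 layers, 4096 hidden, 32 heads x 128, MLP 16384.
big = estimate_params(n_layers=32, d_model=4096, n_heads=32, d_head=128, d_mlp=16384)

# "400 million" configuration: 12 layers, 12 heads, 1536 hidden, MLP 6144.
# d_head=128 is assumed here (1536 / 12); the quote does not state it.
small = estimate_params(n_layers=12, d_model=1536, n_heads=12, d_head=128, d_mlp=6144)

print(f"large config: ~{big / 1e9:.2f}B weights in the layers")
print(f"small config: ~{small / 1e6:.0f}M weights in the layers")
```

Under these assumptions the layers alone come to roughly 6.44B and 340M weights, so the "seven billion" and "400 million" labels look like rounded totals once embeddings and the omitted terms are added back in.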