It may be cold comfort to us humans, but children still outshine machines when learning about the world.
Taking inspiration from infants, scientists at Google's DeepMind have created a system that can start to teach itself to see and understand a spatial environment.
The research, published today in Science, is an important step for computer vision, a field that aims to give machines the power to analyse images and video much as a human can.
So far, this type of work has relied mostly on giving machine learning systems a big helping hand.
Often this involves training them with tens of thousands of labelled images — "this is a car", "this is a ball", "this is a table".
The DeepMind system, by contrast, figures out by itself the best way to interpret a scene.
Dubbed a Generative Query Network (GQN), the system gets rid of labels and focuses on what's known as unsupervised learning.
In the GQN, two networks play a "cooperative game".
The first network looks at an image and tries to describe it in the most concise way possible.
The second network takes that description and tries to predict what the scene would look like from a different viewpoint.
When an incorrect prediction is made, the two networks update themselves to reduce the chance of making the same mistake in the future.
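For readers who want to see the idea in miniature, the sketch below mimics that cooperative game in plain Python. It is an illustration only, not DeepMind's code: the "scene" is just two hidden numbers, the "image" is a single value that depends on the viewing angle, and each "network" is a small 2x2 matrix. The names `render`, `predict`, `train` and `run_demo` are hypothetical, and the real GQN uses deep neural networks and a probabilistic generator on rendered 3D scenes.

```python
import math
import random


def render(scene, angle):
    """Toy 'camera': a one-pixel image of a two-number scene seen from `angle`."""
    return scene[0] * math.cos(angle) + scene[1] * math.sin(angle)


def matvec(m, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def predict(A, B, context, query_angle):
    r = matvec(A, context)  # network 1: compress the observations into a description
    h = matvec(B, r)        # network 2: decode that description ...
    u = [math.cos(query_angle), math.sin(query_angle)]
    return h[0] * u[0] + h[1] * u[1], r, u  # ... and predict the unseen view


def sample_task(rng):
    """A random scene, two observed views of it, and one held-back query view."""
    scene = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    context = [render(scene, 0.0), render(scene, math.pi / 2)]
    query_angle = rng.uniform(0, 2 * math.pi)
    return context, query_angle, render(scene, query_angle)


def init_matrix(rng):
    # Assumption: start near 0.5 * identity plus noise so this tiny toy trains quickly.
    return [[(0.5 if i == j else 0.0) + rng.uniform(-0.1, 0.1) for j in range(2)]
            for i in range(2)]


def train(A, B, rng, steps, lr):
    """No labels: the only training signal is the error on the predicted view."""
    for _ in range(steps):
        context, q, target = sample_task(rng)
        pred, r, u = predict(A, B, context, q)
        err = pred - target
        # B^T u, computed before B changes, is needed for network 1's gradient.
        bt_u = [B[0][0] * u[0] + B[1][0] * u[1], B[0][1] * u[0] + B[1][1] * u[1]]
        for i in range(2):
            for j in range(2):
                # Both networks are nudged together to shrink the squared error.
                B[i][j] -= lr * 2 * err * u[i] * r[j]
                A[i][j] -= lr * 2 * err * bt_u[i] * context[j]


def mean_error(A, B, rng, n=200):
    """Average squared prediction error over n freshly sampled scenes."""
    total = 0.0
    for _ in range(n):
        context, q, target = sample_task(rng)
        total += (predict(A, B, context, q)[0] - target) ** 2
    return total / n


def run_demo(steps=5000, lr=0.05, seed=0):
    rng = random.Random(seed)
    A, B = init_matrix(rng), init_matrix(rng)
    before = mean_error(A, B, random.Random(seed + 1))
    train(A, B, rng, steps, lr)
    after = mean_error(A, B, random.Random(seed + 1))
    return before, after


if __name__ == "__main__":
    before, after = run_demo()
    print(f"error before training: {before:.4f}, after: {after:.6f}")
```

Neither matrix is ever told what the hidden scene numbers are. They are corrected only when the predicted view turns out to be wrong, which is the sense in which the two networks "cooperate".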
The team tested the network on rooms with multiple objects, as well as in maze-like environments.
Michael Milford, a robotics professor at Queensland University of Technology who was not involved with the study (but who has worked with Google DeepMind), said the network had significant potential.
Computer vision research often focuses on creating algorithms that can perform specialised tasks, he explained: controlling self-driving cars or homecare robots, for instance.
A system that can learn for itself how to represent the world could be much more flexible. But, Dr Milford pointed out, the research has so far only been done in the lab and looked at simple 3D scenes. Whether it can survive the real world is a question for a later date.
"When you take them out into the real world, it's dark, it's raining, there's dust in the air, there's people running around in front of the autonomous car. Obviously, it's a lot tougher," he said.
The GQN avoids using labelled images to learn about the world, so it may not immediately know the names of certain objects or interpret them in a way that humans understand — this is where babies come in.
"Think of how an infant might interact with a laptop to begin with," said DeepMind researcher Ali Eslami, the study's lead author.
"First it might think, 'that is just a collection of boxes' … but over time, if it sees the laptop enough … it will figure out that actually this thing, this configuration of boxes, appears to happen quite a lot, and maybe it's a concept of its own."
This is when a parent might step in, and explain that humans call that thing a "laptop".
This way of learning about a scene is not only much more efficient, but it also has the potential to create systems that have a much deeper understanding of the world, according to the researchers.
Traditional machine learning systems are not typically able to answer "what if" questions, DeepMind researcher Danilo Rezende explained.
They just answer a very specific question: Is there a person in the image? Is there a dog in the image?
"It cannot answer: 'What if I change something in the image?' or 'What if I see from this other perspective, what should I see?'" he added.
The GQN could help build machine learning systems that can answer this type of question. But if machines are teaching themselves, we face new challenges, Dr Milford said.
It becomes harder to understand the internal processes of these complicated systems — how is the system perceiving the world? Why is it making the choices it does?
"Even if you're not telling it what to do, it is going to learn different views of the world, so we just need to be conscious of that, especially as they get more and more complex."