Media

Shaping Belief States with Generative Environment Models for RL

 


Diagram of the agent and model. The agent receives observations $x$ from the environment, processes them through a feed-forward residual network (green), and forms a state online using a recurrent network (blue). This state is a belief state: it is used to compute the policy and value, and it serves as the starting point for predictions of the future. Predictions are made with a second recurrent network (orange) – a simulation network (SimCore) that simulates into the future seeing only the actions. The simulated state is used as conditioning for a generative model (red) of a future frame.
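The data flow in the diagram above can be sketched in a few lines of numpy. Everything here is an illustrative stand-in: the layer sizes, the random weight matrices, and the simple tanh updates replace the paper's residual encoder, recurrent belief network, SimCore, and generative decoder, but the wiring (online belief update, then action-only simulation, then conditioned decoding) follows the caption.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_ACT, D_STATE = 16, 4, 32  # illustrative sizes, not the paper's

def encode(x, W):                  # stand-in for the residual encoder (green)
    return np.tanh(W @ x)

def rnn_step(h, inp, Wh, Wi):      # generic recurrent update (blue / orange)
    return np.tanh(Wh @ h + Wi @ inp)

# Random parameters; a real agent learns these.
W_enc = rng.normal(0, 0.1, (D_STATE, D_OBS))
Wh_b = rng.normal(0, 0.1, (D_STATE, D_STATE))
Wi_b = rng.normal(0, 0.1, (D_STATE, D_STATE + D_ACT))
Wh_s = rng.normal(0, 0.1, (D_STATE, D_STATE))
Wi_s = rng.normal(0, 0.1, (D_STATE, D_ACT))
W_dec = rng.normal(0, 0.1, (D_OBS, D_STATE))

# 1) Online belief-state update from observations and actions.
b = np.zeros(D_STATE)
for t in range(5):
    x_t, a_t = rng.normal(size=D_OBS), rng.normal(size=D_ACT)
    e_t = encode(x_t, W_enc)
    b = rnn_step(b, np.concatenate([e_t, a_t]), Wh_b, Wi_b)

# 2) SimCore: roll the simulation RNN forward from the belief state,
#    seeing only future actions, and decode a predicted frame per step.
s = b.copy()
predicted_frames = []
for k in range(3):
    a_future = rng.normal(size=D_ACT)
    s = rnn_step(s, a_future, Wh_s, Wi_s)
    predicted_frames.append(W_dec @ s)  # conditioning for the decoder (red)
```

The belief state `b` doubles as input to the policy/value heads (omitted here) and as the initial state of the simulation network, which is the key structural point of the diagram.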

 

Video: Shaping Belief States with Generative Environment Models for RL

 

arXiv preprint: https://arxiv.org/abs/1906.09237

keywords: Reinforcement Learning, RL, belief-state models, generative models, world models, auxiliary losses, self-supervised learning, SLAM, memory.

Abstract:

When agents interact with a complex environment, they must form and maintain beliefs about the relevant aspects of that environment. We propose a way to efficiently train expressive generative models in complex environments. We show that a predictive algorithm with an expressive generative model can form stable belief-states in visually rich and dynamic 3D environments. More precisely, we show that the learned representation captures the layout of the environment as well as the position and orientation of the agent. Our experiments show that the model substantially improves data-efficiency on a number of reinforcement learning (RL) tasks compared with strong model-free baseline agents. We find that predicting multiple steps into the future (overshooting), in combination with an expressive generative model, is critical for stable representations to emerge. In practice, using expressive generative models in RL is computationally expensive and we propose a scheme to reduce this computational burden, allowing us to build agents that are competitive with model-free baselines.
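The "overshooting" idea from the abstract (predicting multiple steps ahead rather than a single step) can be illustrated with a toy loss. The dynamics, trajectory, and horizon below are all hypothetical; the point is only that a multi-step horizon accumulates prediction error at every offset `k`, not just `k = 1`.

```python
import numpy as np

def model_step(s, a):
    """Toy one-step transition model (illustrative linear dynamics)."""
    return 0.9 * s + 0.1 * a

# A short hypothetical trajectory of states and actions.
states = [np.array([1.0]), np.array([0.95]), np.array([0.9]), np.array([0.86])]
actions = [np.array([0.5])] * 3

def overshoot_loss(states, actions, horizon):
    """Mean squared error of rolling the model up to `horizon` steps
    ahead from every start point (horizon=1 is ordinary one-step loss)."""
    total, count = 0.0, 0
    for t in range(len(states) - 1):
        s = states[t]
        for k in range(1, horizon + 1):
            if t + k >= len(states):
                break
            s = model_step(s, actions[t + k - 1])
            total += float(np.sum((s - states[t + k]) ** 2))
            count += 1
    return total / max(count, 1)

one_step = overshoot_loss(states, actions, horizon=1)
multi_step = overshoot_loss(states, actions, horizon=3)
```

With `horizon=1` the model is only ever graded on single-step predictions; with `horizon=3` compounding errors at longer offsets also enter the loss, which is what the paper argues is needed for stable belief states to emerge.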

Short videos of trained agents playing:

cliffv2.gif

 


 

Generative Query Networks

Generative Query Networks artwork

 

 

Video: Neural Scene Representation and Rendering

Science Article: Neural Scene Representation and Rendering

DeepMind Blog Post: Neural Scene Representation and Rendering

Datasets: Datasets For Neural Scene Representation and Rendering

keywords: 3D scene understanding, vision, scene representation, variational inference, generative models, ConvDRAW, approximate inference, scene uncertainty.

Abstract:

Scene representation—the process of converting visual sensory data into concise descriptions—is a requirement for intelligent behavior. Recent work has shown that neural networks excel at this task when provided with large, labeled datasets. However, removing the reliance on human labeling remains an important open problem. To this end, we introduce the Generative Query Network (GQN), a framework within which machines learn to represent scenes using only their own sensors. The GQN takes as input images of a scene taken from different viewpoints, constructs an internal representation, and uses this representation to predict the appearance of that scene from previously unobserved viewpoints. The GQN demonstrates representation learning without human labels or domain knowledge, paving the way toward machines that autonomously learn to understand the world around them.
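The GQN pipeline described above (encode context views, aggregate them into a scene representation, then predict the image at a query viewpoint) can be sketched with linear stand-ins. The encoder, the sum aggregation, and the decoder below are illustrative simplifications of the paper's networks; the sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_VIEW, D_REP = 8, 3, 16  # illustrative sizes

W_obs = rng.normal(0, 0.1, (D_REP, D_IMG + D_VIEW))  # per-view encoder
W_gen = rng.normal(0, 0.1, (D_IMG, D_REP + D_VIEW))  # query-conditioned decoder

def represent(images, viewpoints):
    """Sum per-view encodings into an order-invariant scene representation."""
    r = np.zeros(D_REP)
    for x, v in zip(images, viewpoints):
        r += np.tanh(W_obs @ np.concatenate([x, v]))
    return r

def render(r, query_viewpoint):
    """Predict the appearance of the scene from an unobserved viewpoint."""
    return W_gen @ np.concatenate([r, query_viewpoint])

images = [rng.normal(size=D_IMG) for _ in range(3)]
viewpoints = [rng.normal(size=D_VIEW) for _ in range(3)]

r = represent(images, viewpoints)
prediction = render(r, rng.normal(size=D_VIEW))
```

Summing the per-view encodings makes the representation invariant to the order in which context views arrive, which is one design property the GQN relies on.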


Unsupervised Learning of 3D Structure from Images

Proposed framework. Left: given an observed volume or image x and contextual information c, we wish to infer a corresponding 3D representation h (which can be a volume or a mesh). This is achieved by modelling the latent manifold of object shapes via the low-dimensional codes z. In experiments we consider unconditional models (i.e., no context), as well as models where the context c is the class or one or more 2D views of the scene. Right: we train a context-conditional inference network (red) and an object model (green). When ground-truth volumes are available, they can be trained on directly. When only ground-truth images are available, a renderer is required to measure the distance between an inferred 3D representation and the ground-truth image.

NIPS 2016 Article: Unsupervised Learning of 3D Structure from Images

keywords: 3D scene understanding, vision, scene representation, variational inference, generative models, ConvDRAW3D, Spatial Transformer, volumetric data, mesh, OpenGL, approximate inference, scene uncertainty.

Abstract:

A key goal of computer vision is to recover the underlying 3D structure from 2D observations of the world. In this paper we learn strong deep generative models of 3D structures, and recover these structures from 3D and 2D images via probabilistic inference. We demonstrate high-quality samples and report log-likelihoods on several datasets, including ShapeNet [2], and establish the first benchmarks in the literature. We also show how these models and their inference networks can be trained end-to-end from 2D images. This demonstrates for the first time the feasibility of learning to infer 3D representations of the world in a purely unsupervised manner.
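The two training regimes in the framework figure (direct supervision from volumes vs. supervision from images via a renderer) can be made concrete with a toy decoder. The latent size, voxel grid, sigmoid decoder, and mean-projection "renderer" below are all hypothetical placeholders for the paper's generative model and OpenGL-based renderer.

```python
import numpy as np

rng = np.random.default_rng(0)
D_Z, V = 4, 6  # latent code size and voxel grid side (illustrative)

W_dec = rng.normal(0, 0.1, (V * V * V, D_Z))

def decode_volume(z):
    """Map a low-dimensional code z to a 3D occupancy volume h."""
    logits = W_dec @ z
    return (1.0 / (1.0 + np.exp(-logits))).reshape(V, V, V)

def project(volume):
    """Toy differentiable 'renderer': orthographic projection along depth."""
    return volume.mean(axis=0)

z = rng.normal(size=D_Z)
h = decode_volume(z)

# Regime 1: ground-truth volumes available -> compare h directly.
target_volume = rng.uniform(size=(V, V, V))
vol_loss = float(np.mean((h - target_volume) ** 2))

# Regime 2: only ground-truth images available -> render h, then
# measure the distance in image space.
target_image = rng.uniform(size=(V, V))
img_loss = float(np.mean((project(h) - target_image) ** 2))
```

Because the projection is a differentiable function of the volume, gradients from the image-space loss can flow back into the decoder, which is what makes training from 2D images alone possible in principle.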