Aäron van den Oord 4 Jun 19
We use a hierarchical VQ-VAE that compresses images into a latent space about 50x smaller for ImageNet and 200x smaller for FFHQ Faces. The PixelCNN models only the latents, allowing it to spend its capacity on global structure and the most perceptible features.
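As a rough illustration of how a compression factor like this can be counted, here is a minimal sketch that compares the bits in a raw image with the bits needed to store the discrete latent indices. The function name `compression_ratio`, the grid sizes, and the codebook size are all hypothetical and are not taken from the post; the example values are not meant to reproduce the 50x or 200x figures.

```python
import math


def compression_ratio(image_shape, latent_grids, codebook_size, bits_per_channel=8):
    """Ratio of raw image bits to the bits needed for the discrete latent codes.

    image_shape:   (height, width, channels) of the input image
    latent_grids:  list of (height, width) grids, one per level of the hierarchy
    codebook_size: number of entries in the codebook (assumed shared by all levels)
    """
    image_bits = math.prod(image_shape) * bits_per_channel
    # Each latent position stores one codebook index.
    bits_per_code = math.ceil(math.log2(codebook_size))
    latent_bits = sum(h * w for h, w in latent_grids) * bits_per_code
    return image_bits / latent_bits


# Hypothetical two-level hierarchy for a 256x256 RGB image (assumed values).
ratio = compression_ratio((256, 256, 3), [(32, 32), (64, 64)], codebook_size=512)
print(f"{ratio:.1f}x smaller")
```

Counting elements instead of bits, or using different grid sizes per level, gives a different factor, so the result depends on the convention chosen.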