Implicit Image Compression
CS766 Project, Spring 2021
- Megh Doshi
- Varun Sundar
- Zachary Huemann
University of Wisconsin-Madison
Abstract & Method
Implicit neural networks, being continuous mappings, can serve as a compelling choice for representing a variety of commonly encountered 2D and 3D signals. In this project, we specifically consider the task of image compression via implicit networks. Owing to the over-parameterized nature of deep networks, an MLP may require more parameters than there are samples in the original signal. Furthermore, the capacity of such networks often saturates with increasing width or depth. We explore two related directions: (a) efficiently increasing the capacity of implicit MLPs to fit natural images, and (b) reducing the storage requirement of such networks through a combination of sparsity, quantization and entropy coding.
Motivation & Goals
Much of the recent success in a variety of computer vision (and graphics) problems has been attributed to implicitly defined representations parameterized by neural networks (typically an MLP). These include works on novel viewpoint rendering (Mildenhall et al., 2020; Martin-Brualla et al., 2020), image stabilization (Liu et al., 2021) and view-consistent image generation (Schwarz et al., 2020; Chan et al., 2020). Such MLPs replace traditional grid-based representations, and map low-dimensional coordinates to output quantities such as pixel intensities or densities. Their inherently continuous and differentiable nature makes these representations a compelling choice. Additionally, in the particular case of 3D points, such networks are often much more compact than grid-based representations. Following Tancik et al. (2020), we shall refer to such neural networks as “coordinate MLPs”.
In this work, we examine whether these benefits carry over to the simpler 2D case of images. We consider applying coordinate MLPs to the task of lossy image compression by mapping 2D grid locations (x, y) ∈ [0, 1]^2 to RGB intensities. By fitting an MLP, we transfer the task of compressing a grid of pixels to compressing the corresponding network’s weights. The representation is no longer limited by the grid resolution but by the underlying network architecture. This can be challenging, however, since deep networks often have more parameters than there are data points. On the other hand, a wide body of research addresses the efficient storage and inference of deep networks, although it generally targets high-dimensional mapping tasks such as image classification.
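To make the mapping concrete, the following minimal sketch builds the (coordinate, colour) training pairs that a coordinate MLP would be fitted to. The helper name `make_coordinate_dataset` and the toy image are illustrative assumptions, not part of the project code.

```python
import numpy as np

def make_coordinate_dataset(image):
    """Turn an (H, W, C) image into per-pixel (x, y) -> colour training pairs."""
    height, width, channels = image.shape
    # Normalised pixel-centre positions spanning [0, 1] along each axis.
    ys = np.linspace(0.0, 1.0, height)
    xs = np.linspace(0.0, 1.0, width)
    grid_y, grid_x = np.meshgrid(ys, xs, indexing="ij")
    # One row per pixel, in row-major order: column 0 is x, column 1 is y.
    coords = np.stack([grid_x, grid_y], axis=-1).reshape(-1, 2)
    targets = image.reshape(-1, channels)
    return coords.astype(np.float32), targets.astype(np.float32)

# Toy stand-in for a real image (assumption: intensities already in [0, 1]).
toy_image = np.random.default_rng(0).random((4, 6, 3))
coords, targets = make_coordinate_dataset(toy_image)
print(coords.shape, targets.shape)  # (24, 2) (24, 3)
```

Once the network is trained on these pairs, the image is recovered by evaluating it on the same grid, or on a denser one, since the representation is not tied to the original resolution.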
Another challenge associated with coordinate MLPs is their diminishing increase in capacity with growing layer width and depth. This makes representing signals which are densely sampled (large resolution) or with finer detail difficult. Rebain et al. (2020) tackle this issue for the case of 3D points by decomposing the scene into soft Voronoi diagrams and dedicating smaller networks for each part. For images, frequency domain and wavelet decompositions are potential candidates to achieve similar workarounds.
Target Questions
- Given a target image to fit, how can we train and efficiently store coordinate MLPs? Since image quality is usually sacrificed for storage space, we are interested in exploring the trade-off space for coordinate MLPs and comparing them to conventionally used image compression algorithms such as JPEG.
- For a fixed number of network parameters, can image decomposition help overcome the diminishing returns of naively scaling coordinate MLPs? Valuable insights could include understanding when such decompositions are useful and the range of image resolutions that can be represented.
Approach and Implementation
As illustrated in Figure 1, our pipeline first fits a coordinate MLP to an image, either directly or indirectly via a synthesis equation. We then prune and quantize this network, before storing its weights as a compressed sparse array. Although represented sequentially, we may choose to perform some of these steps jointly with training, e.g. directly train sparse MLPs instead of pruning post-training. Given the rich body of literature in training neural networks efficiently, we base our design choices on empirical evaluation presented in this section.
We use three 16-bit, uncompressed images from the Image Compression Benchmark: flowerfoveon, bigbuilding and bridge (Figure 2). In Figure 3, we present the results of JPEG compression on the three images. We train coordinate MLPs for 10,000 gradient steps each, using the Adam optimizer (Loshchilov and Hutter, 2019) with MSE loss as the objective. Where possible, we set the batch size equal to the total number of pixels, which corresponds to full-batch gradient descent. All our experiments are conducted on an NVIDIA GTX-1080 GPU with 8GB of VRAM.
Network Architecture. We compare two recently proposed architectures for enabling MLPs to better represent high-frequency detail in low-dimensional problems: SIREN (Sitzmann et al., 2020) and Fourier Features (Tancik et al., 2020). While SIREN uses sinusoidal activation functions with a particular weight initialization, Fourier Features (abbreviated here as FFNet) uses random Fourier embeddings (Rahimi and Recht, 2008) to increase the input dimension prior to a ReLU MLP. As seen in Figure 4, for a given number of parameters, SIREN significantly outperforms FFNet in image fitting.
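The two input/activation schemes can be sketched in a few lines of NumPy. This is a forward-pass illustration only: the frequency scale `w0 = 30`, the zero bias initialization, the embedding scale of 10 and all function names are assumptions made for the example, not settings reported in this project.

```python
import numpy as np

def siren_layer_init(rng, fan_in, fan_out, w0=30.0, is_first=False):
    """Uniform SIREN-style initialisation (Sitzmann et al., 2020)."""
    # First layer: U(-1/n, 1/n); later layers: U(-sqrt(6/n)/w0, sqrt(6/n)/w0).
    bound = 1.0 / fan_in if is_first else np.sqrt(6.0 / fan_in) / w0
    weight = rng.uniform(-bound, bound, size=(fan_in, fan_out))
    bias = np.zeros(fan_out)  # simplification assumed for this sketch
    return weight, bias

def siren_layer(x, weight, bias, w0=30.0):
    """Linear map followed by a scaled sinusoidal activation."""
    return np.sin(w0 * (x @ weight + bias))

def fourier_features(x, b_matrix):
    """Random Fourier embedding: [cos(2*pi*x B), sin(2*pi*x B)]."""
    proj = 2.0 * np.pi * (x @ b_matrix)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)

rng = np.random.default_rng(0)
xy = rng.random((5, 2))  # five (x, y) coordinates in [0, 1]^2

weight, bias = siren_layer_init(rng, 2, 128, is_first=True)
hidden = siren_layer(xy, weight, bias)  # shape (5, 128)

b_matrix = rng.normal(0.0, 10.0, size=(2, 64))  # Gaussian frequencies
embedded = fourier_features(xy, b_matrix)  # shape (5, 128), fed to a ReLU MLP
```

In the SIREN case the sinusoid is applied at every hidden layer, whereas FFNet applies the embedding once at the input and uses ordinary ReLU layers afterwards.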
By considering the maximum PSNR (equivalently, PSNR at quality 100) obtained by JPEG, we choose a SIREN network with depth 8 (or 6 hidden layers) and width 128 as our baseline MLP. We also observe that the minimal architecture that outperforms JPEG in PSNR can differ across images: a hidden layer width of 256 units is better suited for the bigbuilding and bridge images. Table 1 summarises the architecture, performance and storage space for the chosen baselines. The encoding time required per image is around 20 minutes (10,000 steps), while decoding is much faster, at around 30 milliseconds; both are measured on a GPU.
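As a rough guide to the storage these baselines imply, the sketch below counts weights and biases for a fully connected network. It assumes that "depth 8 (or 6 hidden layers)" means eight linear layers in total: one input layer, six hidden-to-hidden layers and one output layer, with 2 inputs and 3 outputs. That reading, and the helper name, are assumptions for illustration; Table 1 remains the authoritative source.

```python
def count_mlp_parameters(in_features, hidden_width, num_hidden_layers, out_features):
    """Count weights and biases of a fully connected MLP."""
    # Layer sizes: input -> hidden (x (num_hidden_layers + 1)) -> output.
    sizes = [in_features] + [hidden_width] * (num_hidden_layers + 1) + [out_features]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

params_128 = count_mlp_parameters(2, 128, 6, 3)
params_256 = count_mlp_parameters(2, 256, 6, 3)

# Dense storage at full precision (float32 = 4 bytes per parameter).
bytes_128 = params_128 * 4
bytes_256 = params_256 * 4
print(params_128, params_256)
```

Under this reading the width-128 network has roughly 100k parameters and the width-256 network roughly four times as many, which is why reducing the parameter count is the first step towards competitive file sizes.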
We compare four different techniques to reduce parameter count, viz., Small-Dense, Feathermap, RigL and Pruning. Small-Dense involves reducing the hidden-layer width commensurately to achieve a target parameter count. Feathermap is a recently proposed structured hashing technique that represents all the weights and biases of the MLP by a single matrix and then stores it via low-rank decomposition. In our experiments, Feathermap drastically hurts the representation power of the underlying SIREN network. Pruning here refers to iterative pruning (Zhu and Gupta, 2018), where low-magnitude weights are gradually removed from a fully-connected MLP until the desired sparsity is achieved. RigL (Evci et al., 2020) instead directly trains sparse networks from scratch, with periodic growth, pruning and redistribution steps. Overall, we find RigL to be the best approach for lowering parameter count without significant PSNR loss (Figure 5).
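Of the four techniques, iterative magnitude pruning is the simplest to illustrate. The sketch below shows the two ingredients: a cubic sparsity schedule in the style of Zhu and Gupta (2018) and a mask that drops the smallest-magnitude weights. The function names and the toy weight matrix are assumptions for this example; in practice the mask would be recomputed periodically during training, with gradient steps in between.

```python
import numpy as np

def sparsity_schedule(step, total_steps, final_sparsity, initial_sparsity=0.0):
    """Cubic ramp from initial_sparsity to final_sparsity over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - frac) ** 3

def magnitude_mask(weight, sparsity):
    """Boolean mask that keeps the largest-magnitude entries of `weight`."""
    num_pruned = int(round(sparsity * weight.size))
    if num_pruned == 0:
        return np.ones(weight.shape, dtype=bool)
    magnitudes = np.abs(weight)
    # Magnitude of the num_pruned-th smallest entry; everything at or below goes.
    threshold = np.partition(magnitudes.ravel(), num_pruned - 1)[num_pruned - 1]
    return magnitudes > threshold

rng = np.random.default_rng(0)
weight = rng.normal(size=(64, 64))  # toy stand-in for one hidden layer

for step in (0, 2500, 5000, 10000):
    target = sparsity_schedule(step, total_steps=10000, final_sparsity=0.9)
    mask = magnitude_mask(weight, target)
    print(step, round(target, 3), int(mask.sum()))

pruned_weight = weight * mask  # mask from the final step (90% sparsity)
```

The schedule prunes aggressively early on, when many weights are redundant, and slows down as the target sparsity is approached.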
Low bit-rate via extreme sparsity. We qualitatively illustrate the benefit of using sparse coordinate MLPs to achieve compression. As before, we use RigL to directly train sparse networks, but at much higher sparsity rates: 90% and 95%, corresponding to a parameter count reduction of 10× and 20× respectively. By shifting to int8 quantization, which reduces the bits required per weight by 4× relative to float32, we can evaluate the theoretical bit-rates. As seen in Figure 6, even at high compression ratios, our approach retains most of the visual structure and does not suffer from block artefacts.
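The arithmetic behind these theoretical bit-rates can be written out directly. This sketch counts only the bits of the surviving weights and ignores the cost of storing their positions, so it is an optimistic bound. The parameter count and image size in the example are hypothetical values chosen for illustration.

```python
def compression_factor(sparsity, dense_bits=32, quantized_bits=8):
    """Reduction in weight storage relative to a dense full-precision network."""
    return (dense_bits / quantized_bits) / (1.0 - sparsity)

def bits_per_pixel(num_params, sparsity, bits_per_weight, height, width):
    """Theoretical bit-rate, ignoring the cost of storing sparse indices."""
    nonzero = num_params * (1.0 - sparsity)
    return nonzero * bits_per_weight / (height * width)

# 90% and 95% sparsity combined with int8 weights.
print(compression_factor(0.90))  # about 40x smaller than dense float32
print(compression_factor(0.95))  # about 80x smaller than dense float32

# Hypothetical example: 100,000 parameters fitted to a 1000 x 1000 image.
print(bits_per_pixel(100_000, 0.90, 8, 1000, 1000))
```

The 10× and 20× reductions from sparsity multiply with the 4× reduction from quantization, giving roughly 40× and 80× less weight storage than a dense float32 network.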
Wavelet Fitting. We attempt to increase the representation capacity of coordinate MLPs via wavelet decomposition. We use the Daubechies-3 wavelet to decompose an image and fit one MLP to the low-frequency components and another to the high-frequency components, with both jointly optimized from scratch. For RGB images, we predict low-frequency outputs in the YCbCr space and then simply upsample the chroma components. Unfortunately, this approach does not confer any benefit over directly fitting an image (Figure 7).
Weight Quantization. Training with half precision (float16) or with surrogate quantization modules that simulate int8 precision can reduce the possible performance drop due to post-training quantization, while still maintaining full precision (float32) PSNR. Amongst post-training quantization techniques, we consider k-means or centroid-based clustering, range-based quantization and distribution-based quantization. Han et al. (2015) find that in the image-classification domain, fully-connected layers can be represented with just 5 bits, although we expect more bits (6-8) to be required for the significantly harder image-fitting problem. Yet another alternative is the SZ lossy compression algorithm (Di and Cappello, 2016), although this is harder to implement, as it involves many carefully crafted stages and is not widely supported.
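Range-based quantization is the easiest of the post-training options to sketch: the weight range is divided into evenly spaced levels and each weight is replaced by the index of its nearest level. The version below is a per-tensor, post-training illustration with assumed function names; it is not the project's quantization code and covers neither the k-means nor the distribution-based variants.

```python
import numpy as np

def quantize_range(weights, num_bits=8):
    """Map weights to integer codes on a uniform grid over [min, max]."""
    assert num_bits <= 8, "codes are stored as uint8 in this sketch"
    lo, hi = float(weights.min()), float(weights.max())
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((weights - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_range(codes, scale, lo):
    """Recover approximate weights from integer codes."""
    return codes.astype(np.float64) * scale + lo

rng = np.random.default_rng(0)
weights = rng.uniform(-0.2, 0.2, size=(128, 128))  # toy weight matrix

codes, scale, lo = quantize_range(weights, num_bits=8)
restored = dequantize_range(codes, scale, lo)
max_error = np.abs(restored - weights).max()  # bounded by scale / 2
```

Only the codes plus two scalars (`scale` and `lo`) need to be stored per tensor. Lowering `num_bits` coarsens the grid and doubles the worst-case error with each bit removed.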
Entropy Coding. After pruning and quantization, we are left with a set of sparse matrices that need to be stored efficiently. Since no pruning is performed on the bias vectors, these can be represented as dense arrays. The stored matrices can then be compressed by a combination of common entropy coding techniques such as Huffman encoding, LZ77 or the more recently proposed Zstandard. We found Zstandard to be the most effective.
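A minimal sketch of this storage step is shown below. It uses Python's built-in `zlib` (DEFLATE, which combines LZ77 with Huffman coding) as a stand-in for Zstandard, purely so that the example has no third-party dependency. The index-plus-value layout and the function names are assumptions for illustration, not the project's file format.

```python
import zlib
import numpy as np

def compress_sparse(codes):
    """Store nonzero positions and values of an int8 matrix, then DEFLATE them."""
    flat = codes.ravel()
    indices = np.flatnonzero(flat).astype(np.uint32)
    values = flat[indices]
    payload = indices.tobytes() + values.tobytes()
    return zlib.compress(payload, 9), int(indices.size)

def decompress_sparse(blob, shape, num_nonzero, dtype=np.int8):
    """Inverse of compress_sparse: rebuild the dense matrix."""
    raw = zlib.decompress(blob)
    split = 4 * num_nonzero  # each uint32 index occupies 4 bytes
    indices = np.frombuffer(raw[:split], dtype=np.uint32)
    values = np.frombuffer(raw[split:], dtype=dtype)
    dense = np.zeros(int(np.prod(shape)), dtype=dtype)
    dense[indices] = values
    return dense.reshape(shape)

rng = np.random.default_rng(0)
# Toy quantized layer with roughly 90% of entries pruned to zero.
codes = rng.integers(-127, 128, size=(128, 128)).astype(np.int8)
codes[rng.random(codes.shape) < 0.9] = 0

blob, num_nonzero = compress_sparse(codes)
restored = decompress_sparse(blob, codes.shape, num_nonzero)
print(len(blob), "bytes compressed;", codes.size, "bytes dense int8")
```

Bias vectors, being dense, would simply be serialized as-is and passed through the same compressor.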
Here we compare our method against three existing codecs: JPEG, JPEG2000, and WebP. On natural images, we match or exceed the performance of JPEG2000 but do worse than WebP.
To demonstrate the benefit of learnt representations we consider images which are comprised mostly of
JPEG and JPEG2000 are not designed for this domain and perform poorly. Despite having a pipeline
to JPEG, our learnt representations better preserve the high-frequency components and outperform
the other approaches.
Here are some qualitative results in the low bit-rate regime. On the top row we compress the raw image by 150 times and on the bottom row by 80 times. Evaluated on PSNR, our method outperforms both JPEG variants but falls short of WebP. Block artifacts are visible in both JPEG and WebP, since these are block-based compression methods. In contrast, JPEG2000 shows only ringing artifacts, visible as blur and rings near edges. Our approach does not have any of these major artifacts, but loses high-frequency detail at high compression rates.
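For reference, PSNR as used in these comparisons can be computed as in the sketch below. The function name and the toy inputs are illustrative; `peak` should match the intensity range of the data (1.0 for images normalised to [0, 1], or 65535 for raw 16-bit images).

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(peak ** 2 / mse))

reference = np.zeros((8, 8, 3))
estimate = np.full((8, 8, 3), 0.1)  # constant error of 0.1, so MSE = 0.01
print(psnr(reference, estimate))  # approximately 20 dB
```

Because PSNR depends only on mean squared error, it does not distinguish between the block, ringing and smoothing artifacts described above, which is why the qualitative comparison is shown alongside it.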
