Deep Sprite-based Image Models:
An Analysis

1LIGM, ENPC, IP Paris, Univ Gustave Eiffel, CNRS
ArXiv, 2026

Abstract

While foundation models drive steady progress in image segmentation and diffusion models compose ever more realistic images, the seemingly simple problem of identifying recurrent patterns in a collection of images remains very much open. In this paper, we focus on sprite-based image decomposition models, which have shown promise for clustering and image decomposition and are appealing because of their high interpretability. These models come in different flavors, need to be tailored to specific datasets, and struggle to scale to images with many objects. We dive into the details of their design, identify their core components, and perform an extensive analysis on clustering benchmarks. We leverage this analysis to propose a deep sprite-based image decomposition method that performs on par with state-of-the-art unsupervised class-aware image segmentation methods on the standard CLEVR benchmark, scales linearly with the number of objects, explicitly identifies object categories, and fully models images in an easily interpretable way.


Overview

We decompose all sprite-based models into four main components: (1) a Sprite Generation Module that outputs $K$ sprites $S$; (2) a Transformation Module that takes as input an image $I$ and the sprites $S$ and predicts transformed sprites $\bar{S}^I$; (3) a Decision Module that takes the image $I$ and the transformed sprites $\bar{S}^I$ as input and outputs a probability distribution $p^I$ over sprite usage; and (4) a Training Criterion, which consists of a reconstruction loss and optional regularization terms.
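As a rough illustration of this four-component decomposition, here is a minimal NumPy sketch. All names, shapes, and concrete choices (integer translations as the transformation, a softmax over per-sprite reconstruction errors as the decision) are hypothetical simplifications for exposition, not the paper's actual architecture:

```python
import numpy as np

K, H, W = 3, 8, 8  # hypothetical: 3 sprites, 8x8 grayscale images
rng = np.random.default_rng(0)

# (1) Sprite Generation Module: K sprites, each an appearance plus a soft mask
# (learnable in the real model; random placeholders here).
sprites = rng.uniform(size=(K, H, W))  # appearances S
masks = rng.uniform(size=(K, H, W))    # soft occupancy masks

def transform(image, sprites, masks, shifts):
    """(2) Transformation Module, reduced to per-sprite integer translations.

    In the real model the shifts would be predicted from `image`; here
    they are supplied directly for simplicity.
    """
    t_sprites = np.stack([np.roll(s, sh, axis=(0, 1)) for s, sh in zip(sprites, shifts)])
    t_masks = np.stack([np.roll(m, sh, axis=(0, 1)) for m, sh in zip(masks, shifts)])
    return t_sprites, t_masks

def decision(image, t_sprites, t_masks):
    """(3) Decision Module: distribution p over sprites, here a softmax
    over negative masked reconstruction errors."""
    errs = np.array([np.mean(m * (image - s) ** 2)
                     for s, m in zip(t_sprites, t_masks)])
    logits = -errs
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reconstruct(t_sprites, t_masks, p):
    """Composite the transformed sprites, weighted by p and the masks."""
    w = p[:, None, None] * t_masks
    return (w * t_sprites).sum(axis=0) / (w.sum(axis=0) + 1e-8)

image = rng.uniform(size=(H, W))
ts, tm = transform(image, sprites, masks, shifts=[(0, 0), (1, 2), (-1, 0)])
p = decision(image, ts, tm)
recon = reconstruct(ts, tm, p)
# (4) Training Criterion: reconstruction loss (regularizers omitted).
loss = np.mean((image - recon) ** 2)
```

The sketch only makes the data flow between the four modules concrete; in practice each module is a trained network and the whole pipeline is optimized end-to-end through the reconstruction loss.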


Design Choices


Key Findings


Results on CLEVR Benchmark

CLEVR

Figure: The three left columns show the sprites' appearances (Frg.), masks, and combination (Sprite), including the empty sprite and the background. The other columns show, for four different examples, the input image, its reconstruction, semantic segmentation (Sem. Seg.), instance segmentation (Ins. Seg.), background (Bkg. Layer), and the different transformed sprites (Object Layers).


Method                                                   mIoU$^\dagger$  ARI-FG$^\dagger$  mAcc (class-aware)  avg-mIoU (class-aware)
MONet [Burgess et al. (2019)]                            30.7±14.9       54.5±11.4         -                   -
IODINE [Greff et al. (2019)]                             45.1±17.9       93.8±0.8          -                   -
SPAIR [Crawford & Pineau (2019)]                         66.0±4.0        77.1±1.9          -                   -
GNM [Jiang & Ahn (2020)]                                 59.9±3.7        65.1±4.2          -                   -
Slot Attention [Locatello et al. (2020)]                 36.6±24.8       95.9±2.4          -                   -
eMORL [Emami et al. (2021)]                              50.2±22.6       93.3±3.2          -                   -
Genesis-V2 [Engelcke et al. (2021)]                      9.5±0.6         57.9±20.4         -                   -
MarioNette [Smirnov et al. (2021)]                       72.1±0.6        56.8±0.4          16.1±0.2            7.3±0.4
AST-Seg-B3-CT [Sauvalle & de La Fortelle (2023)]         90.3±0.2        98.3±0.1          20.8±1.2            12.1±0.2
DTI-Sprites [Monnier et al. (2021)]                      54.5±1.2        93.2±2.0          69.8±4.5            55.7±6.0
Ours-D                                                   53.8±0.3        95.1±0.5          70.6±0.2            55.3±0.2

Table: Comparisons for instance segmentation on CLEVR, with standard deviations over 3 runs. Sources for the $\dagger$ columns (excluding Monnier et al. [2021]): Karazija et al. [2021] and Sauvalle et al. [2023].

BibTeX

@misc{baltaci2026deepsprite,
      title={Deep sprite-based image models: An analysis}, 
      author={Zeynep Sonat Baltacı and Romain Loiseau and Mathieu Aubry},
      year={2026},
      eprint={2604.19480},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.19480}, 
}

Acknowledgement

This work was funded by the ANR project VHS ANR-21-CE38-0008 and the ERC project DISCOVER, funded by the European Union's Horizon Europe Research and Innovation program under grant agreement No. 101076028. This work was granted access to the HPC resources of IDRIS under allocations AD011015415R1, AD011015415, and AD011014404 made by GENCI. We would like to thank Ioannis Siglidis for insightful discussions, and Robin Champenois and Ségolène Albouy for their contributions to the codebase.