The goal of this work is to efficiently identify visually similar patterns in a pair of images, e.g., an artwork detail copied between an engraving and an oil painting, or a night-time photograph matched with its daytime counterpart. Lack of training data is a key challenge for this task. We present a simple yet surprisingly effective approach to overcome this difficulty: we generate synthetic training pairs by selecting object segments in an image and copy-pasting them into another image. We then learn to predict the repeated object masks. We find that it is crucial to predict the correspondences as an auxiliary task and to use Poisson blending and style transfer on the training pairs to generalize to real data. We analyse results with two deep architectures relevant to our joint image analysis task: a transformer-based architecture and Sparse Nc-Net, a recent network designed to predict coarse correspondences using 4D convolutions. We show that our approach provides clear improvements for artwork detail retrieval on the Brueghel dataset and achieves competitive performance on two place recognition benchmarks, Tokyo247 and Pitts30K. We then demonstrate the potential of our approach by performing object discovery on the Internet object discovery dataset and the Brueghel dataset.
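The pair-generation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary segment mask is already available, uses a plain hard paste (no Poisson blending or style transfer, both of which the text reports as crucial for generalizing to real data), and the function name `swap_segment` and its arguments are hypothetical. It returns the composite image, the mask of the repeated object in that image, and the pixel correspondences between the source segment and its pasted copy, which are the two targets the networks are trained to predict.

```python
import numpy as np


def swap_segment(src, seg_mask, dst, offset):
    """Paste the masked segment of `src` into `dst`, shifted by offset=(dy, dx).

    Hypothetical helper for illustration only (hard paste, no blending).
    Returns:
      out      -- copy of `dst` with the segment pasted in
      dst_mask -- boolean mask of the pasted segment in `out`
      corr     -- (N, 4) integer array of (y_src, x_src, y_dst, x_dst)
    """
    h, w = dst.shape[:2]
    dy, dx = offset

    # Pixel coordinates of the segment in the source image.
    ys, xs = np.nonzero(seg_mask)

    # Shifted coordinates in the destination; drop pixels that fall outside.
    ty, tx = ys + dy, xs + dx
    keep = (ty >= 0) & (ty < h) & (tx >= 0) & (tx < w)
    ys, xs, ty, tx = ys[keep], xs[keep], ty[keep], tx[keep]

    out = dst.copy()
    out[ty, tx] = src[ys, xs]

    dst_mask = np.zeros((h, w), dtype=bool)
    dst_mask[ty, tx] = True

    corr = np.stack([ys, xs, ty, tx], axis=1)
    return out, dst_mask, corr


# Toy example: random images and a square "object" segment (all assumed data).
rng = np.random.default_rng(0)
src = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
dst = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
seg = np.zeros((64, 64), dtype=bool)
seg[10:30, 20:40] = True

pair_img, pair_mask, corr = swap_segment(src, seg, dst, offset=(25, 15))
```

In this sketch, `(src, pair_img)` plays the role of a synthetic training pair, `seg` and `pair_mask` are the repeated-object masks in each image, and `corr` gives the ground-truth correspondences for the auxiliary task.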


Video (11 min)


  • We train our cross-image transformer or Sparse Nc-Net on the generated pairs. Both networks jointly predict masks and correspondences.

  • Retrieval results. More results can be found in the supplementary material.

    • Retrieval results on Brueghel. Green bounding-boxes are one-shot detection results.
    • Retrieval results on Tokyo247.
    • Retrieval results on Pitts30K.

  • Discovery results. More results can be found in the supplementary material.

    • Discovery results on Brueghel.

  • Code and Paper

    To cite our paper,

      title={Learning Co-segmentation by Segment Swapping for Retrieval and Discovery},
      author={Shen, Xi and Efros, Alexei A and Joulin, Armand and Aubry, Mathieu},


This work was supported in part by ANR project EnHerit ANR-17-CE23-0008, project Rapid Tabasco, and IDRIS under the allocation AD011011160R1 made by GENCI.