Do concepts emerge with language? Is a concept born with its annotation? We argue that the ability of humans to name objects from only a few weak labels could be explained if visual concepts actually emerge before labels, organizing the visual world into a set of frequently occurring prototypes. The ability to identify the same object, or similar objects, under varying scales, poses, camera placements, illuminations and deformations is critical for visual concepts to emerge. We present a model that detects objects, quantizes them to prototypes, corresponds detected objects across scenes, and improves its features over time using these self-generated correspondences. It carries out these operations in a latent 3-dimensional visual feature space, inferred from the input RGB-D images using differentiable fully-convolutional inverse graphics architectures. This latent 3D feature space provides a featurization of the 3D world scene depicted in the images. As a result, it is invariant to camera viewpoint changes or zooms, and does not suffer from foreshortening or cross-object occlusions. We represent our prototypes as 3-dimensional feature maps. They are rotated and scaled appropriately during quantization to explain object instances in a variety of 3D poses and scales. We show that 3D object detection, object-to-prototype quantization and correspondence inference improve over time and help one another. We demonstrate the usefulness of our framework in few-shot learning: one or a few object labels suffice to learn a pose-aware 3D object detector for the object category.
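The pose-aware object-to-prototype quantization described above can be illustrated with a minimal sketch. The function and variable names here are hypothetical, and for simplicity the rotation search is restricted to discrete 90-degree yaw rotations of the prototype's 3D feature map (the full method also searches over scales and finer rotations); matching is scored with cosine similarity over flattened features.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flattened feature volumes."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def quantize_to_prototype(obj_feats, prototypes, n_rotations=4):
    """Assign a detected object's 3D feature map to its best-matching
    prototype, searching over discrete yaw rotations of each prototype.

    obj_feats:   array of shape (D, H, W, C), the object's 3D feature map.
    prototypes:  list of arrays of the same shape.
    Returns (best_score, prototype_index, rotation_steps).
    """
    best = (-1.0, None, None)
    for p_idx, proto in enumerate(prototypes):
        for k in range(n_rotations):
            # Rotate the prototype in the ground plane (spatial axes 0 and 1);
            # the channel axis is left untouched.
            rotated = np.rot90(proto, k, axes=(0, 1))
            score = cosine(obj_feats, rotated)
            if score > best[0]:
                best = (score, p_idx, k)
    return best
```

For example, an object whose features equal a prototype rotated by 180 degrees should be quantized back to that prototype with the corresponding rotation recovered:

```python
rng = np.random.default_rng(0)
protos = [rng.standard_normal((4, 4, 4, 8)) for _ in range(3)]
obj = np.rot90(protos[1], 2, axes=(0, 1))
score, idx, k = quantize_to_prototype(obj, protos)
# idx == 1, k == 2, score close to 1.0
```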
