Autoregressive models factorize the joint distribution over structured outputs into a product of conditional distributions. Unlike GANs [9], they can serve as powerful density estimators, are more stable during training [13,14], and generalize well to held-out data. They have been successfully used to model distributions across domains such as images [5,12,13], video, and language, and our work explores their benefits across a broad range of 3D generation tasks.
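Concretely, the factorization is the standard chain rule over a fixed ordering of the output elements:

```latex
% Chain-rule factorization underlying autoregressive models: the joint
% over an output x = (x_1, ..., x_n) decomposes into per-element
% conditionals, each predicted from the prefix of earlier elements.
p(x) = \prod_{i=1}^{n} p\left(x_i \mid x_1, \ldots, x_{i-1}\right)
```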
Following recent successes in autoregressive modeling [3,16], our work adopts a Transformer-based architecture [17]. However, such architectures cannot be directly applied to volumetric 3D representations due to their high resolution. We build on the work of van den Oord et al. [15], who proposed the Vector-Quantized Variational AutoEncoder (VQ-VAE) to learn quantized and compact latent representations of images. Inspired by Esser et al. [7], who learned autoregressive generation over the discrete VQ-VAE representations, our work extends these ideas to the domain of 3D shapes.
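To make the two-stage pipeline concrete, below is a minimal PyTorch sketch, not the paper's exact architecture: the codebook size, layer widths, module names, and the omission of positional embeddings and training losses are simplifying assumptions. A vector quantizer discretizes encoder features into code indices, and a causal Transformer models those indices autoregressively:

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Maps continuous latents to the nearest codebook entry."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                          # z: (B, N, dim)
        # squared distance from each latent to every codebook vector
        d = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        idx = d.argmin(dim=-1)                     # (B, N) discrete token ids
        z_q = self.codebook(idx)                   # quantized latents
        # straight-through estimator: gradients bypass the argmin
        z_q = z + (z_q - z).detach()
        return z_q, idx    # codebook/commitment losses omitted for brevity

class LatentTransformerPrior(nn.Module):
    """Causal Transformer over discrete VQ code indices
    (positional embeddings omitted for brevity)."""
    def __init__(self, num_codes=512, dim=64, num_layers=4, num_heads=4):
        super().__init__()
        self.tok = nn.Embedding(num_codes, dim)
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, idx):                        # idx: (B, N) token ids
        n = idx.size(1)
        # causal mask: position i may only attend to positions <= i
        mask = torch.triu(torch.full((n, n), float("-inf"),
                                     device=idx.device), diagonal=1)
        h = self.blocks(self.tok(idx), mask=mask)
        return self.head(h)                        # (B, N, num_codes) logits
```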
Shape Completion
Completing full shapes from partial observations, such as discrete parts or single-view 3D, is an increasingly important task in robotics and graphics. Most recent approaches [1,4,18] formulate it as completion over point clouds and can infer plausible global shapes, but have difficulty capturing fine-grained details, conditioning on sparse inputs, or generating diverse samples. Our work proposes an alternative approach that uses autoregressive shape priors, as sketched below.
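A hedged sketch of completion as conditional sampling, assuming the `LatentTransformerPrior` above, a fixed raster ordering of the latent tokens, and a start token reusing index 0 (an illustrative assumption); left-to-right sampling is a simplification of order-agnostic conditioning:

```python
import torch

@torch.no_grad()
def complete(prior, observed, known, bos=0, num_samples=4, temperature=1.0):
    """Sample diverse completions; observed latent tokens stay fixed."""
    # observed: (N,) token ids; known: (N,) bool mask of observed positions
    n = observed.numel()
    out = observed.unsqueeze(0).repeat(num_samples, 1)
    start = torch.full((num_samples, 1), bos, dtype=out.dtype)
    for i in range(n):
        if known[i]:
            continue                                 # keep observed tokens
        ctx = torch.cat([start, out[:, :i]], dim=1)  # BOS + prefix so far
        logits = prior(ctx)[:, -1]                   # logits for position i
        probs = torch.softmax(logits / temperature, dim=-1)
        out[:, i] = torch.multinomial(probs, 1).squeeze(-1)
    return out                                       # (num_samples, N)
```

Because sampling is stochastic, repeated calls yield distinct yet plausible completions of the same partial input.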
Single-View Reconstruction
Inferring a 3D shape from a single image is an inherently ill-posed task. Several approaches have shown impressive single-view reconstruction results using voxels [6,8], point clouds [11,19], and, most recently, implicit representations of 3D surfaces such as SDFs [10,20]. However, these approaches are typically deterministic and generate only a single 3D output. By treating image-based prediction as a conditional distribution, our work captures the multi-modal nature of conditional generation in a simple and elegant manner.
Language-based Generation
Language is a highly effective and parsimonious modality for describing real-world shapes and objects. Chen et al. [2] proposed a method to learn a joint text-shape embedding, followed by a GAN-based [9] generator for synthesizing 3D shapes from text. However, generating shapes from text is a fundamentally multi-modal task, and a GAN-based approach struggles to capture the multiple output modes. In contrast, our project first learns a ‘naive’ language-guided conditional distribution and combines it with shape priors to generate diverse and plausible shapes, as sketched below.
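A minimal sketch of this combination, under the assumption that the conditional model exposes per-position token logits (the `cond_logits` interface below is hypothetical): since adding logits corresponds to multiplying the underlying distributions, each step fuses the ‘naive’ conditional with the autoregressive prior:

```python
import torch

@torch.no_grad()
def sample_conditioned(prior, cond_logits, bos=0, temperature=1.0):
    """Fuse a per-position conditional with the autoregressive prior."""
    # cond_logits: (N, K) logits from, e.g., a text-conditioned predictor
    n, _ = cond_logits.shape
    out = torch.full((1, n + 1), bos, dtype=torch.long)  # slot 0 is BOS
    for i in range(n):
        prior_logits = prior(out[:, : i + 1])[:, -1]     # prior given prefix
        fused = prior_logits + cond_logits[i]   # product of distributions
        probs = torch.softmax(fused / temperature, dim=-1)
        out[0, i + 1] = torch.multinomial(probs, 1).item()
    return out[:, 1:]                           # (1, N) sampled token ids
```

The sampled token grid would then be decoded into a 3D shape by the VQ-VAE decoder described above.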
[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In ICML, 2018.
[2] Kevin Chen, Christopher B Choy, Manolis Savva, Angel X Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. In ACCV, 2018.
[3] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.
[4] Xuelin Chen, Baoquan Chen, and Niloy J Mitra. Unpaired point cloud completion on real scans using adversarial training. In ICLR, 2020.
[5] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. Pixelsnail: An improved autoregressive generative model. In ICML, 2018.
[6] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV, 2016.
[7] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
[8] Shubham Tulsiani, Tinghui Zhou, Alexei A. Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, 2017.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[10] Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization. In CVPR, 2020.
[11] Priyanka Mandikal, Navaneet K. L., Mayank Agarwal, and Venkatesh Babu Radhakrishnan. 3d-lmnet: Latent embedding matching for accurate and diverse 3d point cloud reconstruction from a single image. In BMVC, 2018.
[12] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In ICML, 2016.
[13] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P. Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
[14] Benigno Uria, Iain Murray, and Hugo Larochelle. Rnade: The real-valued neural autoregressive density-estimator. In NeurIPS, 2013.
[15] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. In NeurIPS, 2017.
[16] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. In NeurIPS, 2019.
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[18] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In ICCV, 2021.
[19] Rundi Wu, Yixin Zhuang, Kai Xu, Hao Zhang, and Baoquan Chen. Pq-net: A generative part seq2seq network for 3d shapes. In CVPR, 2020.
[20] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. In NeurIPS, 2019.