{"id":48,"date":"2022-04-30T18:06:28","date_gmt":"2022-04-30T22:06:28","guid":{"rendered":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/?page_id=48"},"modified":"2022-12-20T05:10:58","modified_gmt":"2022-12-20T10:10:58","slug":"technical-report","status":"publish","type":"page","link":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/","title":{"rendered":"Technical Report"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Abstract<\/h2>\n\n\n\n<p>Object pose estimation is a fundamental requirement for robotic manipulation tasks. Most methods accurate enough for deployment in real-world industrial settings depend on known 3D object models, such as CAD models. While such approaches may suffice in industries like manufacturing, they become a bottleneck in settings where object appearance and shape change constantly, such as e-commerce warehouses. This project proposes a mechanism to automate model-free pose estimation. We develop a pipeline that enables the system to learn the shape and appearance of novel objects and identify their poses without manual supervision or existing datasets. 
<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Data<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"511\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-1024x511.png\" alt=\"\" class=\"wp-image-72\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-1024x511.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-300x150.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-768x383.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-920x459.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-230x115.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-350x175.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-480x239.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1.png 1237w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Sample data from Mujin<\/figcaption><\/figure>\n\n\n\n<p>The input to the system consists of high-resolution RGB images captured from a top-down view, along with left and right grayscale images from a stereo camera pair. Corresponding structured point clouds are also available. 
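<\/p>\n\n\n\n<p>A structured (organized) point cloud keeps the image layout, storing one 3D point per pixel. As a minimal sketch of how such a cloud relates to a depth map, the NumPy code below back-projects each pixel using pinhole intrinsics <code>fx<\/code>, <code>fy<\/code>, <code>cx<\/code>, <code>cy<\/code>. It assumes an undistorted pinhole camera, depth measured along the optical axis, and extrinsics given as a camera-to-world rotation <code>R<\/code> and translation <code>t<\/code>. The function names are hypothetical and do not describe Mujin&#8217;s actual data format.<\/p>\n\n\n\n

```python
import numpy as np


def depth_to_structured_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an H x W depth map into an H x W x 3 array of camera-frame points.

    Assumes an undistorted pinhole camera and depth measured along the optical
    (z) axis; the output points use the same units as the depth values.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # u indexes columns (image x), v indexes rows (image y); both have shape (h, w).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    # Stacking on the last axis keeps the image layout: point [r, c] comes from pixel (r, c).
    return np.stack([x, y, depth], axis=-1)


def camera_to_world(points, R, t):
    """Map camera-frame points to the world frame, assuming R, t are camera-to-world extrinsics."""
    return points @ np.asarray(R, dtype=np.float64).T + np.asarray(t, dtype=np.float64)


# Toy example: a flat 4 x 6 depth map at 2.0 units with made-up intrinsics.
points = depth_to_structured_point_cloud(np.full((4, 6), 2.0), fx=100.0, fy=100.0, cx=3.0, cy=2.0)
print(points.shape)  # (4, 6, 3)
```

<p>Because the output keeps the H x W layout, a boolean pixel mask of the same height and width (for example, an instance mask) can index the point array directly to extract that object&#8217;s 3D points.<\/p>\n\n\n\n<p>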
The camera intrinsics and extrinsics are provided as well.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Initial Experiments<\/h2>\n\n\n\n<h4 class=\"wp-block-heading\"><a href=\"https:\/\/github.com\/BerkeleyAutomation\/sd-maskrcnn\" target=\"_blank\" rel=\"noreferrer noopener\">SD-MRCNN<\/a><\/h4>\n\n\n\n<p>The first step toward identifying object poses is to identify individual object instances. We chose Synthetic Depth Mask R-CNN (SDMRCNN) as our baseline model to evaluate the current state of the art in instance recognition on the Mujin dataset. We chose it because it is a Mask R-CNN-based approach that uses depth data and classifies the objects in the scene as either foreground (objects) or background (tote\/container).<\/p>\n\n\n\n<p>The following images show sample results of SDMRCNN pre-trained on the WISDOM dataset and evaluated on Mujin data.<\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"380\" height=\"347\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-3.png\" alt=\"\" class=\"wp-image-75\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-3.png 380w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-3-300x274.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-3-230x210.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-3-350x320.png 350w\" sizes=\"auto, (max-width: 380px) 100vw, 380px\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure 
class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"382\" height=\"342\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-4.png\" alt=\"\" class=\"wp-image-76\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-4.png 382w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-4-300x269.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-4-230x206.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-4-350x313.png 350w\" sizes=\"auto, (max-width: 382px) 100vw, 382px\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"382\" height=\"343\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-5.png\" alt=\"\" class=\"wp-image-77\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-5.png 382w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-5-300x269.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-5-230x207.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-5-350x314.png 350w\" sizes=\"auto, (max-width: 382px) 100vw, 382px\" \/><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"378\" height=\"337\" 
src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-6.png\" alt=\"\" class=\"wp-image-78\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-6.png 378w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-6-300x267.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-6-230x205.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-6-350x312.png 350w\" sizes=\"auto, (max-width: 378px) 100vw, 378px\" \/><\/figure>\n<\/div>\n<\/div>\n\n\n\n<p>These results show that the model fails to identify object instances, and as object complexity increases (lack of texture, reflections, transparency), it fails to detect anything at all. Training SDMRCNN on data samples from Mujin would improve instance recognition, but it would require creating new datasets for these kinds of objects. Since the objects appearing in the warehouse change constantly, it is not feasible to keep creating new datasets. 
Therefore, we need a better alternative.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">Solution Pipeline<\/h2>\n\n\n\n<p>To overcome these issues, we propose the following multi-stage pipeline for novel object pose recognition.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"115\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-7-1024x115.png\" alt=\"\" class=\"wp-image-79\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-7-1024x115.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-7-300x34.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-7-768x87.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-7-920x104.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-7-230x26.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-7-350x39.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-7-480x54.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-7.png 1073w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>For each stage of the pipeline, we identify a baseline approach to perform the corresponding task. 
Once the full pipeline is implemented, we focus on improving the results of each stage.<\/p>\n\n<h2 class=\"wp-block-heading\">Phase-1<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"134\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-36-1024x134.png\" alt=\"\" class=\"wp-image-145\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-36-1024x134.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-36-300x39.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-36-768x101.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-36-920x121.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-36-230x30.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-36-350x46.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-36-480x63.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-36.png 1036w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Stage-1: First Pick<\/h3>\n\n\n\n<p>The first stage of the pipeline is to pick an object from the tote using a model-free approach. Identifying and picking an object without knowing its 3D model is costly in terms of compute, memory, and time. 
However, this process is performed only once, and its cost is amortized over the thousands of objects that may follow, so this costly first step is acceptable.<\/p>\n\n\n\n<p>Since existing solutions can perform this task even on complex objects (e.g., <a href=\"https:\/\/berkeleyautomation.github.io\/fcgqcnn\/\">Fully Convolutional GQ-CNN by BerkeleyAutomation<\/a>), we leave this problem outside the current scope of the project and will revisit it once the rest of the pipeline is complete.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Stage-2: 3D Recognition<\/h3>\n\n\n\n<p>The goal of this stage is to reconstruct a 3D representation of the object picked up in stage-1. The resulting 3D model should accurately capture the object&#8217;s 3D geometry and textures.<\/p>\n\n\n\n<p>While several approaches could be used to reconstruct the object&#8217;s 3D model (e.g., multi-view reconstruction or 3D scanning), we use <a href=\"https:\/\/jasonyzhang.com\/ners\/\">NeRS (jasonyzhang.com)<\/a> as our baseline approach to generate the 3D model of the object.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">NeRS<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"829\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8-1024x829.png\" alt=\"\" class=\"wp-image-80\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8-1024x829.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8-300x243.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8-768x621.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8-1536x1243.png 1536w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8-920x744.png 
920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8-230x186.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8-350x283.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8-480x388.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-8.png 1540w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Data<\/h3>\n\n\n\n<p>To obtain initial results (and avoid logistical issues with camera and robot calibration), we created the input data for NeRS using a handheld object representative of the complexity of objects in Mujin&#8217;s warehouse environments. The object masks for these images were created manually to speed up completing the pipeline; this is a simple task that can later be automated with various methods.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"823\" height=\"1024\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-9-823x1024.png\" alt=\"\" class=\"wp-image-83\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-9-823x1024.png 823w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-9-241x300.png 241w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-9-768x955.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-9-230x286.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-9-350x435.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-9-480x597.png 480w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-9.png 862w\" sizes=\"auto, (max-width: 823px) 100vw, 823px\" \/><figcaption>Input images and corresponding masks for NeRS<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"245\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-10-1024x245.png\" alt=\"\" class=\"wp-image-84\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-10-1024x245.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-10-300x72.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-10-768x184.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-10-920x220.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-10-230x55.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-10-350x84.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-10-480x115.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-10.png 1347w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Results<\/h3>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"578\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11-1024x578.png\" alt=\"\" class=\"wp-image-87\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11-1024x578.png 1024w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11-300x169.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11-768x433.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11-1536x867.png 1536w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11-920x519.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11-230x130.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11-350x197.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11-480x271.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-11.png 1597w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"377\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-12-1024x377.png\" alt=\"\" class=\"wp-image-88\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-12-1024x377.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-12-300x110.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-12-768x283.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-12-920x338.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-12-230x85.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-12-350x129.png 350w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-12-480x177.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-12.png 1446w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><a href=\"https:\/\/s3.us-west-2.amazonaws.com\/secure.notion-static.com\/acb34552-759d-46d9-b562-328a7aecdb02\/download_%282%29.mp4?X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&amp;X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20220501%2Fus-west-2%2Fs3%2Faws4_request&amp;X-Amz-Date=20220501T000757Z&amp;X-Amz-Expires=86400&amp;X-Amz-Signature=3bfb72d91461f24d801cc03aec13afd587eb632f244b2e8decb286217e6eba0a&amp;X-Amz-SignedHeaders=host&amp;response-content-disposition=filename%20%3D%22download%2520%282%29.mp4%22&amp;x-id=GetObject\">Result: 3D model video<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Stage-3: Instance Recognition<\/h2>\n\n\n\n<p>In this stage, we produce an instance recognition model fine-tuned on the object picked up in stage-1. We do this by generating an instance recognition dataset from the 3D object model obtained in stage-2 and fine-tuning the instance recognition model (SDMRCNN) on it. The resulting model should perform much better than the pre-trained one shown in the initial experiments.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Generating a dataset for instance recognition<\/h3>\n\n\n\n<p><strong>Naive approach <\/strong>&#8211; Generate single-instance images of the object from different views, then overlay and blend them onto a template container image. However, the resulting images are not realistic. 
Moreover, blending the depth image of each object instance with a template depth map can produce incorrect final depth maps.<\/p>\n\n\n\n<p><strong>Alternate &amp; better approach <\/strong>&#8211; Use a 3D renderer to render a scene with multiple object meshes inside a template container mesh. The resulting images are more realistic, and a custom rasterizer &amp; shader can be used to generate accurate depth maps and object instance masks.<\/p>\n\n\n\n<p>We use this approach and implement it with PyTorch3D as the rendering framework.<\/p>\n\n\n\n<p><strong>Prerequisites<\/strong> <\/p>\n\n\n\n<ul class=\"has-cyan-bluish-gray-background-color has-background wp-block-list\"><li>A 3D mesh model of the target object from stage-2<\/li><li>A 3D mesh model of the template container<\/li><li>Scene metadata<ul><li>Camera settings &#8211; FOV, Z-near, Z-far, etc.<\/li><li>Size of the container and its distance from the camera<\/li><li>The size of each object relative to the container<\/li><li>etc.<\/li><\/ul><\/li><\/ul>\n\n\n\n<p><strong>Process<\/strong><\/p>\n\n\n\n<div class=\"wp-block-columns has-cyan-bluish-gray-background-color has-background is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<ol class=\"wp-block-list\"><li>Define a custom rasterizer and shader that can generate object depth maps, masks, and normal textures.<\/li><li>Place the camera and lights at the origin.<\/li><li>Place the container at a distance \u2018z\u2019, facing the camera.<\/li><li>Instantiate \u2018N\u2019 object meshes and apply a random rotation to each. 
<\/li><li>Identify \u2018N\u2019 points inside the container boundary to serve as the centers of the \u2018N\u2019 objects.<\/li><\/ol>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<ol class=\"wp-block-list\" start=\"6\"><li>Place each of the \u2018N\u2019 objects at these \u2018N\u2019 points inside the container.<\/li><li>Render the scene as seen by the camera using the custom rasterizer and shader from step 1. This produces a single RGB image, depth map, and instance mask.<\/li><li>Repeat steps 4 to 7 to generate multiple data samples.<\/li><li>Generate train-test splits and camera intrinsics, and organize the generated data in the format expected by the instance recognition model.<\/li><\/ol>\n<\/div>\n<\/div>\n\n\n\n<p><strong>Results<\/strong><\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"alignleft size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-14.png\" alt=\"\" class=\"wp-image-92\" width=\"322\" height=\"207\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-14.png 558w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-14-300x193.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-14-230x148.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-14-350x225.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-14-480x309.png 480w\" sizes=\"auto, (max-width: 322px) 100vw, 322px\" \/><figcaption>3D Scene<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"824\" height=\"812\" 
src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-15.png\" alt=\"\" class=\"wp-image-93\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-15.png 824w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-15-300x296.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-15-768x757.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-15-230x227.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-15-350x345.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-15-480x473.png 480w\" sizes=\"auto, (max-width: 824px) 100vw, 824px\" \/><\/figure><\/div>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-16-1024x736.png\" alt=\"\" class=\"wp-image-94\" width=\"789\" height=\"566\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-16-1024x736.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-16-300x216.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-16-768x552.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-16-920x661.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-16-230x165.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-16-350x251.png 350w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-16-480x345.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-16.png 1048w\" sizes=\"auto, (max-width: 789px) 100vw, 789px\" \/><figcaption>Individual instance masks<\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n\n\n\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-17.png\" alt=\"\" class=\"wp-image-95\" width=\"265\" height=\"171\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-17.png 450w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-17-300x194.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-17-230x149.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-17-350x226.png 350w\" sizes=\"auto, (max-width: 265px) 100vw, 265px\" \/><figcaption>SDMRCNN Dataset<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-18.png\" alt=\"\" class=\"wp-image-96\" width=\"269\" height=\"174\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-18.png 450w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-18-300x195.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-18-230x149.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-18-350x227.png 350w\" sizes=\"auto, (max-width: 269px) 100vw, 269px\" \/><figcaption>Generated Dataset<\/figcaption><\/figure>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\">\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-19.png\" alt=\"\" class=\"wp-image-97\" width=\"267\" height=\"172\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-19.png 450w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-19-300x194.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-19-230x149.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-19-350x226.png 350w\" sizes=\"auto, (max-width: 267px) 100vw, 267px\" \/><figcaption>Scaled Dataset<\/figcaption><\/figure>\n<\/div>\n<\/div>\n\n\n\n<h2 class=\"wp-block-heading\">Fine-tuning Instance Recognition<\/h2>\n\n\n\n<p>Once the object dataset is generated, we fine-tune our instance recognition model on it. 
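<\/p>\n\n\n\n<p>Step 9 of the generation process pairs the rendered samples with camera intrinsics and a train-test split. The standard-library sketch below shows one way to produce both. It assumes a pinhole camera with square pixels, a principal point at the image center, and a vertical field of view, so that <code>fy = (H \/ 2) \/ tan(fov_y \/ 2)<\/code>. Renderers differ in these conventions, so they should be checked against the camera settings actually used in PyTorch3D. The helper names and the seeded 80\/20 default are illustrative assumptions.<\/p>\n\n\n\n

```python
import math
import random


def intrinsics_from_fov(width, height, fov_y_deg):
    """Build a 3x3 pinhole intrinsic matrix (as nested lists) for the render camera.

    Assumptions: square pixels, principal point at the image center, and a
    vertical field of view given in degrees.
    """
    fy = (height / 2.0) / math.tan(math.radians(fov_y_deg) / 2.0)
    fx = fy  # square pixels => equal focal lengths in pixel units
    cx, cy = width / 2.0, height / 2.0
    return [[fx, 0.0, cx],
            [0.0, fy, cy],
            [0.0, 0.0, 1.0]]


def split_indices(num_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices with a fixed seed and split them into train and test lists."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_samples * test_fraction))
    # Sorting is only for readability; membership is decided by the shuffle.
    return sorted(indices[num_test:]), sorted(indices[:num_test])


K = intrinsics_from_fov(640, 480, fov_y_deg=60.0)
train_ids, test_ids = split_indices(1000)
print(len(train_ids), len(test_ids))  # 800 200
```

<p>Seeding the shuffle keeps the split reproducible across runs, so the same rendered images stay in the test set when different models are compared.<\/p>\n\n\n\n<p>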
The following results show SDMRCNN recognition performance on this dataset of 1000 images, split into 800 training and 200 test images.<\/p>\n\n\n\n<p><strong>Pretrained<\/strong>: SD-MRCNN from the original work, implemented and benchmarked in TensorFlow.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"238\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-20-1024x238.png\" alt=\"\" class=\"wp-image-98\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-20-1024x238.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-20-300x70.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-20-768x179.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-20-920x214.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-20-230x54.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-20-350x81.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-20-480x112.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-20.png 1298w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><strong>PyTorch SD-MRCNN<\/strong>: A PyTorch implementation of the original SD-MRCNN, trained from scratch.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"241\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-21-1024x241.png\" alt=\"\" class=\"wp-image-99\" 
srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-21-1024x241.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-21-300x71.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-21-768x181.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-21-920x216.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-21-230x54.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-21-350x82.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-21-480x113.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-21.png 1309w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>However, the results are far from satisfactory: the training and validation loss curves show that the model overfits heavily. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"361\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22-1024x361.png\" alt=\"\" class=\"wp-image-100\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22-1024x361.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22-300x106.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22-768x271.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22-1536x542.png 1536w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22-920x325.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22-230x81.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22-350x123.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22-480x169.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-22.png 1837w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>One of the main problems is that the generated synthetic data is not realistic. 
Therefore, in Phase-2 of the project, we focus on synthetic data generation to improve instance segmentation, and we also revisit the implementation details of Stage-1.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Phase-2<\/h2>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"134\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-19-1024x134.png\" alt=\"\" class=\"wp-image-146\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-19-1024x134.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-19-300x39.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-19-768x101.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-19-920x121.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-19-230x30.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-19-350x46.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-19-480x63.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-49-19.png 1036w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Revisiting Stage-1: The First Pick<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">What are &#8220;auto-registration&#8221; and &#8220;first-pick&#8221;?<\/h4>\n\n\n\n<p>Mujin&#8217;s existing solution for bin picking relies on 
object registration information to detect object instances first, then their poses, and finally the best regions of interest for the gripper to pick each object. While this solution works, it cannot scale to scenarios that require thousands of object SKUs to be registered in advance, or to scenarios where object properties (shape, texture, etc.) change over time. Such scenarios require <strong>auto-registration<\/strong> of an object at its first pick, i.e., when a container enters the cell for the first time, the robot must pick an object and register it automatically without any manual intervention, so that subsequent picks can be performed efficiently using this registered information. To perform this auto-registration, the robot must first pick up an object without any prior information about the objects in the container. We refer to this as the <strong>first-pick.<\/strong><\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Approach<\/h4>\n\n\n\n<p>We formulate the problem as a task of &#8220;Unseen Object Instance Segmentation (UOIS)&#8221;: given a single RGB-D image of objects inside a container, the goal is to produce instance segmentation masks for all objects in the container, where the object instances are arbitrary (but belong to the same semantic class) and are not assumed to have been seen during training.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\">Relevant Papers<\/h5>\n\n\n\n<h6 class=\"wp-block-heading\">Segmentation with RGB-D<\/h6>\n\n\n\n<ol class=\"wp-block-list\"><li><a href=\"https:\/\/arxiv.org\/pdf\/2007.08073.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Unseen Object Instance Segmentation for Robotic Environments (2021)<\/a> &#8211; Introduces UOIS, UNet-like baseline arch, late fusion<\/li><li><a href=\"https:\/\/arxiv.org\/pdf\/2007.15157.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Learning RGB-D Feature 
Embeddings for Unseen Object Instance Segmentation (CoRL 2020)<\/a> &#8211; UNet-like arch (UCN &#8211; improves upon 1), late &amp; early fusion<\/li><li><a href=\"https:\/\/arxiv.org\/pdf\/2204.09847.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Unseen Object Instance Segmentation with Fully Test-time RGB-D Embeddings Adaptation (2022)<\/a> &#8211; UNet-like arch (reuses 2), late fusion, improves models at test time<\/li><li><a href=\"https:\/\/arxiv.org\/pdf\/2109.11103v2.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Unseen Object Amodal Instance Segmentation via Hierarchical Occlusion Modeling (ICRA 2022)<\/a> &#8211; MRCNN-based arch, late fusion, improves handling of occlusions<\/li><li>Category-agnostic Segmentation for Robotic Grasping (2022) &#8211; focuses on how to train with synthetic data, evaluates both MRCNN-like (4) and UNet-like (2) archs, late fusion.<\/li><\/ol>\n\n\n\n<h6 class=\"wp-block-heading\">Synthetic Datasets<\/h6>\n\n\n\n<ol class=\"wp-block-list\"><li><a href=\"https:\/\/bop.felk.cvut.cz\/datasets\/\" target=\"_blank\" rel=\"noreferrer noopener\">BOP Dataset<\/a> &#8211; Benchmark and datasets for object detection and pose estimation<\/li><li><a href=\"https:\/\/github.com\/swtyree\/hope-dataset\" target=\"_blank\" rel=\"noreferrer noopener\">Nvidia HOPE Dataset<\/a> &#8211; Synthetic dataset of common 3D grocery objects<\/li><li><a href=\"https:\/\/zenodo.org\/record\/6103779#.YqG5F9JByZQ\" target=\"_blank\" rel=\"noreferrer noopener\">DoPose Dataset<\/a> &#8211; Synthetic dataset of common 3D grocery objects inside a container.<\/li><li><a href=\"https:\/\/ais-bonn.github.io\/stillleben\/\" target=\"_blank\" rel=\"noreferrer noopener\">Stillleben<\/a> &#8211; Generates realistic arrangements of rigid objects to produce synthetic datasets.<\/li><\/ol>\n\n\n\n<h6 class=\"wp-block-heading\">Overview<\/h6>\n\n\n\n<p>The following solution is derived from the five papers listed above under Segmentation with RGB-D. 
Paper 1 introduces the task. The main network architecture and the training and testing procedures are derived from paper 2, and a few key ideas from papers 3, 4, and 5 are also incorporated into the proposed solution.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"664\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-26-36-1024x664.png\" alt=\"\" class=\"wp-image-111\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-26-36-1024x664.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-26-36-300x194.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-26-36-768x498.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-26-36-920x596.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-26-36-230x149.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-26-36-350x227.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-26-36-480x311.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-26-36.png 1233w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Our FCN is a U-Net-like network that uses a backbone network followed by a set of deconvolutional layers to generate a dense feature map. Different backbones can be used, e.g., VGG or ResNet; the paper's implementation uses a ResNet-34-8s. 
We can reuse the same network and backbone for the initial implementation. For our final implementation, we can use the network design from Yolact++, a fast segmentation network designed to process images in real time.<\/p>\n\n\n\n<h6 class=\"wp-block-heading\">Learning RGB-D Feature Embeddings<\/h6>\n\n\n\n<ol class=\"wp-block-list\"><li>Given an RGB image (H,W,3) and a depth image (H,W,1), back-project the depth image into an organized point cloud (H,W,3), i.e., per-pixel (x,y,z) coordinates, using the camera calibration parameters.<\/li><li>The FCN takes in the RGB image and the point cloud image and generates a dense feature map (H,W,C), where C is the dimension of the feature embeddings (as shown below). The RGB backbone uses pretrained weights, whereas the point cloud backbone is trained from scratch. The two feature maps are combined by element-wise addition.<\/li><li>These feature embeddings are normalized to unit length, and a mean-shift clustering algorithm groups the pixels in this latent space.<\/li><li>The network is trained with a metric learning loss that pulls pixels belonging to the same object close to each other and pushes pixels belonging to different objects far apart (the background is treated as one of the objects).<ul><li><em><strong>Loss = loss_intra + loss_inter<\/strong><\/em><\/li><\/ul><\/li><li>The final clustering result provides the segmentation masks. 
These masks are coarse and are refined in the next step.<\/li><\/ol>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"543\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-49-34-1024x543.png\" alt=\"\" class=\"wp-image-114\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-49-34-1024x543.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-49-34-300x159.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-49-34-768x407.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-49-34-920x487.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-49-34-230x122.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-49-34-350x185.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-49-34-480x254.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-49-34.png 1274w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h6 class=\"wp-block-heading\">Segmentation Mask Refinement<\/h6>\n\n\n\n<p>For objects lying close to each other, their corresponding pixels may be grouped into a single cluster and, therefore, into a single instance mask. 
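<\/p>
<p>The embedding steps listed in the previous subsection (back-projection, unit normalization, the metric learning loss, and mean-shift clustering) can be sketched in plain NumPy as below. This is a minimal illustration under stated assumptions, not the UCN implementation: it assumes a pinhole camera with intrinsics fx, fy, cx, cy; the hinge-style intra/inter loss on cosine distance with margins alpha and delta is one common formulation and may differ in detail from paper 2; and the clustering is a von Mises-Fisher-style mean shift with hypothetical parameters kappa and merge_thresh. All function names are illustrative.<\/p>

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    # Organized point cloud (H, W, 3) from a depth map (H, W), assuming a pinhole camera.
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v: row (y) index, u: column (x) index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def normalize(feat, eps=1e-8):
    # Scale every embedding vector (last axis) to unit length.
    return feat / (np.linalg.norm(feat, axis=-1, keepdims=True) + eps)


def metric_loss(emb, labels, alpha=0.02, delta=0.5):
    # emb: (N, C) unit embeddings; labels: (N,) object ids, background being one id.
    # Assumed cosine distance d(a, b) = 1 - a.b; forward pass only (no gradients).
    ids = np.unique(labels)
    mus = normalize(np.stack([emb[labels == k].mean(axis=0) for k in ids]))
    # Intra term: penalize pixels farther than alpha from their object's mean embedding.
    intra = np.mean([np.mean(np.maximum(1.0 - emb[labels == k] @ mu - alpha, 0.0) ** 2)
                     for k, mu in zip(ids, mus)])
    # Inter term: penalize pairs of object means that are closer than delta.
    pairs = [np.maximum(delta - (1.0 - mus[i] @ mus[j]), 0.0) ** 2
             for i in range(len(ids)) for j in range(i + 1, len(ids))]
    inter = np.mean(pairs) if pairs else 0.0
    return float(intra + inter)


def mean_shift_sphere(emb, kappa=20.0, iters=10, merge_thresh=0.9):
    # Mean shift with a von Mises-Fisher kernel: every embedding is a seed that
    # repeatedly moves to the kernel-weighted mean direction of all embeddings.
    modes = emb.copy()
    for _ in range(iters):
        sim = modes @ emb.T
        w = np.exp(kappa * (sim - sim.max(axis=1, keepdims=True)))  # numerically stable
        modes = normalize(w @ emb)
    # Give the same label to converged modes whose cosine similarity exceeds merge_thresh.
    labels = np.empty(len(emb), dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for k, c in enumerate(centers):
            if m @ c > merge_thresh:
                labels[i] = k
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels


if __name__ == '__main__':
    print(backproject(np.ones((2, 2)), fx=1.0, fy=1.0, cx=0.0, cy=0.0)[1, 1])
    # Object a is well separated; object c's embeddings almost coincide with object b's.
    a = normalize(np.array([[1.0, 0.05 * k, 0.0] for k in range(-2, 3)]))
    b = normalize(np.array([[0.0, 1.0, 0.05 * k] for k in range(-2, 3)]))
    c = normalize(b + np.array([0.1, 0.0, 0.0]))
    emb = np.concatenate([a, b, c])
    print(metric_loss(emb, np.array([0] * 5 + [1] * 5 + [2] * 5)))
    print(mean_shift_sphere(emb))
```

<p>In the demo, mean shift puts the five embeddings of object a into one cluster but merges objects b and c into a single cluster, which is the failure mode described above for objects lying close to each other.<\/p>
<p>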
To improve such segmentation masks, a second-stage refinement network is employed to specifically handle objects lying close to each other or on top of each other.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"379\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-35-08-1024x379.png\" alt=\"\" class=\"wp-image-113\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-35-08-1024x379.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-35-08-300x111.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-35-08-768x285.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-35-08-920x341.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-35-08-230x85.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-35-08-350x130.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-35-08-480x178.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-35-08.png 1212w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<ol class=\"wp-block-list\"><li>For each cluster in the feature embedding space, a corresponding RoI is cropped from the RGB-D image.<\/li><li>A second network is trained on these cropped RoIs, and masks are generated in the same way as above.<\/li><li>If RoI masks contain multiple objects, then only the objects 
whose overlap with the original stage-1 mask exceeds a predefined threshold are retained.<\/li><li>The final segmentation labels for the whole image are computed by simply aggregating the segments from all RoIs.<\/li><\/ol>\n\n\n\n<h6 class=\"wp-block-heading\">Additional Improvements<\/h6>\n\n\n\n<p>Once the above method works reliably, the following improvements can be made.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Amodal instance segmentation masks can be generated by incorporating the ideas from paper 4.<\/li><li>Paper 5 describes dataset creation and training practices that can improve accuracy.<\/li><li>Paper 3 provides methods to improve model accuracy at test time on real-world data.<\/li><\/ol>\n\n\n\n<h4 class=\"wp-block-heading\">UCN Expectations Vs Reality<\/h4>\n\n\n\n<p>The images below show the expected results of the UCN model. The feature map and the generated instance labels clearly separate individual object instances.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"525\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-55-23-1024x525.png\" alt=\"\" class=\"wp-image-117\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-55-23-1024x525.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-55-23-300x154.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-55-23-768x394.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-55-23-920x472.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-55-23-230x118.png 
230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-55-23-350x179.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-55-23-480x246.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-55-23.png 1274w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>However, UCN results on the real-world Mujin dataset look like the image below.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"215\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-59-19-1024x215.png\" alt=\"\" class=\"wp-image-118\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-59-19-1024x215.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-59-19-300x63.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-59-19-768x161.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-59-19-920x193.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-59-19-230x48.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-59-19-350x74.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-59-19-480x101.png 480w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-02-59-19.png 1242w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The pre-trained model does not recognize any object instances. This is expected, as the UCN model was never trained on Mujin data. The obvious next step would be to fine-tune the model on Mujin data and then run inference. However, we do not have a labeled real-world Mujin dataset to fine-tune on, so we need a realistic synthetic dataset instead. We therefore generate a synthetic dataset, fine-tune on it, and then run inference on the Mujin data.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Synthetic Dataset Generation<\/h4>\n\n\n\n<ol class=\"wp-block-list\"><li>The resources in the Synthetic Datasets list above provide the initial datasets needed to evaluate the models, as well as tools to create new synthetic datasets.<\/li><li>The datasets in this problem space follow the BOP dataset and benchmark format. BOP also provides a set of BlenderProc scripts to generate new synthetic datasets.<\/li><li>NVIDIA HOPE and DoPose provide datasets and 3D models of common grocery objects, which we can use to generate more data as required.<\/li><li>The datasets above are standard public datasets containing multiple SKUs in a single scene. We need to generate our own single-SKU synthetic dataset that models our project.<\/li><\/ol>\n\n\n\n<h5 class=\"wp-block-heading\">Object Models<\/h5>\n\n\n\n<p>We use BlenderProc to generate our synthetic data following the BOP dataset standard. We use 25 rigid, box-shaped object models of various sizes (small, medium, big, thin, and long). 
Some of the examples are as shown below.<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-1 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"119\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/sudoku_book.gif\" alt=\"\" class=\"wp-image-119\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"120\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/sticky_notes.gif\" alt=\"\" class=\"wp-image-120\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"124\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/stick_straw.gif\" alt=\"\" class=\"wp-image-124\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"121\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/spark_plug.gif\" alt=\"\" class=\"wp-image-121\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"122\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/papermate_pen.gif\" alt=\"\" class=\"wp-image-122\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"123\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/nelson_tea.gif\" alt=\"\" class=\"wp-image-123\" \/><\/figure>\n\n\n\n<figure 
class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"125\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/jello_strawberry_deformed.gif\" alt=\"\" class=\"wp-image-125\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"127\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/jello_chocolate_deformed.gif\" alt=\"\" class=\"wp-image-127\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"128\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/jaffa_cakes_deformed.gif\" alt=\"\" class=\"wp-image-128\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"126\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/expo_eraser.gif\" alt=\"\" class=\"wp-image-126\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"129\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/domino_deformed.gif\" alt=\"\" class=\"wp-image-129\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"130\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/crayola.gif\" alt=\"\" class=\"wp-image-130\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"131\" 
src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/spaghetti.gif\" alt=\"\" class=\"wp-image-131\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"133\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/popcorn.gif\" alt=\"\" class=\"wp-image-133\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"134\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/granola.gif\" alt=\"\" class=\"wp-image-134\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"132\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/pudding.gif\" alt=\"\" class=\"wp-image-132\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"135\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/raisins.gif\" alt=\"\" class=\"wp-image-135\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"136\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/mac_cheese.gif\" alt=\"\" class=\"wp-image-136\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"137\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/cheese.gif\" alt=\"\" class=\"wp-image-137\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" 
width=\"640\" height=\"480\" data-id=\"139\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/cookies.gif\" alt=\"\" class=\"wp-image-139\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"640\" height=\"480\" data-id=\"138\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/butter.gif\" alt=\"\" class=\"wp-image-138\" \/><\/figure>\n<\/figure>\n\n\n\n<p>We provide configurable scene options that control the camera, lighting, container, and objects, as well as the packing type (ordered, semi-ordered, or random). Our generated dataset comprises 13,500 training, 1,100 validation, and 3,460 test images. Since real-world depth data contains a lot of noise and holes, we post-process the synthetic depth to remove surfaces pointing to the viewing camera, mimicking the structured light depth sensor and making the synthetic data resemble real-world Mujin data. A sample from the resulting dataset is shown below. 
<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"590\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-21-41-1024x590.png\" alt=\"\" class=\"wp-image-141\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-21-41-1024x590.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-21-41-300x173.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-21-41-768x442.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-21-41-920x530.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-21-41-230x132.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-21-41-350x201.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-21-41-480x276.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-21-41.png 1242w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Our data generation framework generates 25 images at 640&#215;480 resolution, with segmentation masks written to HDF5, in 63.7 seconds. 
This is a ~45% improvement over the original BlenderProc framework.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"510\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-51-50-1024x510.png\" alt=\"\" class=\"wp-image-148\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-51-50-1024x510.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-51-50-300x149.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-51-50-768x382.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-51-50-920x458.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-51-50-230x115.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-51-50-350x174.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-51-50-480x239.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-51-50.png 1267w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Fine-Tuning UCN<\/h4>\n\n\n\n<p>Using the generated dataset, we fine-tune the pre-trained UCN model on 10K training images covering 9 object models and evaluate it on 3.5K test images covering 12 unseen object models. Sample training and test visualizations are shown below. 
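<\/p>
<p>Because the test object models must never be seen during fine-tuning, the split has to be made over object models rather than over individual images. The snippet below is a minimal sketch of such an object-disjoint split. It assumes, hypothetically, that each generated image is tagged with the ID of the single object model it contains; the function name and the model IDs are illustrative and not part of our pipeline.<\/p>

```python
def split_by_object_model(image_model_ids, train_models, test_models):
    # image_model_ids[i] is the (hypothetical) object-model ID rendered in image i.
    train_models, test_models = set(train_models), set(test_models)
    # Evaluating on unseen objects requires disjoint sets of object models.
    if train_models & test_models:
        raise ValueError('train and test object models overlap')
    train_idx = [i for i, m in enumerate(image_model_ids) if m in train_models]
    test_idx = [i for i, m in enumerate(image_model_ids) if m in test_models]
    return train_idx, test_idx


if __name__ == '__main__':
    ids = ['m01', 'm02', 'm01', 'm03', 'm02']
    tr, te = split_by_object_model(ids, ['m01', 'm02'], ['m03'])
    print(tr, te)  # [0, 1, 2, 4] [3]
```

<p>Images whose object model is in neither set (for example, models reserved for validation) are simply left out of both index lists.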
<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"937\" height=\"255\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-55-59.png\" alt=\"\" class=\"wp-image-149\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-55-59.png 937w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-55-59-300x82.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-55-59-768x209.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-55-59-920x250.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-55-59-230x63.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-55-59-350x95.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-55-59-480x131.png 480w\" sizes=\"auto, (max-width: 937px) 100vw, 937px\" \/><figcaption>Sample training input data <\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"567\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-12-26-1024x567.png\" alt=\"\" class=\"wp-image-155\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-12-26-1024x567.png 1024w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-12-26-300x166.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-12-26-768x425.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-12-26-920x510.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-12-26-230x127.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-12-26-350x194.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-12-26-480x266.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-12-26.png 1056w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Training Plots for fine-tuning UCN on generated synthetic data<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"205\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-56-28-1024x205.png\" alt=\"\" class=\"wp-image-151\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-56-28-1024x205.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-56-28-300x60.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-56-28-768x154.png 768w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-56-28-920x185.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-56-28-230x46.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-56-28-350x70.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-56-28-480x96.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-03-56-28.png 1231w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Sample test results on synthetic data<\/figcaption><\/figure>\n\n\n\n<p>As expected, the fine-tuned model performs well on the synthetic dataset. We also observe that it performs reasonably well on the real-world Mujin dataset. 
The example below compares the inference results of the pre-trained and fine-tuned models on a sample Mujin image.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"447\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-1024x447.png\" alt=\"\" class=\"wp-image-152\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-1024x447.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-300x131.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-768x336.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-920x402.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-230x100.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-350x153.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-480x210.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image.png 1275w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Improving Fine-Tuned UCN<\/h4>\n\n\n\n<p>The feature maps of UCN on Mujin data show many unwanted background activations. We attribute this to the current loss penalizing only the combined features of the RGB and depth branches, even though in some cases the RGB features are more informative and in others the depth features are. We therefore add loss terms that penalize the RGB and depth features separately. 
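<\/p>\n\n\n\n<p>To make the structure of this change concrete, here is a minimal NumPy sketch under stated assumptions; it is not the UCN code. The hypothetical <code>embedding_loss<\/code> is a simplified stand-in for UCN&#8217;s metric-learning loss (it pulls per-pixel embeddings toward their instance center and pushes instance centers apart), fusing the two branches by addition is an assumption, and <code>w_rgb<\/code> and <code>w_depth<\/code> are illustrative weights for the extra RGB and depth terms.<\/p>

```python
import numpy as np


def embedding_loss(features, labels, margin=0.5):
    # Hypothetical, simplified stand-in for UCN's metric-learning loss.
    # features: (N, D) per-pixel embeddings; labels: (N,) instance ids.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    ids = np.unique(labels)
    centers = []
    intra = 0.0
    for i in ids:
        fi = f[labels == i]
        mu = fi.mean(axis=0)
        mu = mu / np.linalg.norm(mu)
        centers.append(mu)
        # Intra-cluster term: mean cosine distance of pixels to their center.
        intra += np.mean(1.0 - fi @ mu)
    intra /= len(ids)
    # Inter-cluster term: hinge on the cosine distance between centers.
    inter = 0.0
    if len(ids) > 1:
        centers = np.stack(centers)
        dist = 1.0 - centers @ centers.T
        off_diag = ~np.eye(len(ids), dtype=bool)
        inter = np.mean(np.maximum(0.0, margin - dist[off_diag]))
    return intra + inter


def total_loss(f_rgb, f_depth, labels, w_rgb=1.0, w_depth=1.0):
    # Loss on the fused features plus separate RGB and depth terms.
    # Fusing the two branches by addition is an assumption for illustration.
    f_fused = f_rgb + f_depth
    return (embedding_loss(f_fused, labels)
            + w_rgb * embedding_loss(f_rgb, labels)
            + w_depth * embedding_loss(f_depth, labels))
```

<p>With <code>w_rgb = w_depth = 0<\/code> the sketch reduces to a loss on the combined features only, mirroring the original setup; non-unit weights correspond to the weighted RGB and depth variant discussed later in this section.<\/p>\n\n\n\n<p>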
The new total loss becomes:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"292\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-19-03-1024x292.png\" alt=\"\" class=\"wp-image-156\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-19-03-1024x292.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-19-03-300x86.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-19-03-768x219.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-19-03-920x263.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-19-03-230x66.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-19-03-350x100.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-19-03-480x137.png 480w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-19-03.png 1061w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>This removes the unwanted background activations, as shown below for the same images.<\/p>\n\n\n\n<figure class=\"wp-block-gallery has-nested-images columns-default is-cropped wp-block-gallery-2 is-layout-flex wp-block-gallery-is-layout-flex\">\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"635\" height=\"607\" data-id=\"157\" 
src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-20-52.png\" alt=\"\" class=\"wp-image-157\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-20-52.png 635w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-20-52-300x287.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-20-52-230x220.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-20-52-350x335.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-20-52-480x459.png 480w\" sizes=\"auto, (max-width: 635px) 100vw, 635px\" \/><figcaption>Original UCN<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"635\" height=\"607\" data-id=\"159\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-21-04.png\" alt=\"\" class=\"wp-image-159\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-21-04.png 635w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-21-04-300x287.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-21-04-230x220.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-21-04-350x335.png 350w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-21-04-480x459.png 480w\" sizes=\"auto, (max-width: 635px) 100vw, 635px\" \/><figcaption>UCN with extra loss<\/figcaption><\/figure>\n<\/figure>\n\n\n\n<p>We also experimented with weighting the RGB and depth loss terms, but this did not improve the results. To compare these model variants quantitatively, we compute their precision, recall, and F1 scores. The table below summarizes the results of these experiments.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"487\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-28-34-1024x487.png\" alt=\"\" class=\"wp-image-160\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-28-34-1024x487.png 1024w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-28-34-300x143.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-28-34-768x365.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-28-34-920x437.png 920w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-28-34-230x109.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-28-34-350x166.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-28-34-480x228.png 480w, 
https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-28-34.png 1229w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>Performance analysis of various UCN models on the generated synthetic data<\/figcaption><\/figure><\/div>\n\n\n\n<p>Based on these results, we use the UCN model with the extra loss as our final model for integration with the Mujin robots.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Integrating with Mujin Robots<\/h2>\n\n\n\n<p>The final goal of this work is to integrate with the Mujin robots to perform automatic piece-picking of unseen objects in warehouse scenarios. Although we did not focus on stages 4 and 5 of the pipeline, we integrate our UCN model using Mujin&#8217;s existing control and planning infrastructure. This pipeline expects a cost volume\/affordance map as input for picking, so we convert the instance segmentation masks produced by UCN into cost maps and feed them into the picking pipeline. 
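<\/p>\n\n\n\n<p>As a rough illustration of this conversion step, here is a minimal NumPy sketch. The report does not specify how the conversion is done or which cost convention the Mujin pipeline uses, so everything here is an assumption: the hypothetical <code>masks_to_cost_map<\/code> takes an instance label map (0 for background), assigns cost 1.0 to background, and lowers the cost of each instance pixel with its erosion-based city-block distance from the instance boundary, so that grasps near an object&#8217;s interior are preferred.<\/p>

```python
import numpy as np


def erode(mask):
    # Binary erosion with a 4-neighbourhood; pixels outside the image count
    # as background, so objects touching the border erode from that side too.
    m = np.pad(mask, 1, constant_values=False)
    return np.logical_and.reduce([
        m[1:-1, 1:-1], m[:-2, 1:-1], m[2:, 1:-1], m[1:-1, :-2], m[1:-1, 2:],
    ])


def interior_depth(mask):
    # City-block distance to the mask boundary: each pixel's depth is the
    # number of successive erosions it survives (boundary pixels get 1).
    depth = np.zeros(mask.shape, dtype=np.int32)
    current = mask.astype(bool)
    while current.any():
        depth += current
        current = erode(current)
    return depth


def masks_to_cost_map(label_map):
    # label_map: (H, W) ints, 0 = background, k = instance k (k >= 1).
    # Background costs 1.0; inside an instance the cost decreases with
    # distance from the boundary, reaching 0.0 at its deepest pixel(s).
    cost = np.ones(label_map.shape, dtype=np.float64)
    for k in np.unique(label_map):
        if k == 0:
            continue
        depth = interior_depth(label_map == k)
        inside = depth > 0
        cost[inside] = 1.0 - depth[inside] / depth.max()
    return cost
```

<p>Here a lower cost marks a more attractive pick location; if the pipeline instead expects an affordance map where higher is better, <code>1.0 - cost<\/code> is the natural counterpart.<\/p>\n\n\n\n<p>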
An example input and output of UCN in the pipeline is shown below.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"809\" height=\"640\" src=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-47-30.png\" alt=\"\" class=\"wp-image-164\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-47-30.png 809w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-47-30-300x237.png 300w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-47-30-768x608.png 768w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-47-30-230x182.png 230w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-47-30-350x277.png 350w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/Screenshot-from-2022-12-20-04-47-30-480x380.png 480w\" sizes=\"auto, (max-width: 809px) 100vw, 809px\" \/><figcaption>Cost map generated from instance segmentation masks in the Mujin pipeline.<\/figcaption><\/figure>\n\n\n\n<p>The video below shows the full pipeline in action: the robot picks up an unseen object from the container using the detections provided by our fine-tuned UCN model.<\/p>\n\n\n\n<p><a href=\"https:\/\/www.youtube.com\/watch?v=nDMNgQCBxmk\">https:\/\/www.youtube.com\/watch?v=nDMNgQCBxmk<\/a><\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><a href=\"https:\/\/www.youtube.com\/watch?v=nDMNgQCBxmk\"><img loading=\"lazy\" decoding=\"async\" width=\"268\" height=\"477\" 
src=\"http:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image.jpg\" alt=\"\" class=\"wp-image-163\" srcset=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image.jpg 268w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-169x300.jpg 169w, https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/12\/image-230x409.jpg 230w\" sizes=\"auto, (max-width: 268px) 100vw, 268px\" \/><\/a><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Abstract Object pose estimation is a fundamental requirement for robotic manipulation tasks. Most methods that have high accuracy to be deployed in [&hellip;]<\/p>\n","protected":false},"author":134,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-48","page","type-page","status-publish","hentry"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v27.4 - https:\/\/yoast.com\/product\/yoast-seo-wordpress\/ -->\n<title>Technical Report - Automate Pose Estimation for Robotics Manipulation<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Technical Report - Automate Pose Estimation for Robotics Manipulation\" \/>\n<meta property=\"og:description\" content=\"Abstract Object pose estimation is a fundamental requirement for robotic manipulation tasks. 
Most methods that have high accuracy to be deployed in [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/\" \/>\n<meta property=\"og:site_name\" content=\"Automate Pose Estimation for Robotics Manipulation\" \/>\n<meta property=\"article:modified_time\" content=\"2022-12-20T10:10:58+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-1024x511.png\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data1\" content=\"25 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/technical-report\\\/\",\"url\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/technical-report\\\/\",\"name\":\"Technical Report - Automate Pose Estimation for Robotics 
Manipulation\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/technical-report\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/technical-report\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/wp-content\\\/uploads\\\/sites\\\/68\\\/2022\\\/04\\\/image-1-1024x511.png\",\"datePublished\":\"2022-04-30T22:06:28+00:00\",\"dateModified\":\"2022-12-20T10:10:58+00:00\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/technical-report\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/technical-report\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/technical-report\\\/#primaryimage\",\"url\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/wp-content\\\/uploads\\\/sites\\\/68\\\/2022\\\/04\\\/image-1.png\",\"contentUrl\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/wp-content\\\/uploads\\\/sites\\\/68\\\/2022\\\/04\\\/image-1.png\",\"width\":1237,\"height\":617},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/technical-report\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Technical Report\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/#website\",\"url\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/\",\"name\":\"Automate Pose Estimation for Robotics Manipulation\",\"description\":\"CMU MSCV 2022 Capstone Project - Group 
13\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/mscvprojects.ri.cmu.edu\\\/2022team13\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Technical Report - Automate Pose Estimation for Robotics Manipulation","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/","og_locale":"en_US","og_type":"article","og_title":"Technical Report - Automate Pose Estimation for Robotics Manipulation","og_description":"Abstract Object pose estimation is a fundamental requirement for robotic manipulation tasks. Most methods that have high accuracy to be deployed in [&hellip;]","og_url":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/","og_site_name":"Automate Pose Estimation for Robotics Manipulation","article_modified_time":"2022-12-20T10:10:58+00:00","og_image":[{"url":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-1024x511.png","type":"","width":"","height":""}],"twitter_card":"summary_large_image","twitter_misc":{"Est. 
reading time":"25 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/","url":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/","name":"Technical Report - Automate Pose Estimation for Robotics Manipulation","isPartOf":{"@id":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/#website"},"primaryImageOfPage":{"@id":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/#primaryimage"},"image":{"@id":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/#primaryimage"},"thumbnailUrl":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1-1024x511.png","datePublished":"2022-04-30T22:06:28+00:00","dateModified":"2022-12-20T10:10:58+00:00","breadcrumb":{"@id":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/#primaryimage","url":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1.png","contentUrl":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-content\/uploads\/sites\/68\/2022\/04\/image-1.png","width":1237,"height":617},{"@type":"BreadcrumbList","@id":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/technical-report\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/"},{"@type":"ListItem","position":2,"name":"Technical Report"}]},{"@type":"WebSite","@id":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/#website","url":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/","name":"Automate Pose Estimation for Robotics Manipulation","description":"CMU MSCV 2022 Capstone Project - 
Group 13","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"}]}},"_links":{"self":[{"href":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-json\/wp\/v2\/pages\/48","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-json\/wp\/v2\/users\/134"}],"replies":[{"embeddable":true,"href":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-json\/wp\/v2\/comments?post=48"}],"version-history":[{"count":18,"href":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-json\/wp\/v2\/pages\/48\/revisions"}],"predecessor-version":[{"id":177,"href":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-json\/wp\/v2\/pages\/48\/revisions\/177"}],"wp:attachment":[{"href":"https:\/\/mscvprojects.ri.cmu.edu\/2022team13\/wp-json\/wp\/v2\/media?parent=48"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}