CUBE-LLM
This recent paper from NVIDIA and UT Austin (https://arxiv.org/abs/2405.03685) caught my eye. If you are a researcher or practitioner in image understanding, augmented reality, or robotics, you should take a look.
The paper proposes a multimodal LLM, called Cube-LLM, for 3D object detection from 2D images, achieved without any significant architecture changes by training the model on a unified and augmented collection of 2D and 3D vision-language datasets. Traditional object detection methods output either 2D bounding boxes or pixel masks. While this works for most applications, the output discards critical 3D information encoded in the image. Cube-LLM can output a 3D bounding box, which promises a higher-quality, more nuanced image understanding. The model also gives competitive results on several visual question-answering benchmarks.
The model can be used for robotic automation, including autonomous vehicles (estimating proximity to different objects), and for AR rendering in cluttered spaces (depth estimation, occlusion) in games and e-commerce AR. The model is trained with image sizes of up to 672 × 672, which is restrictive but can be extended to larger image sizes. The model also benefits from chain-of-thought (CoT) reasoning, i.e., when prompted to generate 2D outputs before 3D ones, which can come in handy for 3D applications like depth estimation from a single image. Lastly, the model can be extended to other vision-language tasks (e.g., video understanding) and to applications like virtual product or ad placement in videos.
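To make the 2D-then-3D chain-of-thought idea concrete, here is a minimal sketch of how such a two-step prompt could be assembled. The prompt wording and the box parameterization (x, y, z, w, h, l, yaw) are my own illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical illustration of 2D-then-3D chain-of-thought prompting.
# The phrasing and coordinate format are assumptions for illustration,
# not the exact prompt templates used by Cube-LLM.

def build_cot_prompts(object_name: str) -> list[str]:
    """Build a two-step conversation: first ask for a 2D box, then a 3D box."""
    return [
        f"Provide the 2D bounding box of the {object_name} in the image.",
        f"Using that 2D box, provide the 3D bounding box "
        f"(x, y, z, w, h, l, yaw) of the {object_name}.",
    ]

prompts = build_cot_prompts("pedestrian")
for turn, prompt in enumerate(prompts, start=1):
    print(f"Turn {turn}: {prompt}")
```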
Here are some technical details -
Uses LLaVA-1.5 with the CLIP vision encoder replaced by a DINOv2 vision encoder. DINOv2 is not aligned with text (unlike CLIP), but the performance degradation is minimal.
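As a rough sketch of that LLaVA-style stack (vision encoder -> projector -> LLM), here is a minimal PyTorch skeleton. The dimensions, projector shape, and module choices are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of a LLaVA-style multimodal stack: a vision encoder
# (e.g. DINOv2 in place of CLIP) feeding patch features through an MLP
# projector into an LLM. Dimensions are illustrative assumptions.

class MultimodalStack(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder        # swapped-in DINOv2-style encoder
        self.projector = nn.Sequential(             # maps patch features to LLM space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        patch_feats = self.vision_encoder(image)     # (B, num_patches, vision_dim)
        visual_tokens = self.projector(patch_feats)  # (B, num_patches, llm_dim)
        # Visual tokens are prepended to the text embeddings and decoded jointly.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))
```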
The model is trained on question-answer data covering many input/output combinations: text -> 2D box/center, 2D box/center -> text, text -> 3D box, 2D center -> depth/3D box, etc. All outputs are text tokens generated auto-regressively, including the numeric values of the 2D and 3D boxes and centers.
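Since everything is emitted as plain text tokens, a box is just a formatted string the model learns to generate. Here is a hedged sketch of such a serialization; the field order, rounding, and question/answer phrasing are assumptions, not Cube-LLM's exact format.

```python
# Hypothetical serialization of boxes into answer strings that the LLM
# generates token by token. Field order and rounding are illustrative.

def format_2d_box(x1, y1, x2, y2, ndigits=1):
    vals = [x1, y1, x2, y2]
    return "[" + ", ".join(str(round(v, ndigits)) for v in vals) + "]"

def format_3d_box(x, y, z, w, h, l, yaw, ndigits=2):
    vals = [x, y, z, w, h, l, yaw]
    return "[" + ", ".join(str(round(v, ndigits)) for v in vals) + "]"

# One training sample: question text in, serialized box text out.
sample = {
    "question": "Where is the car? Answer with a 3D bounding box.",
    "answer": format_3d_box(2.1, 0.8, 14.5, 1.9, 1.6, 4.3, 0.12),
}
print(sample)
```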
To further research in this domain, the paper also introduces a new dataset, LV3D, formulated for multi-turn question answering: the model receives an image and a series of questions and answers about the 3D scene it depicts.
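For a sense of what a multi-turn, LV3D-style sample could look like, here is a hedged sketch. The keys, file name, and conversation content are made up for illustration; consult the LV3D release for the real schema.

```python
# Hypothetical structure of a multi-turn sample: one image and several
# question/answer turns about the 3D scene. All values are illustrative.

lv3d_style_sample = {
    "image": "frame_000123.jpg",
    "conversations": [
        {"question": "How many cars are visible?",
         "answer": "Three."},
        {"question": "Give the 2D bounding box of the nearest car.",
         "answer": "[412.0, 318.5, 596.2, 455.0]"},
        {"question": "Now give its 3D bounding box (x, y, z, w, h, l, yaw).",
         "answer": "[1.8, 1.1, 9.7, 1.9, 1.5, 4.4, -0.05]"},
    ],
}

for turn in lv3d_style_sample["conversations"]:
    print("Q:", turn["question"])
    print("A:", turn["answer"])
```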