We introduce a multi-stage pipeline for automatic 3D captioning. Starting with human metadata and rendered multi-view images, we create detailed visual descriptions using InternVL2, capturing object names, shapes, textures, colors, and contextual environments. Qwen2 then processes these into five hierarchical levels, progressively compressing different aspects of the 3D assets.
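As a rough picture of the flow, the sketch below chains the two stages. It is a minimal illustration, not the released code: the `vlm` and `llm` callables are hypothetical stand-ins for whatever inference wrappers drive InternVL2 and Qwen2, and `Asset` simply bundles the renders and metadata described above.

from dataclasses import dataclass
from typing import Callable, List

# Hypothetical inference wrappers: `vlm` maps (image paths, prompt) -> text
# (InternVL2), `llm` maps prompt -> text (Qwen2). Any backend can supply them.
VLM = Callable[[List[str], str], str]
LLM = Callable[[str], str]

@dataclass
class Asset:
    uid: str
    view_paths: List[str]   # rendered multi-view images of the 3D asset
    metadata: str           # filtered human metadata from the source dataset

def dense_description(asset: Asset, vlm: VLM) -> str:
    # Stage 1: a single detailed description from the multi-view renders,
    # grounded by the creator-provided metadata.
    prompt = (
        "Describe this 3D object in detail: object names, shape, texture, "
        f"color, and surrounding context. Creator metadata: {asset.metadata}"
    )
    return vlm(asset.view_paths, prompt)

Stage 2, the hierarchical compression with Qwen2, is sketched in the next section.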
Our multi-level annotation framework introduces a hierarchical visual elaboration strategy, enabling flexible descriptions at varying levels of detail. Instead of rigid prompts, we adopt a progressive prompting approach, generating context-aware outputs that range from richly detailed geometric descriptions to concise tags for rapid prototyping. Each level incrementally distills complexity, supporting adaptive content generation for tasks requiring anything from comprehensive modeling insights to minimal semantic cues.
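The progressive compression can be pictured as a loop over level-specific prompts, where each level is produced from the one above it. The level targets below are illustrative paraphrases of the hierarchy, not the exact prompts used in the pipeline.

from typing import Callable, Dict

def multilevel_captions(dense: str, llm: Callable[[str], str]) -> Dict[int, str]:
    """Progressively distill a dense description into five levels (illustrative)."""
    # Illustrative level targets, from most to least detailed.
    level_targets = {
        1: "a comprehensive description covering geometry, texture, and context",
        2: "a detailed description of shape, material, and color",
        3: "a functional, mid-length description",
        4: "a short summary sentence",
        5: "a concise tag of a few words",
    }
    captions: Dict[int, str] = {}
    previous = dense
    for level, target in level_targets.items():
        prompt = (
            f"Rewrite the following 3D asset description as {target}. "
            f"Keep only the most important information:\n{previous}"
        )
        captions[level] = llm(prompt)
        previous = captions[level]
    return captions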
High-quality 3D annotations require more than visual accuracy; they need domain-aware context. We leverage user-generated metadata from source datasets to enrich descriptions with relevant semantics. A filtering stage retains only clean, domain-specific terms like technical labels or character names. While optional, this significantly enhances annotation quality.
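One simple way to realize such a filter is a keyword-level cleanup pass that keeps readable, domain-looking terms and drops URLs, file names, and hash-like identifiers. The heuristics below are an illustrative approximation, not the exact rules used for MARVEL-40M+.

import re
from typing import List

def filter_metadata_terms(raw_terms: List[str]) -> List[str]:
    """Keep clean, domain-specific metadata terms; drop noisy entries (illustrative)."""
    kept = []
    for term in raw_terms:
        term = term.strip()
        # Drop URLs, file names, and hash-like identifiers.
        if re.search(r"https?://|\.\w{2,4}$|^[0-9a-f]{16,}$", term, re.IGNORECASE):
            continue
        # Keep short, readable phrases (e.g., technical labels, character names).
        if 1 <= len(term.split()) <= 4 and re.fullmatch(r"[\w\s\-']+", term):
            kept.append(term)
    return kept

# Example: noisy user tags reduced to usable domain terms.
print(filter_metadata_terms(["Iron Man", "low-poly", "https://example.com/model", "a1b2c3d4e5f6a7b8c9d0"]))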
MARVEL-FX3D is our Text-to-3D pipeline that fine-tunes Stable Diffusion 3.5 on MARVEL captions of the Objaverse dataset and uses a pretrained Stable Fast 3D model to generate a textured mesh in just 15 seconds.
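The two-step inference can be approximated with off-the-shelf components: a Stable Diffusion 3.5 pipeline for text-to-image, followed by Stable Fast 3D for image-to-mesh. This is a sketch under assumptions: the checkpoint name below points at the public base SD 3.5 weights rather than our fine-tuned ones, and `image_to_mesh` is a hypothetical placeholder for the Stable Fast 3D inference entry point from its repository.

import torch
from diffusers import StableDiffusion3Pipeline
from PIL import Image

# Text-to-image with Stable Diffusion 3.5 (base weights shown; MARVEL-FX3D
# would load the variant fine-tuned on MARVEL captions instead).
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-medium", torch_dtype=torch.bfloat16
).to("cuda")

def image_to_mesh(image: Image.Image, out_path: str) -> None:
    # Hypothetical stand-in for the Stable Fast 3D image-to-mesh step;
    # in practice this would call the SF3D inference code.
    raise NotImplementedError

prompt = "a weathered bronze statue of a sitting fox, detailed fur, studio lighting"
image = pipe(prompt, num_inference_steps=28, guidance_scale=4.5).images[0]
image_to_mesh(image, "fox.glb")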
We evaluate annotation quality and pipeline performance across multiple metrics, comparing against existing datasets and models.
@inproceedings{sinha2025marvel,
  title     = {MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation},
  author    = {Sinha, Sankalp and Khan, Mohammad Sadil and Usama, Muhammad and Sam, Shino and Stricker, Didier and Ali, Sk Aziz and Afzal, Muhammad Zeshan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages     = {8105--8116},
  year      = {2025}
}