CVPR 2025

MARVEL-40M+
Multi-Level Visual Elaboration for
High-Fidelity Text-to-3D Content Creation

40M+ captions · ~9M 3D models · 7 datasets · 5 annotation levels
Sankalp Sinha1*, Mohammad Sadil Khan1*, Muhammad Usama1, Shino Sam1,
Didier Stricker1, Sk Aziz Ali2, Muhammad Zeshan Afzal1
1 DFKI · RPTU · MindGarage · 2 BITS Pilani, Hyderabad
[Figure: MARVEL-40M+ teaser]

MARVEL-40M+ is a large-scale 3D captioning dataset with more than 40M captions for ~9M 3D models, featuring high-quality, domain-specific, multi-level text descriptions of 3D assets.

🌟 What We Introduce

🗂️ MARVEL-40M+ Dataset
40M+ annotations for 8.9M 3D assets across 7 datasets, generated via an automatic pipeline.
🧩 Multi-Level Annotations
Five hierarchical levels, from fine-grained geometric detail down to concise tags.
🧬 Domain-Specific Annotations
Human metadata for domain context and reduced VLM hallucinations.
🎨 MARVEL-FX3D Pipeline
Two-stage text-to-3D framework that generates a textured mesh in 15 seconds.

🛠️ MARVEL Annotation Pipeline

We introduce a multi-stage pipeline for automatic 3D captioning. Starting with human metadata and rendered multi-view images, we create detailed visual descriptions using InternVL2 that capture object names, shapes, textures, colors, and contextual environments. Qwen2 then processes these descriptions into five hierarchical levels, progressively compressing different aspects of the 3D assets.
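The two stages described above can be sketched as a small data flow. This is only an illustration under stated assumptions: `Asset`, `annotate`, `toy_describe`, and `toy_compress` are hypothetical names, the toy functions stand in for InternVL2 and Qwen2 without calling either model, and each level is derived directly from the dense description, which may differ from the actual prompting scheme.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-ins for the models named in the text:
# a VLM (InternVL2 in the source) and an LLM (Qwen2 in the source).
DescribeFn = Callable[[List[str], str], str]  # (view paths, metadata) -> dense description
CompressFn = Callable[[str, int], str]        # (dense description, level) -> caption

NUM_LEVELS = 5  # the text states five hierarchical levels


@dataclass
class Asset:
    asset_id: str
    view_paths: List[str]  # rendered multi-view images
    metadata: str          # human-written metadata from the source dataset


def annotate(asset: Asset, describe: DescribeFn, compress: CompressFn) -> Dict[int, str]:
    """Stage 1 builds a dense description; stage 2 derives one caption per level."""
    dense = describe(asset.view_paths, asset.metadata)
    # Assumption: every level is compressed from the dense text, not from the previous level.
    return {level: compress(dense, level) for level in range(1, NUM_LEVELS + 1)}


# Toy stubs so the sketch runs without any model weights.
def toy_describe(view_paths: List[str], metadata: str) -> str:
    return (f"{metadata}: object seen from {len(view_paths)} views "
            "with shape, texture and color details")


def toy_compress(dense: str, level: int) -> str:
    words = dense.split()
    keep = max(1, len(words) // level)  # higher level number -> shorter text
    return " ".join(words[:keep])


asset = Asset("demo-0", ["v0.png", "v1.png", "v2.png", "v3.png"], "Knife Gate Valve")
captions = annotate(asset, toy_describe, toy_compress)
for level, text in captions.items():
    print(level, text)
```

Passing the models in as callables keeps the control flow testable on its own; real model calls would replace the two toy functions.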

[Figure: MARVEL annotation pipeline]

🧩 Multi-Level Annotations

Our multi-level annotation framework introduces a hierarchical visual elaboration strategy, enabling flexible descriptions at varying levels of detail. Instead of rigid prompts, we adopt a progressive prompting approach that generates context-aware outputs ranging from richly detailed geometric descriptions to concise tags for rapid prototyping. Each level incrementally distills complexity, supporting adaptive content generation for tasks requiring anything from comprehensive modeling insights to minimal semantic cues.
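One way a downstream task might consume such a hierarchy is to pick the most detailed level that fits a length budget. The sketch below is hypothetical: `MultiLevelCaption` and `pick` are illustrative names, the numbering (level 1 most detailed, level 5 tags) is an assumption, and the example texts are made up for illustration rather than taken from the dataset.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class MultiLevelCaption:
    # Assumption: level 1 is the most detailed text and level 5 the shortest (tags).
    levels: Dict[int, str]

    def pick(self, max_words: int) -> str:
        """Return the most detailed level whose word count fits within max_words."""
        for level in sorted(self.levels):
            text = self.levels[level]
            if len(text.split()) <= max_words:
                return text
        # Nothing fits: fall back to the last (tag) level.
        return self.levels[max(self.levels)]


# Illustrative texts only, not actual dataset entries.
caption = MultiLevelCaption({
    1: "A wooden crate with a flat lid, four sides, a bottom, and metal "
       "reinforcement bars, with visible wood grain",
    2: "A wooden crate with a flat lid and metal reinforcement bars",
    3: "A wooden crate with metal bars",
    4: "Wooden crate",
    5: "crate, wood, metal",
})

print(caption.pick(6))    # a short prompt budget
print(caption.pick(100))  # a generous budget returns the most detailed level
```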

[Figure: Multi-level annotation]

🧬 Domain-Specific Annotations

High-quality 3D annotations require more than visual accuracy; they also need domain-aware context. We leverage user-generated metadata from the source datasets to enrich descriptions with relevant semantics. A filtering stage retains only clean, domain-specific terms such as technical labels or character names. This step is optional, but it significantly enhances annotation quality.
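The text does not say how the filtering stage is implemented, so the following is only a rule-based sketch of the idea of keeping clean, domain-specific terms. The URL pattern, the `GENERIC` word list, and `filter_metadata` are all hypothetical choices for illustration and are not the authors' filter.

```python
import re
from typing import List

# Hypothetical noise pattern and generic-word list; the source does not specify its filter.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
GENERIC = {"free", "download", "model", "3d", "lowpoly", "asset", "please", "like", "subscribe"}


def filter_metadata(name: str, tags: List[str]) -> List[str]:
    """Keep the asset name plus tags that look domain-specific."""
    kept: List[str] = []
    clean_name = URL_RE.sub("", name).strip()
    if clean_name:
        kept.append(clean_name)
    for tag in tags:
        term = tag.strip().lstrip("#")
        if not term or URL_RE.search(term):
            continue  # drop empty tags and links
        if term.lower() in GENERIC:
            continue  # drop generic marketplace words
        if term.lower() not in (k.lower() for k in kept):
            kept.append(term)  # keep each term once, case-insensitively
    return kept


terms = filter_metadata(
    "Redcap Oranda goldfish",
    ["#fish", "free", "3D", "http://example.com/x", "Oranda", "fish"],
)
print(terms)
```

The surviving terms could then be passed to the captioning model as context, which is where a name like "Redcap Oranda" would enter the description.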

Anime Characters

The 3D model features Roronoa Zoro from One Piece, holding three swords, with a blue dragon coiling around him. A water splash surrounds them, adding intensity to the dynamic scene.

Game Characters

The 3D model shows Ghost Jawbone from Call of Duty Mobile. He wears a black helmet, a brownish-khaki cape, a dark green shirt with a tattoo, an olive green vest with a red pouch, olive green cargo pants, and dark combat boots. The character stands upright with a flowing cape.

Automotive

The Harley Davidson XR1200x is a black motorcycle with a sturdy frame, V-twin engine, and dual exhaust pipes. It has a black leather seat, raised handlebars, and a round headlight.

CAD Model

The Keystone 6" Knife Gate Valve is a cylindrical industrial valve with a rectangular gearbox and a handwheel for manual operation. It has symmetrical bolt holes and is made of smooth, gray metal. Used for controlling fluid flow in harsh conditions.

Animal

A 3D model of a Redcap Oranda goldfish with a rounded body, bright red cap, and delicate, feathery fins. It has detailed eyes and a flowing tail fin.

Flower

A 3D model of a Marsh Marigold with yellow flowers and green, heart-shaped leaves. The flowers have five petals and a center with stamens. Slender stems support the leaves and flowers.

Atomic Structure

Vitamin B2 (Riboflavin) is a complex molecule with a central ring system and extending side chains. It uses color coding to differentiate atoms: carbon (grey), oxygen (red), nitrogen (blue), and hydrogen (white).

Food

Three pieces of salmon nigiri sushi on a wooden board, with a small bowl of soy sauce. The sushi has oval rice balls topped with pink salmon, green and red garnishes.

Daily Object

The Siemens Coffee Machine is a sleek, rectangular device with a bean hopper on top, a control panel below, a transparent water tank on the left, and a metallic drip tray at the bottom.

Mythological Character

The 3D model shows Perseus, a human figure in ancient Greek armor, riding a white winged horse named Pegasus. They appear to be flying, with Perseus holding a sword and shield.

Monument

The Monument of Dante Alighieri shows a standing figure of Dante in flowing robes, holding a book. It stands on a decorated stone plinth and is made of smooth, polished marble. Located outdoors in Padua, Italy.

Scene

A colorful flamingo stands in a tranquil pond, surrounded by green lily pads with white flowers and tall grasses. The pond water is glossy, and the flamingo has a rainbow gradient on its wings.

Digital Surface

The 3D model shows the Glaciar Perito Moreno in Santa Cruz, Argentina. It features jagged mountains, a large turquoise lake, and ice formations extending into the water. The surrounding area has green grass and brown rocks.

🎨 MARVEL-FX3D

[Figure: MARVEL-FX3D pipeline]

MARVEL-FX3D is our text-to-3D pipeline. It fine-tunes Stable Diffusion 3.5 on the MARVEL captions of the Objaverse dataset and then uses a pretrained Stable Fast 3D model to generate a textured mesh in just 15 seconds.
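The two-stage structure (text to image, then image to mesh) can be sketched as a simple composition. This is a structural illustration only: `text_to_3d`, `fake_text_to_image`, and `fake_image_to_mesh` are hypothetical names, and the stubs return placeholder dictionaries instead of calling Stable Diffusion 3.5 or Stable Fast 3D.

```python
import time
from typing import Any, Callable, Tuple

TextToImage = Callable[[str], Any]  # stage 1: fine-tuned Stable Diffusion 3.5 in the source
ImageToMesh = Callable[[Any], Any]  # stage 2: pretrained Stable Fast 3D in the source


def text_to_3d(prompt: str,
               text_to_image: TextToImage,
               image_to_mesh: ImageToMesh) -> Tuple[Any, float]:
    """Run both stages and report wall-clock time in seconds."""
    start = time.perf_counter()
    image = text_to_image(prompt)  # stage 1: caption -> single image
    mesh = image_to_mesh(image)    # stage 2: image -> textured mesh
    return mesh, time.perf_counter() - start


# Placeholder stubs so the sketch runs without model weights.
def fake_text_to_image(prompt: str) -> dict:
    return {"prompt": prompt, "size": (512, 512)}


def fake_image_to_mesh(image: dict) -> dict:
    return {"source": image, "textured": True}


prompt = "A wooden crate with a flat lid and metal reinforcement bars."
mesh, seconds = text_to_3d(prompt, fake_text_to_image, fake_image_to_mesh)
print(mesh["textured"], f"{seconds:.4f}s")
```

Timing the whole call is how an end-to-end figure such as the reported 15 seconds would be measured, though the stub timing here says nothing about the real models.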

🧊 Text-to-3D Generation Results

A traditional spear with a decorative spearhead, a smooth metallic shaft, and a teardrop-shaped gem. Primarily metallic blue with a cyan turquoise gem.

A rocky landscape with a large, dark blue, jagged rock and smaller, smooth gray boulders. A small, bright cyan dot is in the upper part.

The Epilog Fusion Laser Cutter, a rectangular machine with a transparent blue cover, hinged lid, and control panel. The base plate supports the material being cut.

A segment of an old stone wall with irregularly shaped stones. Rough, uneven texture with moss and plants growing on top. Colors are mostly gray and beige with green patches.

A pair of traditional Japanese throwing knives called kunai. Each has a long, pointed blade and a cylindrical handle with a circular ring. Surface is smooth and matte, uniformly white.

A tall palm tree with a slender trunk and a wide base, topped with large, feather-like fronds. Small rocks and plants surround the base on sandy ground.

A wooden crate with a flat lid, four sides, a bottom, and metal reinforcement bars. Visible wood grain and dark, aged metal. One side features green graffiti text.

A destroyed 2S3 Akatsiya cannon, elongated and cylindrical with curved barrel and angular turret. Heavily damaged, rusted, and charred in a desolate battlefield.

📊 Quantitative Evaluation

We evaluate annotation quality and pipeline performance across multiple metrics, comparing against existing datasets and models.

BibTeX

@inproceedings{sinha2025marvel,
  title     = {MARVEL-40M+: Multi-Level Visual Elaboration for High-Fidelity Text-to-3D Content Creation},
  author    = {Sinha, Sankalp and Khan, Mohammad Sadil and Usama, Muhammad and Sam, Shino and Stricker, Didier and Ali, Sk Aziz and Afzal, Muhammad Zeshan},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages     = {8105--8116},
  year      = {2025}
}