Teaser image.

MARVEL-40M+ is a large-scale 3D captioning dataset with more than 40M captions for 8.9M 3D models, featuring high-quality, domain-specific, multi-level text descriptions of 3D assets.

🌟 Contributions

  • MARVEL-40M+: We introduce MARVEL-40M+, the largest 3D captioning dataset, containing 40M+ annotations for 8.9M 3D assets from 7 datasets (Objaverse, ShapeNet, Pix3D, ABO, OmniObject3D, Toys4K, GSO). MARVEL annotations are generated by an automatic pipeline built on open-source VLMs and LLMs.
  • Multi-Level Annotations: We propose a multi-level annotation structure that spans from detailed descriptions for fine-grained 3D reconstruction to concise tags for quick modeling.
  • Domain-Specific Annotations: We incorporate human metadata from source datasets into our pipeline to inject domain-specific information in the text descriptions and reduce VLM hallucinations.
  • MARVEL-FX3D: We introduce MARVEL-FX3D, a two-stage framework for high-fidelity text-to-3D (TT3D) generation.

🛠️ Marvel Annotation Pipeline

We introduce a multi-stage pipeline for automatic 3D captioning. Our pipeline starts with human metadata and rendered multi-view images to create detailed visual descriptions using InternVL2. These contain object names, shapes, textures, colors, and contextual environments. Qwen2 then processes these descriptions into five hierarchical levels, progressively compressing different aspects of the 3D assets.

Marvel Annotation Pipeline
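
The two stages described above can be sketched as a small data flow. This is an illustrative outline only, not the released pipeline: the `Asset` type, the prompt wording, and the `vlm` / `llm` callables (stand-ins for InternVL2 and Qwen2) are hypothetical, and chaining each level from the previous one is an assumption about how the "progressive compression" is realized.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interfaces: a VLM call takes (prompt, image paths), an LLM call takes a prompt.
VLM = Callable[[str, List[str]], str]
LLM = Callable[[str], str]

NUM_LEVELS = 5  # the pipeline produces five hierarchical levels


@dataclass
class Asset:
    asset_id: str
    view_paths: List[str]  # rendered multi-view images
    metadata: Dict[str, str] = field(default_factory=dict)  # human metadata, may be empty


def describe(asset: Asset, vlm: VLM) -> str:
    """Stage 1: dense visual description from the views plus metadata hints."""
    hints = "; ".join(f"{k}: {v}" for k, v in sorted(asset.metadata.items()))
    prompt = "Describe the object's name, shape, texture, color, and environment."
    if hints:
        prompt += f" Known metadata: {hints}."
    return vlm(prompt, asset.view_paths)


def compress(description: str, llm: LLM) -> List[str]:
    """Stage 2: five levels, each rewritten from the previous text (assumed chaining)."""
    levels: List[str] = []
    current = description
    for level in range(1, NUM_LEVELS + 1):
        current = llm(
            f"Rewrite as level {level} of {NUM_LEVELS} (higher levels are shorter):\n{current}"
        )
        levels.append(current)
    return levels


def annotate(asset: Asset, vlm: VLM, llm: LLM) -> List[str]:
    return compress(describe(asset, vlm), llm)


# Toy stand-ins so the sketch runs without any model weights.
def fake_vlm(prompt: str, images: List[str]) -> str:
    return " ".join(f"word{i}" for i in range(32))


def fake_llm(prompt: str) -> str:
    words = prompt.split("\n", 1)[1].split()
    return " ".join(words[: max(1, len(words) * 2 // 3)])  # keep about two thirds


asset = Asset("demo", ["view_0.png", "view_1.png"], {"name": "coffee machine"})
levels = annotate(asset, fake_vlm, fake_llm)
```

Passing the models in as callables keeps the flow testable with toy stand-ins and leaves the choice of inference backend open.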

🧩 Multi-Level Annotations

Our multi-level annotation framework uses a hierarchical visual elaboration strategy that describes 3D assets at varying levels of detail. Instead of relying on rigid, direct prompts, which often limit expressive capability, we adopt a progressive prompting approach. This lets models generate context-aware outputs for diverse use cases, from richly detailed geometric descriptions to concise tags for rapid prototyping. Each level incrementally reduces detail, so the same asset can serve tasks that need anything from comprehensive modeling insights to minimal semantic cues.

Multi Level Annotation
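
One way a downstream task might use the hierarchy is to pick the most detailed level that fits a length budget, for example a text encoder's input limit. The helper below is a hypothetical sketch: the name `pick_level`, the whitespace word count, and the ordering (level 1 most detailed, last level most concise) are assumptions, and the example captions are invented for illustration rather than taken from the dataset.

```python
from typing import Sequence


def pick_level(levels: Sequence[str], max_words: int) -> int:
    """Return the 1-based index of the most detailed level within max_words.

    Assumes levels run from most detailed (level 1) to most concise.
    Falls back to the last level when nothing fits the budget.
    """
    if not levels:
        raise ValueError("levels must not be empty")
    for index, text in enumerate(levels, start=1):
        if len(text.split()) <= max_words:
            return index
    return len(levels)


# Invented example captions, ordered from detailed to concise.
example_levels = [
    "A sleek rectangular coffee machine with a bean hopper on top, "
    "a control panel below, and a transparent water tank on the left.",
    "A black coffee machine with a bean hopper and a transparent water tank.",
    "Black coffee machine with transparent tank.",
    "Black coffee machine.",
    "coffee machine",
]

chosen = pick_level(example_levels, max_words=15)  # the second caption fits, the first does not
```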

🧬 Domain-Specific Annotations

High-quality 3D annotations require more than visual accuracy; they also need domain-aware context. We use user-generated metadata from the source datasets to enrich descriptions with relevant semantics. Because this metadata often contains noise or sensitive content, we apply a filtering stage that retains only clean, domain-specific terms such as technical labels or character names. The metadata is optional: it improves annotation quality, but the pipeline works without it.
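
The page does not say how the filtering stage is implemented. As a rough illustration of the idea only, the sketch below uses simple rules: it strips URLs and e-mail addresses, then drops fields that are empty, overly long, or contain a blocklisted word. The function name, the rules, and the thresholds are all assumptions, not the authors' filter.

```python
import re
from typing import Dict, Iterable

# Matches URLs and e-mail addresses, which carry no domain-specific semantics.
URL_OR_EMAIL = re.compile(r"(https?://\S+|www\.\S+|\S+@\S+\.\S+)", re.IGNORECASE)


def clean_metadata(
    metadata: Dict[str, str],
    blocklist: Iterable[str],
    max_words: int = 12,
) -> Dict[str, str]:
    """Keep only short, clean metadata fields (hypothetical rule-based filter)."""
    blocked = {word.lower() for word in blocklist}
    cleaned: Dict[str, str] = {}
    for key, value in metadata.items():
        text = " ".join(URL_OR_EMAIL.sub(" ", value).split())  # strip links, collapse spaces
        words = text.split()
        if not words or len(words) > max_words:
            continue  # empty or too long to be a label or name
        if any(word.lower().strip(".,!?") in blocked for word in words):
            continue  # promotional or otherwise unwanted wording
        cleaned[key] = text
    return cleaned


raw = {
    "name": "Harley Davidson XR1200x",
    "description": "Download more at https://example.com free models",
    "tags": "",
}
kept = clean_metadata(raw, blocklist=["download", "free"])
```

In this toy run only the `name` field survives, which is the kind of technical label the filtering stage is meant to retain.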

Anime Characters

The 3D model features Roronoa Zoro from One Piece, holding three swords, with a blue dragon coiling around him. A water splash surrounds them, adding intensity to the dynamic scene.

Game Characters

The 3D model shows Ghost Jawbone from Call of Duty Mobile. He wears a black helmet, a brownish-khaki cape, a dark green shirt with a tattoo, an olive green vest with a red pouch, olive green cargo pants, and dark combat boots. The character stands upright with a flowing cape.

Automotive

The Harley Davidson XR1200x is a black motorcycle with a sturdy frame, V-twin engine, and dual exhaust pipes. It has a black leather seat, raised handlebars, and a round headlight. The bike is streamlined and aggressive-looking, with disc brakes and visible suspension.

CAD Model

The Keystone 6" Knife Gate Valve is a cylindrical industrial valve with a rectangular gearbox and a handwheel for manual operation. It has symmetrical bolt holes and is made of smooth, gray metal. Used for controlling fluid flow in harsh conditions.

Flower

A 3D model of a Marsh Marigold with yellow flowers and green, heart-shaped leaves. The flowers have five petals and a center with stamens. Slender stems support the leaves and flowers. The model is bright and simple.

Atomic Structure

Vitamin B2 (Riboflavin) is a complex molecule with a central ring system and extending side chains. It has an elongated, irregular shape and uses color coding to differentiate atoms: carbon (grey/black), oxygen (red), nitrogen (blue), and hydrogen (white/grey).

Food

Three pieces of salmon nigiri sushi on a wooden board, with a small bowl of soy sauce. The sushi has oval rice balls topped with pink salmon, green and red garnishes. The wooden board has rounded edges, and the soy sauce bowl is circular and matte. A small green spot is on the board.

Daily Object

The Siemens Coffee Machine is a sleek, rectangular device with a bean hopper on top, a control panel below, a transparent water tank on the left, and a metallic drip tray at the bottom. It is mostly black with a transparent tank and a silver tray.

Mythological Character

The 3D model shows Perseus, a human figure in ancient Greek armor, riding a white winged horse named Pegasus. They appear to be flying, with Perseus holding a sword and shield. The horse has smooth white fur and feathered wings.

Monument

The Monument of Dante Alighieri shows a standing figure of Dante in flowing robes, holding a book. It stands on a decorated stone plinth and is made of smooth, polished marble. Located outdoors in Padua, Italy.

Scene

A colorful flamingo stands in a tranquil pond, surrounded by green lily pads with white flowers and tall grasses. The pond water is glossy, and the flamingo has a rainbow gradient on its wings. The base is flat and brownish, creating a peaceful, natural scene.

Digital Surface

The 3D model shows the Glaciar Perito Moreno in Santa Cruz, Argentina. It features jagged mountains, a large turquoise lake, and ice formations extending into the water. The surrounding area has green grass and brown rocks. Rivers flow into the lake from the mountains. The scene is natural and isolated, highlighting the glacier's grandeur.

🎨 MARVEL-FX3D

MARVEL-FX3D is our text-to-3D pipeline. It fine-tunes Stable Diffusion 3.5 on MARVEL captions of the Objaverse dataset and uses a pretrained Stable Fast 3D model to generate a textured mesh in 15s.
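
The two-stage structure can be sketched as a composition of a text-to-image step and an image-to-mesh step. The callables below are hypothetical stand-ins for the fine-tuned Stable Diffusion 3.5 model and the pretrained Stable Fast 3D model; no real model API is used, and the byte-based interfaces are assumptions chosen to keep the sketch runnable.

```python
import time
from typing import Callable, Tuple

TextToImage = Callable[[str], bytes]    # stand-in for the fine-tuned text-to-image model
ImageToMesh = Callable[[bytes], bytes]  # stand-in for the pretrained image-to-mesh model


def text_to_3d(
    prompt: str,
    text_to_image: TextToImage,
    image_to_mesh: ImageToMesh,
) -> Tuple[bytes, float]:
    """Run both stages and return (mesh bytes, elapsed seconds)."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    start = time.perf_counter()
    image = text_to_image(prompt)  # stage 1: caption -> image
    mesh = image_to_mesh(image)    # stage 2: image -> textured mesh
    return mesh, time.perf_counter() - start


# Toy stand-ins so the sketch runs without any model weights.
def fake_text_to_image(prompt: str) -> bytes:
    return prompt.encode("utf-8")


def fake_image_to_mesh(image: bytes) -> bytes:
    return b"glTF" + image  # binary glTF (GLB) files begin with the magic bytes "glTF"


mesh, seconds = text_to_3d("A wooden crate with a flat lid.", fake_text_to_image, fake_image_to_mesh)
```

Keeping the stages separate means only the first one needs fine-tuning on captions, while the second can stay a frozen, pretrained model.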

🧊 Text-to-3D Generation

A traditional spear with a decorative spearhead, a smooth metallic shaft, and a teardrop-shaped gem. The spearhead has three elongated blades and a spherical section with small protrusions. It is primarily metallic blue with a cyan or turquoise gem.

A rocky landscape with a large, dark blue, jagged rock and smaller, smooth gray boulders. A small, bright cyan dot is in the upper part.

The Fusion Epilog Laser Cutter is a rectangular machine with a robust design. It has a transparent blue cover over the cutting area, a hinged lid for access, and a control panel with buttons and a screen. The base plate supports the material being cut, and side panels provide access for loading and ventilation.

The 3D model is a segment of an old stone wall with irregularly shaped stones. It has a rough, uneven texture with moss and plants growing on top. The colors are mostly gray and beige with green moss patches. The wall looks weathered and aged, suggesting a historic structure.

The 3D model is of a pair of traditional Japanese throwing knives called kunai. Each has a long, pointed blade and a cylindrical handle with a circular ring and a small loop. The surface is smooth and matte, and the model is uniformly white.

A tall palm tree with a slender trunk and a wide base, topped with large, feather-like fronds. Small rocks and plants surround the base, set on sandy ground. The trunk is brown, and the fronds are green.

A wooden crate with a flat lid, four sides, a bottom, and metal reinforcement bars. It has visible wood grain and dark, aged metal. One side features green graffiti text.

A destroyed 2S3 Akatsiya cannon, elongated and cylindrical, with a curved barrel and angular turret. Heavily damaged, rusted, and charred, set in a desolate, debris-strewn battlefield.

BibTeX