LDM3D, the industry's first generative AI model with depth mapping
Potential for content creation, metaverse, and digital experience innovation
One area that has made significant progress in recent years is computer vision, and generative AI in particular. Most of today's advanced generative AI models, however, are limited to generating 2D images; Intel has drawn attention by releasing a model that can generate 3D images.
Intel Labs today unveiled a new diffusion model, the Latent Diffusion Model for 3D (LDM3D), developed in collaboration with Blockade Labs, that uses generative artificial intelligence (AI) to create photorealistic 3D visual content.
LDM3D is the first model in the industry to use a diffusion process to create depth maps, producing vivid and immersive 360-degree 3D images. LDM3D has the potential to transform a variety of industries, from content creation to metaverse applications to digital experiences, including entertainment, gaming, architecture, and design.
Unlike traditional diffusion models, which typically generate only a 2D RGB image from a text prompt, LDM3D generates both an image and a depth map from a given prompt. Using roughly the same number of parameters as latent Stable Diffusion, LDM3D provides more accurate relative depth for each pixel in the image than standard post-processing methods for depth estimation.
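Because Intel released LDM3D publicly through Hugging Face, the joint image-and-depth generation can be tried with the `diffusers` library. The following is a minimal sketch: the pipeline class and checkpoint name reflect the public release as best known, but the exact API surface is an assumption that may vary across `diffusers` versions.

```python
# Sketch: generating an RGB image plus a depth map from a single text
# prompt with LDM3D via Hugging Face diffusers. The pipeline class and
# checkpoint name ("Intel/ldm3d-4c") are assumptions based on Intel's
# public release and may differ across diffusers versions.
def generate_rgb_and_depth(prompt: str, device: str = "cpu"):
    from diffusers import StableDiffusionLDM3DPipeline  # pip install diffusers

    pipe = StableDiffusionLDM3DPipeline.from_pretrained("Intel/ldm3d-4c")
    pipe = pipe.to(device)

    # One forward pass of a single model yields both modalities,
    # which is what saves memory and latency versus two models.
    output = pipe(prompt)
    return output.rgb[0], output.depth[0]  # PIL images


if __name__ == "__main__":
    rgb, depth = generate_rgb_and_depth(
        "a tranquil tropical beach at sunset, 360-degree panorama",
        device="cuda",  # or "cpu"
    )
    rgb.save("beach_rgb.png")
    depth.save("beach_depth.png")
```

The depth image can then be paired with the RGB panorama in a renderer (as DepthFusion does with TouchDesigner) to produce an interactive 360-degree view.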
The images and depth maps generated by LDM3D let users turn text descriptions into detailed 360-degree panoramas of tranquil tropical beaches, modern skyscrapers, or sci-fi worlds. Because each scene captures so much information, this capability instantly enhances realism and immersion, enabling innovative applications across industries ranging from entertainment, gaming, interior design, real estate listings, and virtual museums to immersive virtual reality (VR) experiences.
LDM3D also won the Best Poster Award at the 3DMV Workshop held at CVPR (Conference on Computer Vision and Pattern Recognition) on the 20th.
LDM3D was trained on a 10,000-sample subset of the LAION-400M database, which contains over 400 million image-caption pairs. The team annotated the training corpus with the Dense Prediction Transformer (DPT) large depth-estimation model previously developed at Intel Labs, which provides accurate relative depth for each pixel in an image. The LAION-400M dataset was built for research purposes so that researchers and the interested community can test model training at scale.
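The annotation step described above can be sketched with the DPT-Large checkpoint Intel published on Hugging Face. The class and checkpoint names below match the public `transformers` release, but the preprocessing details are illustrative assumptions rather than Intel's exact training pipeline:

```python
# Sketch: producing a per-pixel relative-depth annotation for one image
# with Intel's DPT-Large model from Hugging Face transformers -- the same
# model family Intel Labs used to label the LAION-400M subset. The exact
# resizing/normalization here is an illustrative assumption.
def estimate_depth(image):
    """Return a relative-depth map (numpy array) for a PIL image."""
    import torch
    from transformers import DPTImageProcessor, DPTForDepthEstimation

    processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
    model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")

    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        predicted_depth = model(**inputs).predicted_depth  # shape (1, H', W')

    # Resize the prediction back to the input resolution.
    depth = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),
        size=image.size[::-1],  # PIL size is (W, H); interpolate wants (H, W)
        mode="bicubic",
        align_corners=False,
    ).squeeze()
    return depth.numpy()
```

Running a function like this over every image in the training subset yields the (image, caption, depth map) triples LDM3D learns from.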
The LDM3D model was trained on an Intel AI supercomputer powered by Intel Xeon processors and Intel Habana Gaudi AI accelerators. The resulting model and pipeline combine the generated RGB image with a depth map to create an immersive 360-degree view.
To demonstrate the potential of LDM3D, Intel and Blockade researchers developed DepthFusion, which leverages standard 2D RGB photography and depth maps to create immersive, interactive 360-degree viewing experiences.
DepthFusion leverages TouchDesigner, a node-based visual programming language for real-time multimedia content, to transform text prompts into interactive, immersive digital experiences. The LDM3D model is a single model that generates both RGB images and depth maps, saving memory space and improving latency.
LDM3D and DepthFusion provide a foundation for further advances in multifaceted generative AI and computer vision. Intel will continue researching uses of generative AI that enhance human capabilities, and building a strong open-source AI research and development ecosystem so that more people can use the technology. LDM3D is provided as open source through Hugging Face as part of Intel's efforts to support the open ecosystem in AI; researchers can further improve the system and tailor it to their own applications.
“The goal of generative AI is to enhance human creativity while saving time,” said Vasudev Lal, scientist for artificial intelligence and machine learning at Intel Labs. “However, most generative AI models today are limited to generating 2D images, and only a few can generate 3D images from text.”
“Unlike traditional latent Stable Diffusion models, LDM3D allows us to generate both an image and a depth map from a given text prompt using roughly the same number of parameters,” he said. “It provides more accurate relative depth for each pixel in the image than standard post-processing methods for depth estimation, saving developers significant time.”