Advancements in NVIDIA AI Research: From Neural 3D Reconstruction to Virtual Behavior Generation

Ahmed
13 min read · Jan 15, 2024


Table of Contents:

1- Introduction

2- Topic 1: Neuralangelo

3- Topic 2: Magic3D

4- Topic 3: ADMM

5- Topic 4: EUREKA

6- Topic 5: Align your Latents

7- Topic 6: eDiff-I

8- Topic 7: Conditional Adversarial Latent Models

9- Topic 8: Learning Physically Simulated Tennis Skills

10- Topic 9: FlexiCubes

11- Conclusion

1- Introduction

This compilation of cutting-edge research topics covers NVIDIA's 2023 advancements in diverse fields, from computer graphics and 3D reconstruction to generative modeling and physics-driven simulations. These topics showcase innovative solutions to complex challenges, using techniques such as neural rendering, diffusion models, and large language models.

Each research endeavor contributes to the growing landscape of artificial intelligence, offering practical applications in areas like content creation, robotics, and virtual environments.

1.1. Neuralangelo. It introduces a revolutionary system for detailed 3D surface reconstruction from RGB images using neural volume rendering.

1.2. Magic3D. It tackles limitations in text-to-3D synthesis, presenting a two-stage optimization framework to refine and generate high-resolution 3D content based on textual prompts.

1.3. ADMM. Addressing the intricacies of simulating Discrete Elastic Rods (DER) with Coulomb friction, this research introduces an optimized solver using GPU parallel computing capabilities.

1.4. EUREKA. It introduces a novel reward design algorithm empowered by Large Language Models (LLMs) for complex low-level manipulation tasks.

1.5. Align your Latents. This research introduces Latent Diffusion Models (LDMs) applied to high-resolution video generation, extending the paradigm of efficient image synthesis.

1.6. eDiff-I. It explores large-scale diffusion-based generative models for text-conditioned high-resolution image synthesis.

1.7. Conditional Adversarial Latent Models (CALM). It presents an approach for generating diverse and directable behaviors for user-controlled virtual characters.

1.8. Learning Physically Simulated Tennis Skills. This research focuses on generating realistic and diverse virtual tennis player movements by combining video annotation, low-level imitation policy, motion embedding, and high-level motion planning policy.

1.9. FlexiCubes. It introduces a novel approach to gradient-based mesh optimization, presenting a specialized isosurface representation for refining 3D surface meshes.

These research topics collectively display the forefront of AI innovation, pushing boundaries in computer graphics, generative modeling, physics simulations, and interactive virtual environments.

2- Topic 1: Neuralangelo

A research paper focused on neural surface reconstruction, a technique for recovering detailed 3D surfaces from multi-view images using image-based neural rendering. The proposed method, named Neuralangelo, combines multi-resolution 3D hash grids with neural surface rendering to improve the recovery of detailed structures in real-world scenes.

Figure 1: Neuralangelo is a system designed for achieving detailed 3D surface reconstruction from RGB images through neural volume rendering, and notably, it accomplishes this without the need for additional data like segmentation or depth information. The illustration depicts a 3D mesh of a courthouse that has been extracted using Neuralangelo.

Two key components of Neuralangelo are highlighted: the use of numerical gradients for computing higher-order derivatives and a coarse-to-fine optimization approach on the hash grids for controlling different levels of details. The method proves effective in recovering dense 3D surface structures from multi-view images without auxiliary inputs like depth, surpassing previous methods in fidelity.
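
To make the numerical-gradient idea concrete, here is a minimal sketch (not the authors' code) that estimates the gradient of an SDF network with central finite differences. The `sdf` callable and the step size `eps` are assumptions for illustration; shrinking `eps` over training plays a role analogous to the coarse-to-fine schedule on the hash grids.

```python
import torch

def numerical_gradient(sdf, x, eps=1e-3):
    """Estimate the gradient of an SDF network with central finite differences.
    sdf: callable mapping (N, 3) points to (N, 1) signed distances (assumed).
    x:   (N, 3) query points.
    eps: finite-difference step; decreasing it over training mirrors the
         coarse-to-fine control of detail described for the hash grids."""
    offsets = torch.eye(3, device=x.device) * eps  # one perturbation per axis
    grads = [(sdf(x + offsets[i]) - sdf(x - offsets[i])) / (2 * eps) for i in range(3)]
    return torch.cat(grads, dim=-1)                # (N, 3) estimated gradient

# Surface normals would then be the normalized gradient, e.g.:
# normals = torch.nn.functional.normalize(numerical_gradient(sdf_net, points), dim=-1)
```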

The applications of this approach include large-scale scene reconstruction from RGB video captures, with potential uses in augmented reality, virtual reality, mixed reality, and robotics. The paper compares Neuralangelo with classical multi-view stereo algorithms and demonstrates its advantages in handling ambiguous observations and achieving high-quality reconstructions. The proposed framework is built on the Instant NGP (Neural Graphics Primitives) representation, which introduces a hybrid 3D grid structure with multi-resolution hash encoding and a lightweight MLP, enhancing the representation power of neural fields.

The paper concludes with empirical demonstrations of the effectiveness of Neuralangelo on various datasets, highlighting improvements in both reconstruction accuracy and view synthesis quality over previous image-based neural surface reconstruction methods.

Paper: https://arxiv.org/pdf/2306.03092.pdf

Source Code: https://colab.research.google.com/drive/13u8DX9BNzQwiyPPCB7_4DbSxiQ5-_nGF

3- Topic 2: Magic3D

This research paper introduces a method called Magic3D, addressing limitations in the DreamFusion text-to-image diffusion model for optimizing Neural Radiance Fields (NeRF) in text-to-3D synthesis. DreamFusion suffers from slow NeRF optimization and low-resolution image space supervision, resulting in time-consuming and low-quality 3D models. Magic3D overcomes these challenges with a two-stage optimization framework.

Figure 2: Magic3D operates by producing high-resolution 3D content based on a given text prompt through a gradual refinement process. In the initial phase, a low-resolution diffusion prior is employed to optimize neural field representations encompassing color, density, and normal fields, resulting in the creation of a coarse model. Subsequently, a textured 3D mesh is differentiably extracted from the density and color fields of this coarse model. The next step involves fine-tuning using a high-resolution latent diffusion model. Following this optimization process, the model excels at generating high-quality 3D meshes endowed with intricate textures.

Initially, it obtains a coarse model using a low-resolution diffusion prior and a sparse 3D hash grid structure. This coarse representation serves as an initialization for further optimization, leading to a textured 3D mesh model using an efficient differentiable renderer and a high-resolution latent diffusion model. Magic3D achieves faster (2× compared to DreamFusion) and higher-resolution 3D mesh model generation, demonstrated through user studies where 61.7% of raters prefer Magic3D. The method empowers users with enhanced control over 3D synthesis, presenting new possibilities for creative applications.

The research emphasizes the growing demand for 3D digital content across various fields, and it highlights the challenges associated with 3D content creation, suggesting that integrating natural language processing could democratize and expedite the process. The paper draws parallels between the progress in image content creation from text prompts and the slower development in 3D content generation due to limited datasets. DreamFusion’s approach is discussed, emphasizing its limitations in handling high-frequency geometric and texture details and its inefficient optimization process.

In contrast, Magic3D proposes a coarse-to-fine optimization strategy using diffusion priors at different resolutions, enabling the synthesis of detailed 3D models within a reduced computation time. The method is explained in detail, including the use of a hash grid for the coarse representation and a switch to optimizing mesh representations in the second stage, using an efficient differentiable rasterizer for real-time rendering of high-resolution images.
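
The control flow of that coarse-to-fine strategy can be sketched as follows. This is a hypothetical skeleton, not the released implementation: the renderers, score-distillation losses, and parameter sets are all assumed to be supplied by the caller.

```python
import torch

def magic3d_coarse_to_fine(prompt, render_coarse, render_fine,
                           sds_loss_low, sds_loss_high,
                           coarse_params, fine_params,
                           coarse_steps=5000, fine_steps=3000, lr=1e-2):
    """Hypothetical skeleton of a Magic3D-style two-stage optimization.
    Assumed callables:
      render_coarse(): low-resolution volume rendering of the hash-grid neural field
      render_fine():   high-resolution differentiable rasterization of the textured mesh
      sds_loss_*(img, prompt): score-distillation loss from a diffusion prior
    """
    # Stage 1: optimize a coarse neural field against a low-resolution diffusion prior.
    opt = torch.optim.Adam(coarse_params, lr=lr)
    for _ in range(coarse_steps):
        opt.zero_grad()
        image = render_coarse()
        sds_loss_low(image, prompt).backward()
        opt.step()

    # Stage 2: refine the textured mesh extracted from stage 1 against a
    # high-resolution latent diffusion prior, via a fast differentiable rasterizer.
    opt = torch.optim.Adam(fine_params, lr=lr)
    for _ in range(fine_steps):
        opt.zero_grad()
        image = render_fine()
        sds_loss_high(image, prompt).backward()
        opt.step()
```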

Paper: https://arxiv.org/pdf/2211.10440.pdf

4- Topic 3: ADMM

The research revolves around the development of a specialized solver for simulating Discrete Elastic Rods (DER) with Coulomb friction, optimized to fully harness the parallel computing capabilities of modern GPUs. The focus is on validating the simulator’s accuracy by reproducing analytical results from various experiments, including cantilever, bend–twist, and stick–slip scenarios. The enhanced solver significantly reduces iteration times, enabling real-time simulations of high-resolution hair with complex frictional interactions. The achievement of handling assemblies of several thousand elastic rods in real-time opens up possibilities for new workflows, particularly in interactive physics-based editing of digital grooms.

Figure 3: Instantaneous, physics-driven editing session involving 86,000 curves in a groom with 10% simulated guides. The process utilizes the Discrete Elastic Rods model with Coulomb friction. The sequence from left to right includes the initial groom, a pose exhibiting sagging, selection and manipulation of hair strands, trimming the selected strands, performing the same actions on the screen-left side, and finally, the resulting edited groom.

The paper discusses the challenges posed by the complex nature of hair dynamics and the need for incorporating physical information into digital grooming to ensure predictable dynamic behavior. Unlike traditional grooming methods based solely on geometric manipulation, the proposed approach involves manipulating rods within a simulation to embed both geometric and rest pose information, facilitating more realistic dynamic outcomes.

The key innovation lies in overcoming computational limitations by using modern GPU capabilities, allowing for a leap in performance compared to CPU-based solvers and paving the way for interactive physics-based hair simulation and grooming at high strand counts. The presented solver employs a carefully designed timestep incremental problem objective, mapping efficiently to parallel GPU architectures and offering a substantial reduction in computation times from multiple days to a few hours.
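
For context, the solver's name refers to the Alternating Direction Method of Multipliers. Below is the textbook scaled-form ADMM iteration that such solvers build on, written generically; it is not NVIDIA's GPU-specialized rod-and-friction solver, and the proximal operators `prox_f` and `prox_g` are assumed callables.

```python
import numpy as np

def admm(prox_f, prox_g, x0, rho=1.0, iters=200):
    """Textbook scaled-form ADMM for min f(x) + g(z) subject to x = z.
    prox_f and prox_g are user-supplied proximal operators (assumed); in a rod
    simulator, the f-step would solve per-rod elasticity subproblems in parallel
    on the GPU while the g-step projects onto contact/friction constraints."""
    x = np.asarray(x0, dtype=float).copy()
    z = x.copy()
    u = np.zeros_like(x)               # scaled dual variable
    for _ in range(iters):
        x = prox_f(z - u, rho)         # local subproblem (embarrassingly parallel)
        z = prox_g(x + u, rho)         # global/constraint step
        u = u + x - z                  # dual ascent on the consensus constraint
    return x
```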

Paper: https://research.nvidia.com/publication/2023-08_interactive-hair-simulation-gpu-using-admm

5- Topic 4: EUREKA

The research paper introduces EUREKA, a novel reward design algorithm empowered by Large Language Models (LLMs), such as GPT-4, to address the challenge of applying LLMs to learn complex low-level manipulation tasks, specifically dexterous pen spinning.

Figure 4: EUREKA utilizes the unaltered source code of the environment and a language-based task description as context to generate executable reward functions seamlessly from a coding LLM. The process involves a continuous loop of reward sampling, GPU-accelerated evaluation, and reflection to iteratively enhance the quality of its reward outputs.

Unlike existing approaches that require substantial domain expertise or only handle simple skills, EUREKA uses the remarkable capabilities of LLMs for zero-shot generation, code-writing, and in-context improvement to perform evolutionary optimization over reward code. This allows EUREKA to autonomously generate reward functions without task-specific prompting or pre-defined templates, outperforming expert human-engineered rewards on 83% of tasks in a suite of 29 open-source Reinforcement Learning (RL) environments.

The contributions of EUREKA include achieving human-level performance on reward design across various robot morphologies, solving dexterous manipulation tasks like pen spinning, and introducing a gradient-free in-context learning approach to reinforcement learning from human feedback (RLHF). EUREKA can incorporate various forms of human inputs to improve the quality and safety of generated rewards without model updating. The algorithm’s generality is highlighted by its ability to operate without task-specific prompts or templates, and it significantly surpasses existing LLM-based approaches due to its capacity to generate and refine free-form, expressive reward programs.

EUREKA’s algorithmic design choices include considering the environment’s source code as context, employing evolutionary search for reward candidates, and utilizing reward reflection for in-context improvement. The paper outlines the methodology, demonstrating EUREKA’s effectiveness through experiments, showcasing its zero-shot reward generation and in-context improvement. The algorithm’s scalability is emphasized, utilizing GPU-accelerated distributed reinforcement learning for efficient evaluation of intermediate rewards.
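
A minimal sketch of that loop, with assumed helper signatures (`llm` returns candidate reward-function strings, `train_rl` trains a policy and returns a fitness score), might look like this. It illustrates the sample-evaluate-reflect cycle rather than the released Eureka code.

```python
def eureka_loop(llm, train_rl, env_source, task_desc,
                iterations=5, samples_per_iter=16):
    """Illustrative EUREKA-style evolutionary reward search.
    llm(prompt, n): returns n candidate reward-function strings (assumed).
    train_rl(code): trains a policy with that reward via GPU-accelerated RL
                    and returns a fitness score (assumed)."""
    best_code, best_score, reflection = None, float("-inf"), ""
    for _ in range(iterations):
        # Condition the coding LLM on the raw environment source, the task
        # description, and the reflection from the previous round.
        prompt = f"{env_source}\n\nTask: {task_desc}\n\n{reflection}"
        candidates = llm(prompt, n=samples_per_iter)
        # Evaluate every sampled reward function by actually training a policy.
        scored = [(train_rl(code), code) for code in candidates]
        score, code = max(scored)
        if score > best_score:
            best_score, best_code = score, code
        # Reward reflection: feed training statistics back to guide the next
        # round of in-context improvement (simplified to one line here).
        reflection = f"The best reward so far scored {best_score:.3f}; improve on it."
    return best_code
```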

Paper: https://arxiv.org/pdf/2310.12931.pdf

Source Code: https://github.com/eureka-research/Eureka

6- Topic 5: Align your Latents

This research paper introduces Latent Diffusion Models (LDMs) and applies them to the challenging task of high-resolution video generation. LDMs are known for enabling high-quality image synthesis with reduced computational demands by training a diffusion model in a compressed latent space. The authors propose Video LDMs, extending the paradigm to handle computationally intensive video generation tasks.

Figure 5: Top: In the refinement of the temporal decoder, a fixed encoder was utilized to independently process video frames, ensuring consistent reconstructions across frames. A discriminator designed for video comprehension is also employed. Bottom: In Latent Diffusion Models (LDMs), a diffusion model is trained within a latent space. It generates latent features, which are subsequently converted into images through the decoder. It is important to note that the visualization at the bottom pertains to individual frames.

The approach involves pre-training LDMs on images and subsequently transforming the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model. This temporal dimension is trained on encoded image sequences (videos), while the pre-trained spatial layers remain fixed. The method also fine-tunes the LDM’s decoder to achieve temporal consistency in pixel space and temporally aligns pixel-space and latent DM upsamplers for enhanced spatial resolution.
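
As a rough illustration of the "add temporal layers, freeze spatial layers" recipe, the sketch below shows a temporal attention block initialized as an identity mapping, plus a helper that freezes everything except temporal parameters. The layer sizes and naming convention are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalAttentionBlock(nn.Module):
    """Illustrative temporal layer interleaved between frozen spatial layers,
    roughly in the spirit of Video LDMs (sizes and naming are assumptions)."""
    def __init__(self, channels, num_heads=8):
        super().__init__()
        # channels is assumed to be divisible by num_heads.
        self.temporal_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))  # zero init: block starts as the identity,
                                                   # so the pre-trained image model is preserved

    def forward(self, x):
        # x: (batch * height * width, frames, channels) — features rearranged so that
        # attention mixes information across the time axis only.
        mixed, _ = self.temporal_attn(x, x, x)
        return x + self.alpha * mixed              # learned blend with the frozen spatial pathway

def train_only_temporal_layers(unet):
    """Freeze the pre-trained image LDM weights; only temporal parameters learn."""
    for name, param in unet.named_parameters():
        param.requires_grad = "temporal" in name
```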

The paper focuses on two real-world applications: simulating in-the-wild driving data for autonomous driving and creative content creation with text-to-video modeling. The proposed Video LDMs achieve state-of-the-art performance, generating high-resolution, long and coherent videos efficiently. The research also demonstrates the application of Video LDMs to text-to-video generation, showcasing personalized content creation possibilities.

Key contributions of the paper include:

‣ Efficient training of high-resolution, long-term consistent video generation models based on LDMs, using pre-trained image DMs and introducing temporal layers for alignment.

‣ State-of-the-art high-resolution video synthesis on real driving scene videos, producing multiple-minute-long videos.

‣ Transformation of the publicly available Stable Diffusion text-to-image LDM into a powerful text-to-video LDM.

‣ Demonstration of the compatibility of learned temporal layers with different image model checkpoints.

The paper explores new avenues for efficient digital content creation and autonomous driving simulation through the application of LDMs to high-resolution video generation.

Paper: https://arxiv.org/pdf/2304.08818.pdf

7- Topic 6: eDiff-I

This research paper discusses large-scale diffusion-based generative models applied to text-conditioned high-resolution image synthesis. These models, which operate by gradually denoising random noise while conditioning on textual prompts, have demonstrated remarkable proficiency in comprehending complex text and exhibiting outstanding zero-shot generalization. The key innovation of such text-to-image diffusion models lies in their ability to iteratively synthesize images, starting from random noise and relying on textual cues.

Figure 6: Prompt switching during iterative denoising impacts image generation. By changing the input text from Prompt #1 to Prompt #2 at specific denoising steps, varying effects were observed on the output images. Transition percentages of 0%, 7%, 30%, 60%, and 100% are illustrated from left to right. In the last 7% of denoising, the text input seems to have no visible impact, indicating reduced utilization of the text prompt. The third output reflects influences from both prompts, while the fourth output suggests the text input in the initial 40% is overridden by the latter 60%. These results highlight the denoiser’s differential use of text input at different noise levels.

A notable finding highlighted in the paper is the qualitative evolution of synthesis behavior throughout the iterative generation process. Early in the sampling process, the model heavily depends on the provided text prompt to generate content aligned with the textual description. However, as the generation progresses, there is a discernible shift where the model gradually disregards the input text conditioning, focusing more on producing outputs with high visual fidelity. This observation challenges the conventional approach of sharing model parameters uniformly across the entire generation process, prompting the proposal of a novel methodology.

In contrast to existing practices, the paper introduces the concept of an ensemble of text-to-image diffusion models, each specialized for distinct synthesis stages. The ensemble, referred to as eDiff-I, aims to enhance text alignment, maintain inference efficiency, and preserve high visual quality. This approach involves initially training a single model, subsequently splitting it into specialized models tailored for specific stages, and progressively finetuning them. The proposed ensemble demonstrates superior performance compared to previous large-scale text-to-image diffusion models on standard benchmarks.
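
Conceptually, inference with such an ensemble just routes each denoising step to the expert responsible for that noise interval. The sketch below illustrates the routing; the number of experts, the equal-width intervals, and the denoiser call signature are assumptions for illustration.

```python
def ediff_style_denoise_step(experts, x_t, t, text_emb, total_steps=1000):
    """Route one denoising step to the expert denoiser for the current noise level.
    experts: list of callables ordered from high noise (early, text-driven steps)
             to low noise (late, fidelity-driven steps)."""
    n = len(experts)
    # t counts down from total_steps - 1 (pure noise) to 0 (clean image).
    idx = min(n - 1, (total_steps - 1 - t) * n // total_steps)
    return experts[idx](x_t, t, text_emb)
```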

Paper: https://arxiv.org/pdf/2211.01324.pdf

Video: https://www.youtube.com/watch?v=WbaVvlgxbl4

8- Topic 7: Conditional Adversarial Latent Models

The research paper introduces Conditional Adversarial Latent Models (CALM) as an approach for generating diverse and directable behaviors for user-controlled interactive virtual characters.

Figure 7: The system allows users to guide the actions of a virtually simulated character by utilizing demonstrations represented through low-dimensional latent embeddings derived from motion capture data. In this illustration, the character receives instructions to crouch-walk toward a specified target, execute a kick when in proximity, and ultimately lift its arms in a celebratory gesture.

CALM employs imitation learning to capture the complexity and diversity of human motion, enabling direct control over character movements. The method learns a control policy and a motion encoder simultaneously, avoiding mere replication of given motions. CALM’s results demonstrate a semantic motion representation, allowing control over generated motions and style-conditioning for higher-level task training. The character, once trained, can be controlled through intuitive interfaces, similar to those found in video games.

The paper addresses challenges in creating realistic and diverse behaviors for virtual characters, emphasizing the need for control models that can generate complex and realistic behaviors, adapting to different environments and user inputs. Previous work focused on imitating single motion clips or increasing the diversity of generated motion, but often resulted in a loss of control over the generated motion.

CALM distinguishes itself by focusing on unsupervised techniques, using unlabeled data without assuming prior knowledge of semantic connections between motions. It jointly learns a meaningful semantic representation of skills and a control policy capable of producing selected skills. The method allows the generation of diverse movements while resembling the distributional characteristics of the motion data.
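
A heavily simplified sketch of one joint update might look like the following; the encoder, policy, discriminator, and simulator are assumed callables, and the least-squares adversarial objective is an illustrative stand-in rather than the exact loss used in the paper.

```python
def calm_update(encoder, policy, discriminator, motion_clip, simulate):
    """One heavily simplified CALM-style joint update (all callables assumed).
    encoder(clip)        -> latent z summarizing the reference motion
    simulate(policy, z)  -> physics-simulated rollout conditioned on z
    discriminator(m, z)  -> score of how well motion m matches the encoded clip
    """
    z = encoder(motion_clip)                    # low-dimensional motion embedding
    rollout = simulate(policy, z)               # latent-conditioned control policy in the simulator
    real_score = discriminator(motion_clip, z)  # reference data should be judged consistent with z
    fake_score = discriminator(rollout, z)      # the rollout tries to fool the conditional critic
    # Least-squares adversarial terms, used purely as a stand-in for the
    # paper's conditional imitation objective.
    disc_loss = (real_score - 1.0) ** 2 + fake_score ** 2
    policy_reward = -((fake_score - 1.0) ** 2)  # higher when the rollout resembles the clip
    return disc_loss, policy_reward
```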

The contributions of CALM include:

‣ Jointly training a generative motion controller and a motion encoder from unlabeled motion capture data, allowing directed motion generation.

‣ Introducing precision training, a method to reuse the pretrained policy and use similarity within the learned latent space for control over produced motion when solving high-level tasks.

‣ Combining the above steps to design simple Finite State Machines (FSMs) for solving tasks without further training or meticulous design of reward functions or termination conditions.

The related work section discusses physics-constrained motion generation, differentiating between direct prediction and latent-based control. CALM builds upon latent-based control methods, focusing on learning a dense representation of human motion jointly with a directable latent-conditioned policy.

Paper: https://research.nvidia.com/labs/par/calm/assets/SIGGRAPH2023_CALM.pdf

Source Code: https://github.com/NVlabs/CALM

9- Topic 8: Learning Physically Simulated Tennis Skills

This research paper focuses on developing a system for generating realistic and diverse virtual tennis player movements. Here are the key components and processes outlined in the paper:

a. Video Annotation. The researchers utilized specific tennis matches from the US Open to extract a dataset of tennis motions. Thirteen matches from various years were selected for this purpose.

b. Low-Level Imitation Policy

  • Network Structure: The low-level policy is modeled by a neural network that maps a state s to a Gaussian distribution over actions π(a|s) (a minimal sketch of such a policy network follows after this list).
  • Rewards: Manual specification of weights and scales for different factors influencing the rewards during training.
  • Training: The low-level policy is trained with a considerable number of environments using datasets like AMASS and a kinematic motion dataset extracted from tennis videos.
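
Referenced above, here is a minimal sketch of such a Gaussian policy head in PyTorch; the hidden width and the dimensions in the usage comment are assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Minimal MLP mapping a state s to a Gaussian over actions pi(a|s);
    the hidden width and the state-independent log-std are assumptions."""
    def __init__(self, state_dim, action_dim, hidden=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.backbone(state)
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())

# Example usage (dimensions are placeholders, not values from the paper):
# policy = GaussianPolicy(state_dim=128, action_dim=32)
# action = policy(torch.randn(1, 128)).rsample()
```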

c. Motion Embedding

  • Network Structure: The encoder is a three-layer feed-forward neural network, and the decoder uses a mixture-of-expert (MoE) architecture.
  • Training: The model is trained with scheduled sampling, a technique where the predicted pose is sometimes used as input for the next time step, and other times, the ground-truth pose is used.

d. High-Level Motion Planning Policy

  • Network Structure: Similar architecture to the low-level policy.
  • Ball Trajectory Prediction Model: A model is used to estimate future ball trajectories, providing observations for the high-level policy.
  • Rewards: Manual specification of scales for different factors influencing rewards during training.
  • Training: The high-level policy is trained with a substantial number of environments using a curriculum-based approach with three stages.

Paper: https://research.nvidia.com/labs/toronto-ai/vid2player3d/data/tennis_skills_supp.pdf

Source Code: https://github.com/nv-tlabs/vid2player3d

10- Topic 9: FlexiCubes

This research addresses gradient-based mesh optimization, focusing on iteratively refining 3D surface meshes represented as isosurfaces of scalar fields.

Figure 9: FlexiCubes is a high-quality isosurface representation tailor-made for gradient-based mesh optimization against geometric, visual, or even physical objectives. An extensive quality assessment showcases that FlexiCubes enhances outcomes across various applications.

Such optimization is common in applications like photogrammetry and generative modeling, but existing methods repurpose traditional isosurface extraction algorithms and struggle when the mesh itself is the unknown being optimized. The proposed approach, FlexiCubes, introduces a specialized isosurface representation designed for optimizing arbitrary meshes concerning geometric, visual, or physical goals. By incorporating carefully chosen parameters, FlexiCubes allows flexible local adjustments to the mesh geometry and connectivity during optimization.

The method utilizes Dual Marching Cubes for improved topological properties and extends to generate tetrahedral and hierarchically-adaptive meshes. Extensive experiments demonstrate the effectiveness of FlexiCubes in improving mesh quality and geometric fidelity for various applications. The research contributes to the field of differentiable mesh generation and presents a valuable tool for automatic high-quality mesh generation in diverse contexts.
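
The optimization loop such a representation enables is conceptually simple. Below is a generic sketch in which `extract_mesh` stands in for a differentiable FlexiCubes-style extraction and `mesh_loss` for any geometric, visual, or physical objective; both callables, and the use of Adam, are assumptions for illustration.

```python
import torch

def optimize_isosurface(extract_mesh, mesh_loss, sdf, weights, iters=500, lr=1e-2):
    """Generic gradient-based isosurface optimization loop.
    extract_mesh(sdf, weights) -> (verts, faces): a differentiable extraction,
        standing in for FlexiCubes' dual-marching-cubes-based extractor (assumed).
    mesh_loss(verts, faces) -> scalar: any geometric, visual, or physical objective (assumed).
    sdf, weights: learnable tensors (scalar field values and per-cell flexibility
        parameters) created with requires_grad=True.
    """
    opt = torch.optim.Adam([sdf, weights], lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        verts, faces = extract_mesh(sdf, weights)  # differentiable w.r.t. sdf and weights
        loss = mesh_loss(verts, faces)             # e.g. rendered-image or chamfer distance loss
        loss.backward()                            # gradients flow back into the scalar field
        opt.step()
    return sdf, weights
```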

Paper: https://arxiv.org/pdf/2308.05371.pdf

Source Code: https://github.com/nv-tlabs/FlexiCubes

11- Conclusion

The diverse research topics discussed signify ongoing efforts in artificial intelligence and computer science. These explorations introduce new methodologies and systems across computer graphics, generative modeling, physics simulation, and interactive virtual environments.

Neuralangelo’s system for 3D surface reconstruction without extra data and Magic3D’s text-to-3D synthesis offer promising impacts on photogrammetry and content creation. Large language models, as seen in EUREKA, mark a shift in autonomous reward design, surpassing human-engineered rewards in reinforcement learning tasks. In physics simulations, ADMM enables real-time editing of digital grooms, while FlexiCubes tackles gradient-based mesh optimization. These advancements promise flexible solutions applicable across various domains.

Latent Diffusion Models in high-resolution video generation and eDiff-I in text-conditioned image synthesis emphasize iterative refinement and specialized ensembles. Conditional Adversarial Latent Models (CALM) offer a novel perspective on generating diverse behaviors for virtual characters through unsupervised techniques.

In summary, these research endeavors collectively underscore the relentless pursuit of excellence and exploration within artificial intelligence. The synergy of diverse methodologies, creative thinking, and interdisciplinary collaboration propels us toward an era where AI not only replicates but also innovates, pushing the boundaries of what is conceivable. These pioneering works not only contribute to academia but also set the stage for transformative applications that will shape the future of technology and human-machine interactions.
