Nemotron 3 Nano Omni allows agents to "see and hear" in real time

Nvidia unveiled Nemotron 3 Nano Omni, an open multimodal AI model that integrates vision, audio, and language capabilities into a unified architecture.

The model aims to address inefficiencies in current enterprise AI systems, which often rely on fragmented pipelines. It accepts a range of inputs (text, images, audio, video, documents, charts, and graphical interfaces) and generates text output.

Built on a 30-billion-parameter hybrid mixture-of-experts architecture, Nemotron 3 Nano Omni activates approximately 3 billion parameters per inference. Nvidia claims that it provides the knowledge capacity of larger models while significantly reducing compute costs.
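As a rough illustration of the arithmetic behind those figures, the sketch below computes the share of parameters active per inference and shows, in toy form, how a mixture-of-experts router might pick a few experts per token. The parameter counts come from the article; the expert count, router scores, and top-k value are hypothetical and do not describe Nvidia's actual routing.

```python
# Illustrative sketch only. Parameter counts are the article's figures;
# everything about the router below is a hypothetical toy example.
TOTAL_PARAMS = 30e9   # reported total parameters
ACTIVE_PARAMS = 3e9   # reported approximate parameters active per inference


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for a single inference."""
    return active / total


def top_k_experts(scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"Active share per inference: {fraction:.0%}")

# Hypothetical router scores for 8 experts; only the top 2 would run.
router_scores = [0.1, 0.7, 0.05, 0.9, 0.3, 0.2, 0.6, 0.4]
chosen = top_k_experts(router_scores, k=2)
print(f"Experts selected for this token: {chosen}")
```

In other words, only about a tenth of the model's weights would do work on any given inference, which is the basis for the claim of lower compute cost relative to a dense model of the same size.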

Nvidia stated that Nemotron 3 Nano Omni achieves up to nine times higher throughput than comparable open omni models. For video reasoning tasks, it offers roughly three times higher throughput with 2.75 times lower compute requirements, backed by a 256K-token context window. The model reportedly leads six benchmarks spanning complex document intelligence and video and audio understanding.

Notable adopters of the model include Foxconn, Palantir, and H Company. “Utilizing the Nemotron 3 Nano Omni allows our agents to swiftly analyze full HD screen recordings, a capability that was previously unfeasible,” said Gautier Cloix, CEO of H Company.

Dell, Oracle, and Infosys are currently assessing the model for potential adoption. Nemotron 3 Nano Omni is available on Hugging Face, OpenRouter, Amazon SageMaker JumpStart, and Vultr, as well as on more than 25 partner platforms. It ships with open weights, datasets, and training recipes for deployment across various environments.

This model is part of Nvidia’s broader Nemotron 3 family, which includes Super and Ultra models designed for more intensive reasoning tasks. The Nemotron 3 series has reached over 50 million downloads in the past year.
