Nvidia Reveals ‘Swiss Army Knife’ of AI Audio Tools: Fugatto
On Monday, Nvidia unveiled a groundbreaking AI model called Fugatto that can generate and transform many types of audio, including music, voices, and sounds, from prompts that combine text and audio files. Fugatto, short for Foundational Generative Audio Transformer Opus, can create music snippets from text descriptions, modify existing songs by adding or removing instruments, alter the accent or emotion in a voice, and even produce entirely new sounds.
According to Nvidia, Fugatto is the first foundational generative AI model to demonstrate emergent properties — advanced capabilities that arise from the interaction of its different trained functions. It is designed to handle a wide range of audio generation and transformation tasks, supporting free-form instructions.
Rafael Valle, Nvidia’s manager of applied audio research, explained that the goal was to create a model that understands and generates sound the way humans do. He sees Fugatto as a step toward a future in which unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale.
Nvidia highlighted that Fugatto could handle tasks it was not specifically trained for, such as generating sounds that evolve over time, like the Doppler effect of thunder during a passing rainstorm. Unlike most models, which can only replicate the data they’ve been trained on, Fugatto can create entirely new soundscapes, such as a thunderstorm transitioning into dawn with birds singing.
Industry experts, including Kaveh Vahdat, founder of RiseOpp, praised Fugatto as a significant leap in AI-driven audio technology. Unlike models that specialize in single tasks like music composition or voice synthesis, Fugatto offers a unified framework capable of addressing a wide range of audio-related functions. This versatility makes it a comprehensive tool for audio synthesis and transformation.
Vahdat further noted that Fugatto’s ability to generate and transform audio from text instructions, optionally combined with audio inputs, lets users create complex, blended outputs. For instance, it can merge a saxophone melody with the timbre of a meowing cat. Fugatto can also interpolate between instructions, giving users nuanced control over attributes like accent and emotion in voice synthesis and offering more customization than current AI audio tools.
Benjamin Lee, a professor at the University of Pennsylvania, emphasized that using both text and audio inputs may lead to more efficient and effective models, expanding the training data and capabilities of generative AI models. Fugatto’s dual-input approach positions it as a powerful advancement in AI audio technology.
Nvidia at Its Best
Mark N. Vena, president and principal analyst at SmartTech Research in Las Vegas, described Fugatto as a prime example of Nvidia’s excellence. He highlighted that Fugatto introduces groundbreaking advancements in AI audio processing, particularly in transforming existing audio into entirely new forms. For instance, it can turn a piano melody into a human vocal line or modify the emotional tone and accent of spoken words, offering unmatched flexibility in audio manipulation.
Vena also pointed out that Fugatto stands out from other AI audio tools by generating novel sounds from text descriptions, such as making a trumpet sound like a barking dog. These capabilities open up new possibilities for creators in fields like music, film, and gaming, providing innovative tools for sound design and audio editing.
Ross Rubin, principal analyst at Reticle Research, added that Fugatto’s approach to audio is comprehensive, encompassing sound effects, music, voice, and even new, never-before-heard sounds. He contrasted it with services like Suno, which generate songs but lack the creative precision of Fugatto, such as the ability to add instruments, change moods, or switch musical keys. Fugatto’s broad understanding of audio and its flexibility surpasses other specialized tools designed for tasks like voice synthesis or song generation.
Opens Doors for Creatives
Fugatto also promises to be valuable across creative industries. Vahdat noted its potential in advertising and language learning. Agencies could use it to create customized audio content tailored to specific brand identities, including voiceovers with distinct accents or emotional tones. Educational platforms, meanwhile, could develop personalized language-learning materials, like dialogues in various accents or emotional contexts, to enhance language acquisition.
Vena emphasized that Fugatto could open new doors in creative fields. Filmmakers and game developers could use it to design unique soundscapes, transforming everyday sounds into imaginative or immersive effects. Additionally, it could be used in virtual reality, assistive technologies, and education to craft personalized audio experiences, adjusting sounds to specific emotional tones or user preferences. In music production, Fugatto could help explore innovative compositions by transforming instruments or vocal styles.
However, some experts believe further development is needed before the model produces convincing musical results. Dennis Bathory-Kitsz, a musician and composer, criticized the voice isolation and added instruments as “clumsy and unmusical,” and dismissed the transformations he heard as “trivial” and “colorless.” Bathory-Kitsz warned that while Fugatto might be useful for non-musical users, it could fail to produce high-quality, musically innovative results unless its developers bring stronger musical expertise to the project.
AGI Stand-In
Though artificial general intelligence (AGI) is still a distant goal, Fugatto could serve as a model for simulating AGI, which aims to replicate or surpass human cognitive abilities across a range of tasks. Rob Enderle, president of the Enderle Group, explained that Fugatto represents a collaborative solution, combining generative AI with other tools to create a more AGI-like experience. He noted that until AGI becomes a reality, this approach will be a dominant way to build more sophisticated AI projects of greater quality and interest.