IROS 2024
If a picture paints a thousand words, sound may voice a million. While recent robotic painting and image synthesis methods have made progress in generating visuals from text inputs, the translation of sound into images remains largely unexplored. Sound-based interfaces and sonic interactions have the potential to expand accessibility and control for the user and to convey complex emotions and the dynamic aspects of the real world. In this paper, we propose an approach for using sound and speech to guide a robotic painting process, which we call robot synesthesia. For general sound, we encode the simulated paintings and input sounds into the same latent space. For speech, we decouple the speech into its transcribed text and its tone: we use the text to control the content and estimate emotions from the tone to guide the mood of the painting. Our approach is fully integrated with FRIDA, a robotic painting framework, adding sound and speech to FRIDA's existing input modalities, such as text and style. In two surveys, participants correctly guessed the emotion or natural sound used to generate a given painting at more than twice the rate of random chance. We discuss our results on sound-guided image manipulation and music-guided paintings qualitatively.
Robot Synesthesia Overview: A human user's artistic intentions are specified via any combination of natural sounds, speech, or existing modalities such as sketch, style, and text. Brush-stroke actions are rendered into a simulated painting; features are then extracted and compared to the input features to form loss functions. The loss is backpropagated, and gradient descent updates the actions to decrease the loss. After optimization, the brush-stroke actions are executed by a robotic arm.
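To make this render-compare-update loop concrete, below is a minimal PyTorch sketch, assuming a differentiable renderer and a CLIP-style feature extractor; `render`, `extract_features`, and `target_features` are illustrative placeholders, not FRIDA's actual API.

```python
import torch
import torch.nn.functional as F

def optimize_strokes(stroke_params, render, extract_features, target_features,
                     steps=300, lr=0.01):
    """Sketch of the paint-by-optimization loop: render stroke actions to a
    simulated canvas, compare its features to the input features, and update
    the stroke parameters by gradient descent."""
    params = stroke_params.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        canvas = render(params)                        # differentiable simulated painting
        feats = extract_features(canvas)               # e.g., a CLIP-style image embedding
        loss = 1.0 - F.cosine_similarity(feats, target_features, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()                                # backpropagate through the renderer
        optimizer.step()                               # gradient descent on the stroke actions
    return params.detach()                             # executed by the robot arm after optimization
```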
Methodology: We encode the input sounds and speech into a latent space shared with the simulated paintings. For general sounds, we extract features from the audio signal and align them with visual features. For speech, we decouple content from emotional tone, using the transcribed text for content guidance and the tone for emotional modulation. The brush-stroke actions are optimized with gradient descent to match the intended artistic expression derived from the sound inputs.
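The sketch below illustrates one way these decoupled loss terms could be formed, assuming a hypothetical audio encoder trained into the same latent space as the painting features, a speech-to-text model, and a tone-based emotion classifier; the names `audio_encoder`, `transcribe`, `emotion_from_tone`, and `emotion_embeds` are assumptions for illustration, not the exact components used in the paper.

```python
import torch.nn.functional as F

def sound_and_speech_losses(canvas_embed, audio, speech, audio_encoder,
                            text_encoder, transcribe, emotion_from_tone,
                            emotion_embeds):
    """Form the loss terms that align a simulated painting with sound inputs.
    canvas_embed: embedding of the simulated painting in the shared space."""
    losses = {}
    if audio is not None:
        # General sound: align the painting with the sound in the shared latent space.
        audio_embed = audio_encoder(audio)
        losses["sound"] = 1.0 - F.cosine_similarity(canvas_embed, audio_embed, dim=-1).mean()
    if speech is not None:
        # Speech: decouple transcribed text (content) from tone (emotion).
        text_embed = text_encoder(transcribe(speech))
        losses["content"] = 1.0 - F.cosine_similarity(canvas_embed, text_embed, dim=-1).mean()
        emotion = emotion_from_tone(speech)            # e.g., "awe" or "amusement"
        losses["emotion"] = 1.0 - F.cosine_similarity(
            canvas_embed, emotion_embeds[emotion], dim=-1).mean()
    return losses
```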
Implementation: Our implementation integrates with the FRIDA framework, which allows for real-time robotic painting. The system takes audio inputs, processes them through our models to generate feature embeddings, and translates these features into brush-stroke actions. We use state-of-the-art techniques for feature extraction and emotion recognition so that the generated paintings accurately reflect the input sounds and emotions. FRIDA's base model introduces four other loss functions, $l_i$, that connect the paintings to the text, image, sketch, and style modalities.
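As an illustration, the combined objective can be viewed as a weighted sum over whichever modality losses are active; the helper below is a hypothetical sketch of that combination, with names and weights chosen for illustration rather than taken from FRIDA's actual configuration.

```python
import torch

def total_loss(loss_terms, weights):
    """Weighted sum over active modality losses.
    loss_terms: dict mapping a modality name (e.g., "text", "sketch", "style",
    "sound", "content", "emotion") to a scalar loss tensor.
    weights: dict mapping the same keys to user-chosen coefficients."""
    total = torch.zeros(())
    for name, value in loss_terms.items():
        total = total + weights.get(name, 0.0) * value
    return total
```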
Results: Our study demonstrates that natural sounds and emotional context can be used to generate paintings that align with the semantic and emotional aspects of the input. Using simulated paintings for evaluation, we verified that our system, S-FRIDA, produces paintings that closely resemble real-world outputs thanks to the Real2Sim2Real technique. For natural sound guidance, participants matched paintings with their corresponding audio input 43.3% of the time, significantly better than the 16.7% expected by random chance. This result reflects the system's ability to capture the essence of various natural sounds, though some sounds, such as thunder, thunderstorm, and rain, were frequently confused. For emotion guidance, our approach, trained with the ArtEmis dataset, allowed users to correctly identify the emotion behind generated paintings 26.5% of the time, compared to the 12.5% expected by chance. Notably, "Awe" was identified correctly 60% of the time, while "Amusement" was often mistaken for "Excitement." These results underscore the system's effectiveness in translating both auditory and emotional cues into meaningful visual representations.
Conclusion: Our approach, Robot Synesthesia, adds sound inputs to the FRIDA robotic painting platform. Compared to existing sound-guided image synthesis approaches, ours is more general because it does not rely on constrained pre-trained image generators such as StyleGAN. In addition, we treat natural sounds and speech separately to unlock the content and emotion carried in spoken language. While many existing works focus on robots following commands, Robot Synesthesia is a rare attempt at a robotic system that can hear a human user, understand their emotions, and help them express their ideas in visual art. By supporting audio-based interaction, our work contributes to making robotic painting systems accessible to broader user groups.