IROS 2024
If a picture paints a thousand words, sound may voice a million. While recent robotic painting and image synthesis methods have made progress in generating visuals from text inputs, the translation of sound into images remains largely unexplored. Sound-based interfaces and sonic interactions have the potential to expand accessibility and control for the user and to convey complex emotions and the dynamic aspects of the real world. In this paper, we propose an approach for using sound and speech to guide a robotic painting process, which we call robot synesthesia. For general sound, we encode the simulated paintings and input sounds into the same latent space. For speech, we decouple the speech into its transcribed text and its tone: we use the text to control the content and estimate emotions from the tone to guide the mood of the painting. Our approach is fully integrated with FRIDA, a robotic painting framework, adding sound and speech to FRIDA's existing input modalities, such as text and style. In two surveys, participants correctly guessed the emotion or natural sound used to generate a given painting at more than twice the rate of random chance. We discuss our results on sound-guided image manipulation and music-guided paintings qualitatively.
Robot Synesthesia Overview: A human user's artistic intentions are specified via any combination of natural sounds, speech, or existing modalities such as sketch, style, and text. Brush-stroke actions are rendered into a simulated painting; features are then extracted and compared to the input features to form loss functions. The loss is backpropagated, and gradient descent updates the actions to decrease the loss. After optimization, the brush-stroke actions are executed by a robotic arm.
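To make this render-compare-update loop concrete, below is a minimal PyTorch sketch, assuming a differentiable renderer and a CLIP-style feature extractor; `render`, `extract_features`, and `target_features` are illustrative placeholders, not FRIDA's actual API.

```python
import torch
import torch.nn.functional as F

def optimize_strokes(stroke_params, render, extract_features, target_features,
                     steps=300, lr=0.01):
    """Sketch of the paint-by-optimization loop: render stroke actions to a
    simulated canvas, compare its features to the input features, and update
    the stroke parameters by gradient descent."""
    params = stroke_params.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        canvas = render(params)                        # differentiable simulated painting
        feats = extract_features(canvas)               # e.g., a CLIP-style image embedding
        loss = 1.0 - F.cosine_similarity(feats, target_features, dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()                                # backpropagate through the renderer
        optimizer.step()                               # gradient descent on the stroke actions
    return params.detach()                             # executed by the robot arm after optimization
```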
Methodology: We encode the input sounds and speech into a latent space shared with the simulated paintings. For general sounds, we extract features from the audio signal and align them with visual features. For speech, we decouple content from emotional tone, using the transcribed text for content guidance and the tone for emotional modulation. The brush-stroke actions are optimized with gradient descent to match the intended artistic expression derived from the sound inputs.
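The sketch below illustrates one way these decoupled loss terms could be formed, assuming a hypothetical audio encoder trained into the same latent space as the painting features, a speech-to-text model, and a tone-based emotion classifier; the names `audio_encoder`, `transcribe`, `emotion_from_tone`, and `emotion_embeds` are assumptions for illustration, not the exact components used in the paper.

```python
import torch.nn.functional as F

def sound_and_speech_losses(canvas_embed, audio, speech, audio_encoder,
                            text_encoder, transcribe, emotion_from_tone,
                            emotion_embeds):
    """Form the loss terms that align a simulated painting with sound inputs.
    canvas_embed: embedding of the simulated painting in the shared space."""
    losses = {}
    if audio is not None:
        # General sound: align the painting with the sound in the shared latent space.
        audio_embed = audio_encoder(audio)
        losses["sound"] = 1.0 - F.cosine_similarity(canvas_embed, audio_embed, dim=-1).mean()
    if speech is not None:
        # Speech: decouple transcribed text (content) from tone (emotion).
        text_embed = text_encoder(transcribe(speech))
        losses["content"] = 1.0 - F.cosine_similarity(canvas_embed, text_embed, dim=-1).mean()
        emotion = emotion_from_tone(speech)            # e.g., "awe" or "amusement"
        losses["emotion"] = 1.0 - F.cosine_similarity(
            canvas_embed, emotion_embeds[emotion], dim=-1).mean()
    return losses
```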
Implementation: Our implementation integrates with the FRIDA framework, which allows for real-time robotic painting. The system takes audio inputs, processes them through our models to generate feature embeddings, and translates these features into brush-stroke actions. We use state-of-the-art techniques for feature extraction and emotion recognition so that the generated paintings accurately reflect the input sounds and emotions. FRIDA's base model introduces four other loss functions, $l_i$, that connect the paintings to the text, image, sketch, and style modalities.
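As an illustration, the combined objective can be viewed as a weighted sum over whichever modality losses are active; the helper below is a hypothetical sketch of that combination, with names and weights chosen for illustration rather than taken from FRIDA's actual configuration.

```python
import torch

def total_loss(loss_terms, weights):
    """Weighted sum over active modality losses.
    loss_terms: dict mapping a modality name (e.g., "text", "sketch", "style",
    "sound", "content", "emotion") to a scalar loss tensor.
    weights: dict mapping the same keys to user-chosen coefficients."""
    total = torch.zeros(())
    for name, value in loss_terms.items():
        total = total + weights.get(name, 0.0) * value
    return total
```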
Results: Our study demonstrates that natural sounds and emotional context can be used to generate paintings that align with the semantic and emotional aspects of the input. Using simulated paintings for evaluation, we verified that our system, S-FRIDA, produces paintings that closely resemble real-world outputs thanks to the Real2Sim2Real technique. For natural sound guidance, participants matched paintings with their corresponding audio input 43.3% of the time, significantly better than the 16.7% expected by random chance. This result reflects the system's ability to capture the essence of various natural sounds, though some sounds, such as thunder, thunderstorm, and rain, were frequently confused. For emotion guidance, our approach, trained with the ArtEmis dataset, allowed users to correctly identify the emotion behind generated paintings 26.5% of the time, compared to the 12.5% expected by chance. Notably, "Awe" was identified correctly 60% of the time, while "Amusement" was often mistaken for "Excitement." These results underscore the system's effectiveness in translating both auditory and emotional cues into meaningful visual representations.
Conclusion: Our approach, Robot Synesthesia, adds sound inputs to the FRIDA robotic painting platform. Compared to existing sound-guided image synthesis approaches, ours is more general because it does not rely on constrained pre-trained image generators such as StyleGAN. In addition, we treat natural sounds and speech separately to unlock the content and emotion carried in spoken language. While many existing works focus on robots following commands, Robot Synesthesia is a rare attempt at a robotic system that can hear a human user, understand their emotions, and help them express their ideas in visual art. By supporting audio-based interaction, our work contributes to making robotic painting systems accessible to broader user groups.