Recycle-GAN: Unsupervised Video Retargeting
Aayush Bansal
Shugao Ma
Deva Ramanan
Yaser Sheikh
[Paper]
[Code]


Our approach for video retargeting, applied to faces and flowers. The top row shows translation from John Oliver to Stephen Colbert. The bottom row shows a synthesized flower following the blooming process of the input flower.


Abstract

We introduce a data-driven approach for unsupervised video retargeting that translates content from one domain to another while preserving the style native to the target domain, i.e., if the content of John Oliver's speech were transferred to Stephen Colbert, the generated content/speech should be in Stephen Colbert's style. Our approach combines spatial and temporal information along with adversarial losses for content translation and style preservation. In this work, we first study the advantages of using spatiotemporal constraints over spatial constraints for effective retargeting. We then demonstrate the proposed approach on problems where information in both space and time matters, such as face-to-face translation, flower-to-flower translation, wind and cloud synthesis, and sunrise and sunset.
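At the heart of the approach is a recycle loss that ties the two generators together through a temporal predictor: past frames are translated to the other domain, the next frame is predicted there, and the prediction is mapped back and compared with the true next frame. Below is a minimal PyTorch sketch of this idea; the names G_X, G_Y, P_Y, the channel-wise frame stacking, and the L1 norm are illustrative assumptions, not the released implementation.

import torch
import torch.nn.functional as F

def recycle_loss(x_past, x_next, G_X, G_Y, P_Y):
    # x_past: list of past frames from domain X, each of shape (B, C, H, W)
    # x_next: ground-truth next frame in domain X
    # G_Y: generator X -> Y; G_X: generator Y -> X
    # P_Y: temporal predictor of the next frame in domain Y
    y_past = [G_Y(x) for x in x_past]       # translate past frames to Y
    y_pred = P_Y(torch.cat(y_past, dim=1))  # predict the next frame in Y
    x_pred = G_X(y_pred)                    # map the prediction back to X
    return F.l1_loss(x_pred, x_next)        # compare with the true next frame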



A. Bansal, S. Ma, D. Ramanan, Y. Sheikh
Recycle-GAN: Unsupervised Video Retargeting.
In ECCV, 2018.

[Bibtex]

Resurrecting the past

We extended Recycle-GAN to generate high-resolution videos from 2-3 second long low-resolution video clips of celebrities from the past. Here is an example of Winston Churchill narrating the famous speech delivered on June 4, 1940 in the British Parliament.

Thanks to friends at BBC Studios London for providing the data and all the help in creating this video.



Face-to-Face

We use publicly available videos of various public figures for the face-to-face translation task. We show example videos of face-to-face translation for public figures such as Martin Luther King Jr. (MLK), Barack Obama, John Oliver, and Stephen Colbert.

A one-minute video by WQED.

Generated video (512x512 resolution) of US President Donald Trump.

A version of the above video with an edited background, used in the Not-the-White-House-Correspondents-Dinner.


Another video (512x512 resolution) of US President Donald Trump.

Above is the video from the show.



The above video (512x512) shows facial retargeting from a French journalist (francetv.fr) to the French President Emmanuel Macron. The best part of the video is the last five seconds. Enjoy the video in HD, full-screen mode on your laptop!
Thanks to Louis Milanodupont from francetv.fr for sharing the data.



The above video (256x256) shows translation from John Oliver to Stephen Colbert.



We recreated a full 15-minute episode of Last Week Tonight, featuring John Oliver translated to Stephen Colbert, using the publicly available Recycle-GAN code. No cherry-picking! The 320x320 output was generated using a relatively shallow model, hence occasional jumps here and there.



The above video shows a face-to-face translation from Martin Luther King Jr. (MLK) to Barack Obama.



The above video shows a face-to-face translation from Martin Luther King Jr. (MLK) to Barack Obama. This is similar to the previous video, but with audio.



The above video (320x320) shows translation from John Oliver to Stephen Colbert.



The above video shows random examples of face-to-face translation using our approach for various public figures and characters. Without any input alignment or manual supervision, our approach captures stylistic expressions of these public figures: for example, John Oliver's dimple while smiling, the mouth shape characteristic of Donald Trump, and the mouth lines and smile of Stephen Colbert.


An additional example of Marilyn Monroe:

Thanks to friends at NBC News for help in creating this video.



We show a few comparisons of our approach with Cycle-GAN for face retargeting.









Body-to-Body

The above examples are specifically focused on faces. Here we use the same approach for body retargeting.


The above video (256x256) shows translation from MLK to Barack Obama. Fine facial details (such as mouth movement) are missing due to the low-resolution input and output. Generating high-resolution outputs could potentially recover such fine details.


Flower-to-Flower

Extending beyond faces and other traditional translations, we demonstrate our approach on flowers. We extracted time-lapses of various flowers from publicly available videos. The time-lapses show the blooming of different flowers, but without any synchronization. We use our approach to align the content, i.e., both flowers bloom or wilt together. We show a few example videos for different flowers here.



We show a comparison of our approach with Cycle-GAN on the dandelion flower. Our approach learns appropriate correspondences between the two domains.



Video Manipulation via Retargeting

We show our approach for automatic video manipulation via video retargeting in two cases: (1) synthesizing clouds and wind in videos; and (2) manipulating sunrise and sunset in different videos.

Clouds & Wind Synthesis

Our approach can synthesize a new video with a desired environmental condition, such as clouds or wind, without the physical effort of recapturing the scene. We treat the given video and video data of the desired environmental condition as the two domains in our experiment. The trained translation model is then applied to the conditioning video to generate the required output.

For this experiment, we collected video data for various wind and cloud conditions, such as a calm day or a windy day. Using our approach, we can convert a calm day to a windy day, and a windy day to a calm day, without modifying the aesthetics of the place.
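Since the generators operate frame by frame, retargeting a video at test time amounts to running the trained generator over every frame and reassembling the result. Here is a hypothetical OpenCV-based sketch; the generator G, its input convention, and the normalization to [-1, 1] are assumptions, not the released code.

import cv2
import torch

@torch.no_grad()
def retarget_video(in_path, out_path, G, size=256, device="cpu"):
    # Apply a trained frame-to-frame generator G to every frame of a video.
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (size, size))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # BGR uint8 frame -> normalized (1, C, H, W) float tensor in [-1, 1]
        x = cv2.resize(frame, (size, size)).astype("float32") / 127.5 - 1.0
        x = torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0).to(device)
        # Translate the frame and map it back to a uint8 image
        y = G(x).squeeze(0).permute(1, 2, 0).cpu().numpy()
        writer.write(((y + 1.0) * 127.5).clip(0, 255).astype("uint8"))
    cap.release()
    writer.release()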


The above video shows our attempt to simulate environmental conditions (clouds and wind on a windy day, or still clouds). Conditioning on a required setting, we can synthesize that condition without changing the aesthetics of the place.




The above video shows our attempt to simulate environmental conditions (clouds and wind in a light breeze, or still clouds). Conditioning on a required setting, we can synthesize that condition without changing the aesthetics of the place.



Sunrise & Sunset

As humans, we have a tendency to align abstract concepts, and to imagine how something would look if we were somewhere other than where we are currently making an observation. For example, a person watching a sunset in New York on the shores of the Atlantic Ocean may start imagining how a sunset would look in California by the Pacific; or a person roaming the streets of Pittsburgh may start to imagine what it would be like to roam the streets of Paris.

Inspired by this thought process, we extracted sunrise and sunset data from various web videos, and show how our approach can be used for both video manipulation and content alignment. This is similar to the setting of our experiments on cloud and wind synthesis.


The above video shows our attempt to synthesize a sunrise at a given place.



The above video shows our attempt to align the content of sunset at two different locations.





Origami Bird Learning from a Real Bird


We took the raw origami bird video from Kholgade et al.





Human and Robot


We made an attempt at learning the association between a human and a robot performing a pickup task (dataset from Sharma et al.).




A summary video created by CMU media folks.


Transferring One Video Into the Style of Another.




Acknowledgements

We thank the authors of Cycle-GAN, Pix2Pix, and OpenPose for their work; it is because of them that this work was possible. Inspired by Cycle-GAN, we name our approach Recycle-GAN. We thank the larger community that collected and uploaded videos to the web; it is because of their efforts that we could do this academic research. Many thanks to our friends and colleagues at CMU and Facebook for many discussions and suggestions; this work could cover only a tiny part of those wonderful suggestions and ideas. Thanks to Bryan Russell, Chia-Yin Tsai, Aravindh Mahendran, Tomas Pfister, Zhe Cao, Martin Li, and Siva Chaitanya for various suggestions on the text and videos. Finally, we thank the authors of Colorful Image Colorization for this webpage design.