How Alibaba’s Stunning AI Video Generator Outperforms Sora

Alibaba’s new AI video generator, EMO, can turn still images into realistic and expressive actors and singers. See how it compares to OpenAI’s Sora.

EMO vs Sora: A new level of AI video generation

Image Source: Alibaba

Have you ever wondered what it would be like to see the Sora lady sing? Well, Alibaba has made it possible with its new AI video generator, EMO, which can turn any still image of a face into a lively and convincing performer.

EMO, short for Emote Portrait Alive, pushes AI video generation forward in a way that OpenAI's Sora, the widely publicized system that builds photorealistic video worlds from text prompts, does not. While Sora excels at generating breathtaking scenes and landscapes, its characters remain silent and motionless. EMO, in contrast, makes characters speak and sing, with realistic facial expressions and precise lip sync.

EMO’s amazing demos: From Dua Lipa to Audrey Hepburn

Alibaba, the Chinese e-commerce giant, has released a paper and a GitHub repository showcasing the new technology. In one demo video, EMO transforms the Sora lady, known for wandering through an AI-generated Tokyo after a rainstorm, into a Dua Lipa fan who sings along to “Don’t Start Now”.

Another demo shows EMO making Audrey Hepburn, the iconic actress, deliver the audio from a viral clip of Riverdale’s Lili Reinhart confessing her love of crying. Hepburn’s face matches not only the words but also the emotions and nuances of the original speaker.

EMO is not just a simple face-swapping technique, like the ones that gave rise to deepfakes a few years ago. It relies on neither 3D models nor existing videos of the target face, only a single still image. It can also handle different languages and accents, such as English and Korean, producing mouth movements that accurately track the input audio.

How Alibaba’s AI Video Generator works

According to the paper, EMO is trained on a large dataset of audio and video to learn how faces move when people speak and sing. It uses a diffusion-based approach: generation starts from random noise, which the model gradually refines into video frames, without relying on intermediate representations such as 3D face models or facial landmarks.
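To make the idea concrete, here is a minimal, hypothetical sketch of such a diffusion sampling loop in Python. The `denoiser` network, the tensor shapes, and the simplified update rule are illustrative assumptions, not Alibaba’s actual code:

```python
import torch

def generate_frames(reference_image, audio_features, denoiser,
                    num_steps=50, num_frames=16):
    """Illustrative diffusion loop: start from noise and iteratively
    denoise into video frames, conditioned on a single reference image
    and audio. `denoiser` stands in for a trained backbone network
    (hypothetical API)."""
    # Latent video: one noisy tensor per frame (channels x height x width).
    frames = torch.randn(num_frames, 4, 64, 64)

    for step in reversed(range(num_steps)):
        t = torch.full((num_frames,), step)
        # The network predicts the noise in each frame, guided by the
        # reference image (identity) and the audio features (lip sync).
        predicted_noise = denoiser(frames, t, reference_image, audio_features)
        # Remove a fraction of the predicted noise. Real samplers such as
        # DDIM use scheduler-specific coefficients; this is simplified.
        frames = frames - predicted_noise / num_steps

    return frames  # in practice, a VAE decoder turns latents into RGB frames
```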

EMO also employs two attention mechanisms, one attending to the reference image and one to the audio, so that the facial animation stays consistent with both the appearance and the speech of the target face. The result is a seamless, expressive video that looks like the real person talking or singing.
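A rough sketch of how such dual conditioning might look, assuming standard cross-attention layers, follows below. The class name, dimensions, and layer arrangement are assumptions for illustration, not the paper’s implementation:

```python
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """Illustrative dual conditioning: one cross-attention over
    reference-image features (identity) and one over audio features
    (speech and motion)."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, frame_tokens, ref_tokens, audio_tokens):
        # Reference attention keeps the generated face consistent with
        # the identity in the still image.
        x = frame_tokens + self.ref_attn(self.norm1(frame_tokens),
                                         ref_tokens, ref_tokens)[0]
        # Audio attention aligns expressions and lip movements with
        # the speech signal.
        x = x + self.audio_attn(self.norm2(x), audio_tokens, audio_tokens)[0]
        return x
```

The appeal of cross-attention here is that it injects both conditioning signals into every frame without changing the shape of the frame features, so the two constraints, identity and speech, can be enforced simultaneously throughout generation.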

The implications of EMO: A double-edged sword

EMO shows what AI video generation can now do, and it opens up exciting possibilities for fields like entertainment, education, and communication. Imagine making famous personalities or historical figures say or sing whatever you like, or creating virtual avatars that both look and sound like you. Tools like EMO hold real potential to transform these industries.

However, EMO also raises ethical and social concerns, especially around the privacy and consent of the people whose faces the system animates. It could be used to create fake or misleading videos that damage someone’s reputation or credibility, or to manipulate the emotions and opinions of viewers.

Therefore, EMO should be used with caution and responsibility, and with respect for the rights and dignity of the people whose faces are being animated. EMO is a remarkable innovation, but it is also a double-edged sword that could have positive or negative consequences depending on how it is used.
