Will Sora disrupt Hollywood?
Text-to-video generation tools, like Sora, hold immense potential, but...
Introduction
As 2022 drew to a close, OpenAI took the world by storm with ChatGPT and its revolutionary text generation capabilities. Now, in 2024, OpenAI is doing it again with Sora and its video generation capabilities. Will Sora disrupt Hollywood, as Tyler Perry warns? Based on what we have seen from tools like Midjourney and Sora, it is clear that video generation tools have immense potential to disrupt movie production, but the current generation of these tools has limitations that make them less useful in movie production workflows.
Diffusion models and their limitations
Video generation tools like Sora use a diffusion model for tasks such as generating and editing videos. While diffusion models achieve impressive results when generating visual content, they struggle with making edits to videos they have already generated. One likely reason concerns the fine-tuning and specificity of edits: diffusion models operate by iteratively denoising the pixels of an image or video frame toward patterns learned from training data. This process produces high-quality results, but it poses challenges when users want precise, localized edits, especially ones that deviate significantly from the training data.
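To make that iterative process concrete, here is a minimal sketch of a DDPM-style sampling loop. Everything in it is a stand-in: the noise-prediction network `model` and the schedule tensors are hypothetical placeholders, and Sora's actual architecture is not public.

```python
# Minimal DDPM-style sampling loop (a sketch, not Sora's method).
# `model`, `alphas`, `alphas_cumprod`, and `betas` are hypothetical
# stand-ins for a trained noise predictor and its noise schedule.
import torch

@torch.no_grad()
def sample(model, alphas, alphas_cumprod, betas, shape=(1, 3, 64, 64), steps=1000):
    x = torch.randn(shape)  # generation starts from pure noise
    for t in reversed(range(steps)):
        # The network predicts the noise component of the current frame.
        eps = model(x, torch.tensor([t]))
        # Remove a fraction of the predicted noise (the "denoising" update).
        coef = betas[t] / torch.sqrt(1.0 - alphas_cumprod[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        # Every step except the last injects fresh randomness.
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # the whole image/frame is synthesized from scratch
```

Notice that every pixel of the frame is rewritten at every step; there is no natural place in this loop to say "change only the car, keep everything else."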
For example, if a user wants to remove an object from a generated video, the diffusion model might struggle if the object's appearance or context is significantly different from what it has learned during training. Similarly, making complex edits such as altering scene composition or adding new elements may require additional manual intervention or post-processing to achieve the desired result.
There are a few key reasons why diffusion models can struggle with making precise changes to existing images:
Training Process: Diffusion models are trained by taking a clean image, corrupting it with noise, and learning to reverse that corruption; at generation time they start from pure noise and "denoise" it step-by-step until it resembles a realistic image (see the training sketch after this list). This process essentially teaches the model how to build an image from scratch, not how to manipulate an existing one.
Lack of Compositionality: Diffusion models often lack "compositionality," meaning they struggle to understand how individual elements in an image relate to each other and how changing one might affect the overall image. They excel at generating complete images from text prompts but struggle to modify specific elements within an existing image.
Stochasticity: Diffusion models rely on randomness during generation, so the output can vary significantly even for the same prompt (see the seed demo after this list). This makes it difficult to control specific changes and can lead to unintended alterations elsewhere in the image.
Technical Challenges: Techniques that allow for finer control over specific image elements within diffusion models are still under development. These methods often require additional training stages or specialized architectures, adding complexity to the model.
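To illustrate the first point, a single training step in a DDPM-style model looks roughly like the sketch below (again with a hypothetical `model` and noise schedule). The model only ever learns to predict the noise that was added to a clean image; nothing in the objective resembles a targeted edit.

```python
import torch
import torch.nn.functional as F

def training_step(model, clean_image, alphas_cumprod, steps=1000):
    # Pick a random timestep and corrupt the clean image with Gaussian noise.
    t = torch.randint(0, steps, (1,))
    noise = torch.randn_like(clean_image)
    noisy = (torch.sqrt(alphas_cumprod[t]) * clean_image
             + torch.sqrt(1.0 - alphas_cumprod[t]) * noise)
    # The model is trained only to recover the added noise, i.e. to
    # reverse corruption -- never to apply a specific edit to an image.
    pred = model(noisy, t)
    return F.mse_loss(pred, noise)
```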
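The stochasticity point is easy to reproduce with an off-the-shelf image model. This sketch uses the Hugging Face diffusers library; the checkpoint name is just an example, and Sora itself is not publicly runnable this way.

```python
import torch
from diffusers import StableDiffusionPipeline

# Example checkpoint; substitute any Stable Diffusion model you have access to.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a red vintage car parked on a city street"

# Same prompt, different seeds: the initial latent noise differs, so the
# two outputs can vary significantly in composition and detail.
image_a = pipe(prompt, generator=torch.Generator("cuda").manual_seed(1)).images[0]
image_b = pipe(prompt, generator=torch.Generator("cuda").manual_seed(2)).images[0]

# Fixing the seed makes a run reproducible, but any change to the prompt
# or sampler settings can still alter the entire image at once.
image_c = pipe(prompt, generator=torch.Generator("cuda").manual_seed(1)).images[0]
```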
Conclusion
Researchers are actively exploring ways to improve diffusion models and to develop alternative approaches for video editing. These include inpainting for videos, which fills specified regions with content consistent with the surroundings, and techniques that directly manipulate existing video elements. A sketch of one inpainting-style approach appears below.
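As one example of such a direction, here is a sketch of a RePaint-style inpainting loop applied to a single frame. The model and noise schedule are hypothetical stand-ins, and a real video method would also need to enforce temporal consistency across frames.

```python
import torch

@torch.no_grad()
def inpaint(model, frame, mask, alphas, alphas_cumprod, betas, steps=1000):
    """Fill the masked region (mask == 1) of `frame`, RePaint-style.
    `model`, `alphas`, `alphas_cumprod`, and `betas` are hypothetical
    stand-ins for a trained diffusion model and its noise schedule."""
    x = torch.randn_like(frame)
    for t in reversed(range(steps)):
        # Re-anchor the known pixels to the forward-noised trajectory of
        # the original frame so the fill stays consistent with its context.
        noise = torch.randn_like(frame)
        known = (torch.sqrt(alphas_cumprod[t]) * frame
                 + torch.sqrt(1.0 - alphas_cumprod[t]) * noise)
        x = mask * x + (1.0 - mask) * known
        # Standard denoising update on the whole frame.
        eps = model(x, torch.tensor([t]))
        coef = betas[t] / torch.sqrt(1.0 - alphas_cumprod[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        step_noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * step_noise
    return mask * x + (1.0 - mask) * frame
```

The key idea is that only the masked region is truly synthesized; the known region is repeatedly pinned back to the original frame, which is what keeps the fill consistent with its surroundings.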
While the current generation of video generation tools may not be ideal for precise video edits, their ability to generate videos from text remains a significant advancement. As the technology progresses, we can expect better tools for both video creation and editing, potentially overcoming the limitations of today's diffusion models.