Meta's new Make-a-Video AI can generate quick movie clips from text prompts
It's like Dall-E 2, but more animated.
In July, Meta unveiled Make-a-Scene, its text-to-image generation AI, which, like Dall-E and Midjourney, uses machine learning algorithms (and massive databases of scraped online artwork) to create fantastical depictions of written prompts. On Thursday, Meta CEO Mark Zuckerberg revealed Make-a-Scene's more animated counterpart, Make-a-Video.
As its name implies, Make-a-Video is "a new AI system that lets people turn text prompts into brief, high-quality video clips," Zuckerberg wrote in a Meta blog post Thursday. Functionally, Video works the same way that Scene does, relying on a mix of natural language processing and generative neural networks to convert non-visual prompts into images; it just produces its output in a different format.
"Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage," a team of Meta researchers wrote in a research paper published Thursday morning. Doing so enabled the team to reduce the amount of time needed to train the Video model and eliminate the need for paired text-video data, while preserving "the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models."
As with most of Meta's AI research, Make-a-Video is being released as an open-source project. "We want to be thoughtful about how we build new generative AI systems like this," Zuckerberg noted. "We are openly sharing this generative AI research and results with the community for their feedback, and will continue to use our responsible AI framework to refine and evolve our approach to this emerging technology."
As with seemingly every generative AI that is released, the potential for misuse of Make-a-Video is considerable. To get ahead of any nefarious shenanigans, the research team preemptively scrubbed the Make-a-Video training dataset of NSFW imagery and toxic phrasing.