CQTAI
7/7/2025
The release of Sora drove a qualitative leap in AI video quality, making the physics in generated videos more believable and igniting the field. Startups such as Runway, Pika, Luma, Kling, Genmo, Higgsfield, and Lightricks, as well as giants including OpenAI, Google, Alibaba, and ByteDance, have all joined the race.
However, no matter how much progress was made in image quality and camera work, AI videos remained "mute": you could see characters running, jumping, even moving in slow motion, but getting them to speak, or letting you hear ambient sounds or the sizzle of a frying pan? Sorry, dubbing in post-production was still required.
Worse, post-production audio often fell out of sync: mismatched lip movements, dialogue out of step with the picture, and sound effects missing their cues, leaving the final product short on atmosphere.
On May 21, Google officially launched Veo 3, and AI video can finally "speak"! The new model not only generates HD visuals but also automatically synthesizes dialogue and sound effects from the video's pixel content, synchronized with the footage.
With a simple prompt, it instantly produces visuals + dialogue + lip-sync + sound effects - all in one go. For example, check out this performance of "We can speak now!" 👇
It can even handle complex rap segments. A simple prompt like "an elderly man discussing the universe" produces results where lip movements, rhythm, and facial expressions are all naturally connected, making it hard to distinguish from reality.
At the launch event, DeepMind CEO Demis Hassabis excitedly announced: "The era of silent AI videos is finally over! Users only need to describe characters, scenes, dialogue and tone in natural language to generate complete customized videos."
Judging from Google's official demo, Veo 3's audio-visual integration has reached near cinematic production standards. It's currently available to Google AI Ultra subscribers within the Gemini app, and enterprise users can also access it via the Vertex AI platform.
Right after the launch, netizens worldwide went wild -
Rap hits, viral clips, and cooking shows took turns trending as users unleashed their creativity with many interesting works 👇
Creative Example 1: 👉 Prompt: Two pancakes conversing while baking. The first says: "I can't believe Veo 3 can make pancakes talk now!" The second exclaims: "Wow, a talking pancake!"
Result: The pancakes not only had expressive dialogue but also perfectly synchronized lip movements.
Creative Example 2: A retro 1980s TV cooking show featuring a 65-year-old British hostess kneading dough while saying: "This is hard work..." Then the dough lifts its face and replies in a Brooklyn accent: "Hey lady, watch it, I'm trying to rise here!" Complete with authentic VHS tape texture.
Creative Example 3: Users also created a viral hit by a futuristic Russian techno singer, with even tricky rolled-R sounds smoothly reproduced.
Additionally, Google's Chief Creative Technologist personally tested Veo 3's long-video generation, using the first/last-frame control feature to produce a narrative short of over a minute. While background music had to be added manually, the dialogue and sound effects Veo 3 generated were remarkably complete.
Pros and cons 👇
Veo 3 Quick Start Tutorial
Want to try it yourself? It's simple 👇