Video Capability for Google's Gemini 2.5

Google has done it (again!). The Gemini 2.5 model is truly impressive, especially its multimodal capability. Its ability to understand audio and video content is amazing, even groundbreaking. This is probably thanks to the vast dataset available from YouTube, which has likely been instrumental in its training.

Multimodal Mastery

Gemini 2.5 combines visual context, audio context, and text recognition all in one. It has a context window of 1 million tokens and a knowledge cutoff of January 2025. One million tokens is roughly equivalent to one hour of video. Longer videos won't fit, but most videos fall within this range, which is plenty to showcase the model's capabilities.
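To see where the "one hour" figure comes from, here is a rough back-of-the-envelope sketch. The per-second costs below (one sampled frame per second at 258 tokens, plus 32 tokens per second of audio) are assumptions for illustration, not numbers from my tests, so check the current Gemini documentation before relying on them.

```python
# Rough token budget for video input.
# All three rates below are ASSUMED for illustration; they are not
# measured in this article and may differ by model and media resolution.
TOKENS_PER_FRAME = 258          # assumed cost of one sampled frame
FRAMES_PER_SECOND = 1           # assumed sampling rate
AUDIO_TOKENS_PER_SECOND = 32    # assumed cost of one second of audio
CONTEXT_WINDOW = 1_000_000      # the 1 million token limit from the text


def tokens_per_second() -> int:
    """Tokens consumed by one second of video under the assumed rates."""
    return TOKENS_PER_FRAME * FRAMES_PER_SECOND + AUDIO_TOKENS_PER_SECOND


def video_tokens(seconds: float) -> int:
    """Estimated tokens for a video of the given length in seconds."""
    return round(seconds * tokens_per_second())


def max_video_minutes(reserved_for_text: int = 0) -> float:
    """Longest video (in minutes) that fits after reserving prompt/answer tokens."""
    usable = CONTEXT_WINDOW - reserved_for_text
    return usable / tokens_per_second() / 60


if __name__ == "__main__":
    print(f"10-minute video: ~{video_tokens(600):,} tokens")
    print(f"Longest video that fits: ~{max_video_minutes():.1f} minutes")
```

Under these assumed rates the ceiling lands a little under an hour, and it shrinks further once you reserve room for the prompt and the model's answer.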

8 Killer Use Cases

I spent some time experimenting with Gemini 2.5, and its reasoning abilities blew me away. Here are eight standout use cases that showcase its potential:

1. Counting Occurrences in Video

In one experiment, I tested Gemini 2.5 with a video of an assassination attempt on then-candidate Donald Trump. Could the model accurately count the number of shots fired? This task might sound trivial, but earlier AI models often struggled with simple counting tasks (like identifying the number of "R"s in the word "strawberry").

Gemini 2.5 nailed it! It correctly identified each sound, listed the timestamps where the sounds occurred, and counted eight shots, providing both visual and audio analysis to back up its answer. This demonstrates not only its ability to process multimodal inputs but also its capacity for precise reasoning, which is a major leap forward for AI systems.
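If you want to double-check a count like this yourself, one option is to ask the model for a structured list of timestamped events and do the counting locally. The sketch below assumes you prompted the model to reply with a JSON array of objects with "timestamp" (MM:SS) and "label" fields. That reply format, the function names, and the sample reply are all hypothetical and made up for illustration, not actual model output.

```python
import json


def mmss_to_seconds(stamp: str) -> int:
    """Convert an 'MM:SS' timestamp into whole seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


def count_events(reply_text: str, label: str) -> tuple[int, list[int]]:
    """Count events with the given label and return their times in seconds.

    Assumes reply_text is a JSON array of {"timestamp": "MM:SS", "label": str}
    objects (a hypothetical format you would have to request in the prompt).
    """
    events = json.loads(reply_text)
    times = sorted(
        mmss_to_seconds(event["timestamp"])
        for event in events
        if event["label"] == label
    )
    return len(times), times


# Made-up sample reply for illustration only; not real model output.
sample_reply = """
[
  {"timestamp": "00:05", "label": "clap"},
  {"timestamp": "00:06", "label": "clap"},
  {"timestamp": "01:09", "label": "speech"}
]
"""

if __name__ == "__main__":
    total, when = count_events(sample_reply, "clap")
    print(f"{total} events at seconds {when}")
```

Counting in code rather than trusting the model's own total gives you a quick sanity check: if the listed timestamps and the stated count disagree, you know to look closer.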

2. Identifying Background Music and Movie Name

Have you ever heard a song playing in the background of a video and wished you could identify it? Gemini 2.5 can do just that! Acting like an advanced version of Shazam, it analyzes the audio track embedded in a video and identifies the background music. I am also not a big fan of people posting shorts without specifying the movie name. Gemini 2.5 solves that problem too: no more searching for the movie name!

3. OCR Text Recognition

Gemini 2.5 excels at Optical Character Recognition (OCR), extracting text from images or videos with precision. I asked the model to convert one of Khan Academy's handwritten visuals into a table, and the text was copied precisely from the video into a neat little table!

4. Listen to Foreign News Media

The model can translate from one language to another, and the translations are good. I tested it on a recent official statement from Thai officials about an earthquake in Bangkok and on the latest news from a Marathi news channel. In both cases the model translated the content correctly and produced a news synopsis in the language of your choice.

5. Cricket Analysis

The model can analyze cricket videos and provide detailed insights about player performance, shot selection, and game strategy. I tested it with a video of Virat Kohli's innings, and it could break down his technique, identify the types of shots played, and even suggest areas for improvement.

6. Emotion Detection

Gemini 2.5 can analyze facial expressions and voice tone to detect emotions in videos. This is particularly useful for content creators who want to understand audience engagement or for researchers studying human behavior.

7. Video Summarization

The model can generate concise summaries of long videos, making it easier to get the key points without watching the entire content. This is particularly useful for educational videos, lectures, and lengthy presentations.
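For videos that run past the roughly one-hour limit, one possible workaround (I did not test this) is to summarize the video in segments and then combine the partial summaries. The helper below only does the arithmetic of splitting a duration into windows. The function name, the segment length, and the overlap are all hypothetical choices for illustration.

```python
def split_into_segments(
    duration_s: int, max_segment_s: int, overlap_s: int = 0
) -> list[tuple[int, int]]:
    """Split a video of duration_s seconds into (start, end) windows.

    Each window is at most max_segment_s long. Consecutive windows share
    overlap_s seconds so a sentence cut at a boundary appears in both.
    """
    if max_segment_s <= 0:
        raise ValueError("max_segment_s must be positive")
    if not 0 <= overlap_s < max_segment_s:
        raise ValueError("overlap_s must be >= 0 and smaller than max_segment_s")

    segments = []
    start = 0
    while start < duration_s:
        end = min(start + max_segment_s, duration_s)
        segments.append((start, end))
        if end == duration_s:
            break  # reached the end of the video
        start = end - overlap_s  # step back to create the overlap
    return segments


if __name__ == "__main__":
    # A 2.5-hour video split into 50-minute windows with 1 minute of overlap
    # (both numbers are arbitrary examples).
    for start, end in split_into_segments(9000, 3000, overlap_s=60):
        print(f"{start:>5}s -> {end:>5}s")
```

Each window could then be sent as its own request, with a final text-only prompt merging the per-segment summaries. Whether that preserves context across segment boundaries well enough is something you would need to check on your own videos.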

8. Webinar Slide Generation

Now this blew my mind. Video webinars are made from a slide deck and a person talking through the slides. Can we reverse the process? Given a video, can we ask the AI to output the slide deck? Gemini 2.5 produced 41 slides for a Stanford webinar!

Bonus: Humor Test

Finally, I put Gemini 2.5 through a humor test using a PG-13 joke from one of my favorite YouTube channels, Mike and Joelle. I wanted to see if the model could understand adult humor and infer punchlines.

At first, the model hesitated to spell out the punchline (perhaps trying to stay appropriate?), but eventually it got there, and yes, it understood the joke perfectly!

Final Thoughts

Gemini 2.5 has made a few startups' visions obsolete by raising the bar much higher. We are living in exciting times!

References

YouTube videos used in testing:

Trump Assassination Attempt - Used for counting shots fired
Amy Cuddy's TED Talk - Used for slide generation
Kohli's Innings - Used for cricket analysis
ICC T20 WORLD CUP 2024 - Used for tracking the score and counting fours and sixes
Stanford Webinar - Used for reverse engineering the slide deck
Khan Academy - Used for OCR text recognition
Mike and Joelle - Used for humor test

Follow VideoToBe.com's page!