Processing Videos for GPT-4o and Search

Published: 21 May 2024
on channel: James Briggs

Recent multi-modal models like OpenAI's gpt-4o and Google's Gemini 1.5 can comprehend video. When feeding video into these new models, we can push frames at a set frequency (for example, one frame every second), but this method can be wildly inefficient and expensive.

Fortunately, there is a better method called "semantic chunking." Semantic chunking is a common method in text-based Retrieval-Augmented Generation (RAG), and we can apply the same logic to video using image embedding models. By embedding video frames and measuring the similarity between consecutive frames, we can split a video based on the semantic meaning of its constituent frames.

In this video, we'll take two test videos and chunk them into semantic blocks.
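As a rough reference, the core idea looks something like the sketch below (illustrative only, not the exact code from the repo linked below). It assumes OpenCV for frame extraction and a CLIP image encoder from Hugging Face transformers, and it starts a new chunk whenever the cosine similarity between consecutive frame embeddings drops below a threshold.

```python
# Minimal sketch of similarity-based video chunking (assumed libraries:
# opencv-python, transformers, Pillow; threshold value is illustrative).
import cv2
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def extract_frames(path: str, every_n: int = 30) -> list[Image.Image]:
    """Grab one frame every `every_n` frames (~one per second at 30 fps)."""
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        i += 1
    cap.release()
    return frames

def embed(frames: list[Image.Image]) -> np.ndarray:
    """Embed frames with CLIP and L2-normalise so dot product = cosine similarity."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs).detach().numpy()
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def chunk(frames: list[Image.Image], threshold: float = 0.9) -> list[list[Image.Image]]:
    """Start a new chunk when consecutive frames fall below the similarity threshold."""
    embs = embed(frames)
    chunks, current = [], [frames[0]]
    for prev, curr, frame in zip(embs[:-1], embs[1:], frames[1:]):
        if float(np.dot(prev, curr)) < threshold:
            chunks.append(current)
            current = []
        current.append(frame)
    chunks.append(current)
    return chunks

# Usage (hypothetical file name):
# chunks = chunk(extract_frames("test_video.mp4"))
```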

📌 Code:
https://github.com/aurelio-labs/seman...

📖 Article:
https://www.aurelio.ai/learn/video-ch...

⭐ Repo:
https://github.com/aurelio-labs/seman...

👋🏼 AI Consulting:
https://aurelio.ai

👾 Discord:
  / discord  

Twitter:   / jamescalam  
LinkedIn:   / jamescalam  

#ai #artificialintelligence #openai

00:00 Semantic Chunking
00:24 Video Chunking and gpt-4o
01:59 Video Chunking Code
03:28 Setting up the Vision Transformer
05:56 ViT vs. CLIP and other models
06:40 Video Chunking Results
08:37 Using CLIP for Vision Chunking
11:29 Final Conclusion on Video Processing

