Recent multi-modal models like OpenAI's gpt-4o and Google's Gemini 1.5 can comprehend video. When feeding video into these models, we can push frames at a set frequency (for example, one frame every second) — but this method can be wildly inefficient and expensive.
Fortunately, there is a better method called "semantic chunking." Semantic chunking is a common technique in text-based Retrieval-Augmented Generation (RAG), and we can apply the same logic to video using image embedding models. By measuring the similarity between frame embeddings, we can split videos based on the semantic meaning of their constituent frames.
In this video, we'll take two test videos and chunk them into semantic blocks.
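As a rough illustration of the idea (not the exact code from the repo linked below): sample frames from a video, embed each frame with an image embedding model like CLIP, and start a new chunk whenever the cosine similarity between consecutive frame embeddings drops below a threshold. The model name, sampling rate, file name, and 0.9 threshold here are illustrative assumptions.

import cv2
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP image encoder (any image embedding model would work here)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def sample_frames(path: str, every_n: int = 30):
    """Yield one frame (as a PIL image) every `every_n` frames."""
    cap = cv2.VideoCapture(path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            yield Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()

def embed(image: Image.Image) -> torch.Tensor:
    """Embed a frame and L2-normalise so a dot product gives cosine similarity."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)

chunks, current = [], []
prev_emb = None
for frame in sample_frames("video.mp4"):  # assumed local test video
    emb = embed(frame)
    # Start a new chunk when similarity to the previous frame drops below 0.9
    if prev_emb is not None and (emb @ prev_emb.T).item() < 0.9:
        chunks.append(current)
        current = []
    current.append(frame)
    prev_emb = emb
if current:
    chunks.append(current)

print(f"Split video into {len(chunks)} semantic chunks")

Each resulting chunk is a run of visually and semantically similar frames, so you can send one representative frame per chunk (or a handful) to gpt-4o instead of every sampled frame.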
📌 Code:
https://github.com/aurelio-labs/seman...
📖 Article:
https://www.aurelio.ai/learn/video-ch...
⭐ Repo:
https://github.com/aurelio-labs/seman...
👋🏼 AI Consulting:
https://aurelio.ai
👾 Discord:
/ discord
Twitter: / jamescalam
LinkedIn: / jamescalam
#ai #artificialintelligence #openai
00:00 Semantic Chunking
00:24 Video Chunking and gpt-4o
01:59 Video Chunking Code
03:28 Setting up the Vision Transformer
05:56 ViT vs. CLIP and other models
06:40 Video Chunking Results
08:37 Using CLIP for Vision Chunking
11:29 Final Conclusion on Video Processing