Multimodal AI: LLMs that can see (and hear)

Published: 20 November 2024
on channel: Shaw Talebi
5,362 views · 223 likes

🗞️ Get exclusive access to AI resources and project ideas: https://the-data-entrepreneurs.kit.co...
🧑‍🎓 Learn AI in 6 weeks by building it: https://maven.com/shaw-talebi/ai-buil...
--
Multimodal (Large) Language Models expand an LLM's text-only capabilities to include other modalities. Here are three ways to do this.

Resources:
📰 Blog: https://towardsdatascience.com/multim...
▶️ LLM Playlist: Fine-tuning Large Language Models (LL...
💻 GitHub Repo: https://github.com/ShawhinT/YouTube-B...

References:
[1] Multimodal Machine Learning: https://arxiv.org/abs/1705.09406
[2] A Survey on Multimodal Large Language Models: https://arxiv.org/abs/2306.13549
[3] Visual Instruction Tuning: https://arxiv.org/abs/2304.08485
[4] GPT-4o System Card: https://arxiv.org/abs/2410.21276
[5] Janus: https://arxiv.org/abs/2410.13848
[6] Learning Transferable Visual Models From Natural Language Supervision: https://arxiv.org/abs/2103.00020
[7] Flamingo: https://arxiv.org/abs/2204.14198
[8] Mini-Omni2: https://arxiv.org/abs/2410.11190
[9] Emu3: https://arxiv.org/abs/2409.18869
[10] Chameleon: https://arxiv.org/abs/2405.09818

--
Homepage: https://www.shawhintalebi.com

Introduction - 0:00
Multimodal LLMs - 1:49
Path 1: LLM + Tools - 4:24
Path 2: LLM + Adapters - 7:20
Path 3: Unified Models - 11:19
Example: LLaMA 3.2 for Vision Tasks (Ollama) - 13:24 (see the code sketch below)
What's next? - 19:58
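
For a quick taste of the LLaMA 3.2 vision example, here is a minimal sketch using the Ollama Python client. It assumes Ollama is running locally with the llama3.2-vision model already pulled, and the image filename is a placeholder; see the GitHub repo linked above for the full example code from the video.

import ollama  # pip install ollama

# Ask the locally served Llama 3.2 Vision model to describe an image.
response = ollama.chat(
    model="llama3.2-vision",
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["example_image.jpg"],  # placeholder path to a local image
        }
    ],
)

print(response["message"]["content"])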

