Proximal Policy Optimization (PPO) - How to train Large Language Models

Channel: Serrano.Academy

Reinforcement Learning with Human Feedback (RLHF) is a method used for training Large Language Models (LLMs). At the heart of RLHF lies a very powerful reinforcement learning method called Proximal Policy Optimization (PPO). Learn about it in this simple video!

This is the first in a series of three videos dedicated to the reinforcement learning methods used to train LLMs.

Full Playlist:    • RLHF for training Language Models  

Video 0 (Optional): Introduction to deep reinforcement learning    • A friendly introduction to deep reinf...  
Video 1 (This one): Proximal Policy Optimization
Video 2: Reinforcement Learning with Human Feedback    • Reinforcement Learning with Human Fee...  
Video 3 (Coming soon!): Deterministic Policy Optimization

00:00 Introduction
01:25 Gridworld
03:10 States and Actions
04:01 Values
07:30 Policy
09:39 Neural Networks
16:14 Training the value neural network (Gain)
22:50 Training the policy neural network (Surrogate Objective Function)
33:38 Clipping the surrogate objective function
36:49 Summary
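
For reference, the clipped surrogate objective covered in the 22:50 and 33:38 chapters can be sketched in a few lines of code. This is a minimal illustration under my own assumptions, not the exact implementation from the video; the names (new_log_probs, old_log_probs, advantages) and the epsilon value 0.2 are placeholders chosen here.

import torch

def clipped_surrogate_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    # Probability ratio r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped term: ratio times the estimated advantage (the "gain")
    unclipped = ratio * advantages
    # Clipped term: the ratio is kept within [1 - epsilon, 1 + epsilon]
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the elementwise minimum keeps the policy update "proximal";
    # the mean over the batch is the objective to maximize
    return torch.min(unclipped, clipped).mean()

The policy network is then updated by gradient ascent on this objective (or descent on its negative), while the value network is trained separately to predict the gains used as advantages.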

Get the Grokking Machine Learning book!
https://manning.com/books/grokking-ma...
Discount code (40%): serranoyt
(Use the discount code at checkout)

