LLM Tokenizers Explained: BPE Encoding, WordPiece and SentencePiece

Published: 03 March 2024
on channel: DataMListic
Views: 6,511
Likes: 241

In this video, we talk about three tokenizers that are commonly used when training large language models: (1) the byte-pair encoding (BPE) tokenizer, (2) the WordPiece tokenizer, and (3) the SentencePiece tokenizer.

References
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
BPE tokenizer paper: https://arxiv.org/abs/1508.07909
WordPiece tokenizer paper: https://static.googleusercontent.com/...
SentencePiece tokenizer paper: https://arxiv.org/abs/1808.06226

Related Videos
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
Why Language Models Hallucinate
Grounding DINO, Open-Set Object Detection
Detection Transformers (DETR), Object Queries
Wav2vec2 A Framework for Self-Supervised Learning of Speech Representations - Paper Explained
Transformer Self-Attention Mechanism Explained
How to Fine-tune Large Language Models Like ChatGPT with Low-Rank Adaptation (LoRA)
Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped Query Attention (GQA) Explained
LLM Prompt Engineering with Random Sampling: Temperature, Top-k, Top-p

Contents
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
00:00 - Intro
00:32 - BPE Encoding
02:16 - Wordpiece
03:45 - Sentencepiece
04:52 - Outro

Follow Me
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
🐦 Twitter: @datamlistic
📸 Instagram: @datamlistic
📱 TikTok: @datamlistic

Channel Support
▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬
The best way to support the channel is to share the content. ;)

If you'd like to also support the channel financially, donating the price of a coffee is always warmly welcomed! (completely optional and voluntary)
► Patreon: datamlistic
► Bitcoin (BTC): 3C6Pkzyb5CjAUYrJxmpCaaNPVRgRVxxyTq
► Ethereum (ETH): 0x9Ac4eB94386C3e02b96599C05B7a8C71773c9281
► Cardano (ADA): addr1v95rfxlslfzkvd8sr3exkh7st4qmgj4ywf5zcaxgqgdyunsj5juw5
► Tether (USDT): 0xeC261d9b2EE4B6997a6a424067af165BAA4afE1a

#tokenization #llm #wordpiece #sentencepiece

