Python code (TF2/JupyterLab) to train your own Byte-Pair Encoding (BPE) tokenizer:
a. Start with all the characters present in the training corpus as tokens.
b. Identify the most common adjacent pair of tokens and merge them into one token.
c. Repeat until the vocabulary (i.e., the number of tokens) has reached the size you want; see the sketch below.
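To make steps a-c concrete, here is a minimal pure-Python sketch of the merge loop. This is a toy illustration only; the function name, helper structure, and toy corpus are my own assumptions, not code from the video:

from collections import Counter

def bpe_train(words, vocab_size):
    # words maps each word, given as a tuple of symbols, to its corpus frequency
    vocab = {ch for word in words for ch in word}   # (a) start from the characters
    while len(vocab) < vocab_size:
        pairs = Counter()                           # (b) count adjacent token pairs
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break                                   # nothing left to merge
        best = max(pairs, key=pairs.get)            # most frequent adjacent pair
        merged = best[0] + best[1]
        vocab.add(merged)
        new_words = {}
        for word, freq in words.items():            # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words                           # (c) repeat until target size
    return vocab

# Toy corpus: word (as symbol tuple) -> frequency
corpus = {("l","o","w"): 5, ("l","o","w","e","r"): 2, ("n","e","w","e","s","t"): 6}
print(sorted(bpe_train(corpus, vocab_size=12)))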
Training a tokenizer is not (!) the same as training a DL model: it is a deterministic, count-based procedure, not gradient descent. The core training call, using HuggingFace's tokenizers library:

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer()           # configure vocab_size, special_tokens, etc. here
tokenizer.train(files, trainer)  # files: list of paths to training text files
Here we cover the special case of Byte-Pair Encoding (BPE) from HuggingFace's Tokenizers library. Browse downloadable models, tokenizers, and datasets at: https://huggingface.co/models
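For completeness, a self-contained sketch of the full workflow covered in the video chapters (build, train, use, encode a batch). The file name corpus.txt, vocab_size=5000, and the Whitespace pre-tokenizer are assumptions for illustration, not values from the video:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an empty BPE model; "[UNK]" covers symbols never seen during training.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Hypothetical corpus file and vocabulary size -- adjust to your data.
files = ["corpus.txt"]
trainer = BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])

tokenizer.train(files, trainer)
tokenizer.save("bpe-tokenizer.json")

# Use the newly trained tokenizer, on a single string or on a batch.
print(tokenizer.encode("Hello, tokenizers!").tokens)
batch = tokenizer.encode_batch(["first sentence", "second sentence"])
print([enc.ids for enc in batch])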
#Tokenizer
#HuggingFace
#BPE
00:00 Code my optimized BPE Tokenizer
03:13 BPE model and trainer
04:36 Train a new Tokenizer
05:38 Use newly constructed Tokenizer
07:55 Encode batch
09:44 Summary
Video by Discover AI, published 27 January 2022.